Commit Graph

7449 Commits

Author SHA1 Message Date
Yangtao Li
990f61e43c dm mirror: add DMERR message if alloc_workqueue fails
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-11 12:01:01 -04:00
Yangtao Li
b362c733ed dm: push error reporting down to dm_register_target()
Simplifies each DM target's init method by making dm_register_target()
responsible for its error reporting (on behalf of targets).

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-11 12:01:01 -04:00
Mike Snitzer
6b79a428c0 dm integrity: call kmem_cache_destroy() in dm_integrity_init() error path
Otherwise the journal_io_cache will leak if dm_register_target() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:39:25 -04:00
Mike Snitzer
6827af4a9a dm clone: call kmem_cache_destroy() in dm_clone_init() error path
Otherwise the _hydration_cache will leak if dm_register_target() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:17 -04:00
Mikulas Patocka
b6bcb84446 dm error: add discard support
Add io_err_io_hints() and set discard limits so that the zero target
advertises support for discards.

The error target will return -EIO for discards.

This is useful when the user combines dm-error with other
discard-supporting targets in the same table; without dm-error
support, discards would be disabled for the whole combined device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:17 -04:00
Mikulas Patocka
00065f925e dm zero: add discard support
Add zero_io_hints() and set discard limits so that the zero target
advertises support for discards.

The zero target will ignore discards.

This is useful when the user combines dm-zero with other
discard-supporting targets in the same table; without dm-zero support,
discards would be disabled for the whole combined device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:17 -04:00
Mikulas Patocka
85c938e891 dm table: allow targets without devices to set ->io_hints
In dm_calculate_queue_limits, add call to ->io_hints hook if the
target doesn't provide ->iterate_devices.

This is needed so the "error" and "zero" targets may support
discards. The 2 following commits will add their respective discard
support.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:17 -04:00
Michael Weiß
074c44664f dm verity: emit audit events on verification failure and more
dm-verity signals integrity violations by returning I/O errors
to user space. To identify integrity violations by a controlling
instance, the kernel audit subsystem can be used to emit audit
events to user space. Analogous to dm-integrity, we also use the
dm-audit submodule allowing to emit audit events on verification
failures of metadata and data blocks as well as if max corrupted
errors are reached.

The construction and destruction of verity device mappings are
also relevant for auditing a system. Thus, those events are also
logged as audit events.

Tested by starting a container with the container manager (cmld) of
GyroidOS which uses a dm-verity protected rootfs image root.img mapped
to /dev/mapper/<uuid>-root. One block was manipulated in the
underlying image file and repeated reads of the verity device were
performed again until the max corrupted errors is reached, e.g.:

  dd if=/dev/urandom of=root.img bs=512 count=1 seek=1000
  for i in range {1..101}; do \
    dd if=/dev/mapper/<uuid>-root of=/dev/null bs=4096 \
       count=1 skip=1000 \
  done

The resulting audit log looks as follows:

  type=DM_CTRL msg=audit(1677618791.876:962):
    module=verity op=ctr ppid=4876 pid=29102 auid=0 uid=0 gid=0
    euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=44
    comm="cmld" exe="/usr/sbin/cml/cmld" subj=unconfined
    dev=254:3 error_msg='success' res=1

  type=DM_EVENT msg=audit(1677619463.786:1074): module=verity
    op=verify-data dev=7:0 sector=1000 res=0
  ...
  type=DM_EVENT msg=audit(1677619596.727:1162): module=verity
    op=verify-data dev=7:0 sector=1000 res=0

  type=DM_EVENT msg=audit(1677619596.731:1163): module=verity
    op=max-corrupted-errors dev=254:3 sector=? res=0

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:05 -04:00
Yeongjin Gil
e8c5d45f82 dm verity: fix error handling for check_at_most_once on FEC
In verity_end_io(), if bi_status is not BLK_STS_OK, it can be return
directly. But if FEC configured, it is desired to correct the data page
through verity_verify_io. And the return value will be converted to
blk_status and passed to verity_finish_io().

BTW, when a bit is set in v->validated_blocks, verity_verify_io() skips
verification regardless of I/O error for the corresponding bio. In this
case, the I/O error could not be returned properly, and as a result,
there is a problem that abnormal data could be read for the
corresponding block.

To fix this problem, when an I/O error occurs, do not skip verification
even if the bit related is set in v->validated_blocks.

Fixes: 843f38d382 ("dm verity: add 'check_at_most_once' option to only validate hashes once")
Cc: stable@vger.kernel.org
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Yeongjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 11:03:39 -04:00
Joe Thornber
363b7fd76c dm: improve hash_locks sizing and hash function
Both bufio and bio-prison-v1 use the identical model for splitting
their respective locks and rbtrees. Improve dm_num_hash_locks() to
distribute across more rbtrees to improve overall performance -- but
the maximum number of locks/rbtrees is still 64.

Also factor out a common hash function named dm_hash_locks_index(),
the magic numbers used were determined to be best using this program:
 https://gist.github.com/jthornber/e05c47daa7b500c56dc339269c5467fc

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
b6279f82eb dm bio prison v1: intelligently size dm_bio_prison's prison_regions
Size the dm_bio_prison's number of prison_region structs using
dm_num_hash_locks().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
c6273411d1 dm bio prison v1: prepare to intelligently size dm_bio_prison's prison_regions
Add num_locks member to dm_bio_prison struct and use it rather than
the NR_LOCKS magic value (64).

Next commit will size the dm_bio_prison's prison_regions according to
dm_num_hash_locks().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
1e84c4b732 dm bufio: intelligently size dm_buffer_cache's buffer_trees
Size the dm_buffer_cache's number of buffer_tree structs using
dm_num_hash_locks().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
36c18b8639 dm bufio: prepare to intelligently size dm_buffer_cache's buffer_trees
Add num_locks member to dm_buffer_cache struct and use it rather than
the NR_LOCKS magic value (64).

Next commit will size the dm_buffer_cache's buffer_trees according to
dm_num_hash_locks().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
0bac3f2f28 dm: add dm_num_hash_locks()
Simple helper to use when DM core code needs to appropriately size,
based on num_online_cpus(), its data structures that split locks.

dm_num_hash_locks() rounds up num_online_cpus() to next power of 2
but caps return at DM_HASH_LOCKS_MAX (64).

This heuristic may evolve as warranted, but as-is it will serve as a
more informed basis for sizing the sharded lock structs in dm-bufio's
dm_buffer_cache (buffer_trees) and dm-bio-prison-v1's dm_bio_prison
(prison_regions).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
3f8d3f5432 dm bio prison v1: add dm_cell_key_has_valid_range
Don't have bio_detain() BUG_ON if a dm_cell_key is beyond
BIO_PRISON_MAX_RANGE or spans a boundary.

Update dm-thin.c:build_key() to use dm_cell_key_has_valid_range() which
will do this checking without using BUG_ON. Also update
process_discard_bio() to check the discard bio that DM core passes in
(having first imposed max_discard_granularity based splitting).

dm_cell_key_has_valid_range() will merely WARN_ON_ONCE if it returns
false because if it does: it is programmer error that should be caught
with proper testing. So relax the BUG_ONs to be WARN_ON_ONCE.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Joe Thornber
e2dd8aca2d dm bio prison v1: improve concurrent IO performance
Split the bio prison into multiple regions, with a separate rbtree and
associated lock for each region.

To get fast bio prison locking and not damage the performance of
discards too much the bio-prison now stipulates that discards should
not cross a BIO_PRISON_MAX_RANGE boundary.

Because the range of a key (block_end - block_begin) must not exceed
BIO_PRISON_MAX_RANGE: break_up_discard_bio() now ensures the data
range reflected in PHYSICAL key doesn't exceed BIO_PRISON_MAX_RANGE.
And splitting the thin target's discards (handled with VIRTUAL key) is
achieved by updating dm-thin.c to set limits->max_discard_sectors in
terms of BIO_PRISON_MAX_RANGE _and_ setting the thin and thin-pool
targets' max_discard_granularity to true.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
06961c487a dm: split discards further if target sets max_discard_granularity
The block core (bio_split_discard) will already split discards based
on the 'discard_granularity' and 'max_discard_sectors' queue_limits.
But the DM thin target also needs to ensure that it doesn't receive a
discard that spans a 'max_discard_sectors' boundary.

Introduce a dm_target 'max_discard_granularity' flag that if set will
cause DM core to split discard bios relative to 'max_discard_sectors'.
This treats 'discard_granularity' as a "min_discard_granularity" and
'max_discard_sectors' as a "max_discard_granularity".

Requested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
bb46c56165 dm thin: speed up cell_defer_no_holder()
Reduce the time that a spinlock is held in cell_defer_no_holder().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mikulas Patocka
56c5de4406 dm bufio: use multi-page bio vector
The kernel supports multi page bio vector entries, so we can use them
in dm-bufio as an optimization.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mikulas Patocka
f5f9354120 dm bufio: use waitqueue_active in __free_buffer_wake
Save one spinlock by using waitqueue_active. We hold the bufio lock at
this place, so no one can add entries to the waitqueue at this point.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
530f683ddc dm bufio: move dm_bufio_client members to avoid spanning cachelines
Movement also consolidates holes in dm_bufio_client struct. But the
overall size of the struct isn't changed.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
791188065b dm bufio: add lock_history optimization for cache iterators
Sometimes it is beneficial to repeatedly get and drop locks as part of
an iteration.  Introduce lock_history struct to help avoid redundant
drop and gets of the same lock.

Optimizes cache_iterate, cache_mark_many and cache_evict.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
450e8dee51 dm bufio: improve concurrent IO performance
When multiple threads perform IO to a thin device, the underlying
dm_bufio object can become a bottleneck; slowing down access to btree
nodes that store the thin metadata. Prior to this commit, each bufio
instance had a single mutex that was taken for every bufio operation.

This commit concentrates on improving the common case where: a user of
dm_bufio wishes to access, but not modify, a buffer which is already
within the dm_bufio cache.

Implementation::

  The code has been refactored; pulling out an 'lru' abstraction and a
  'buffer cache' abstraction (see 2 previous commits). This commit
  updates higher level bufio code (that performs allocation of buffers,
  IO and eviction/cache sizing) to leverage both abstractions. It also
  deals with the delicate locking requirements of both abstractions to
  provide finer grained locking. The result is significantly better
  concurrent IO performance.

  Before this commit, bufio has a global lru list it used to evict the
  oldest, clean buffers from _all_ clients. With the new locking we
  don’t want different ways to access the same buffer, so instead
  do_global_cleanup() loops around the clients asking them to free
  buffers older than a certain time.

  This commit also converts many old BUG_ONs to WARN_ON_ONCE, see the
  lru_evict and cache_evict code in particular.  They will return
  ER_DONT_EVICT if a given buffer somehow meets the invariants that
  should _never_ happen. [Aside from revising this commit's header and
  fixing coding style and whitespace nits: this switching to
  WARN_ON_ONCE is Mike Snitzer's lone contribution to this commit]

Testing::

  Some of the low level functions have been unit tested using dm-unit:
    https://github.com/jthornber/dm-unit/blob/main/src/tests/bufio.rs

  Higher level concurrency and IO is tested via a test only target
  found here:
    https://github.com/jthornber/linux/blob/2023-03-24-thin-concurrency-9/drivers/md/dm-bufio-test.c

  The associated userland side of these tests is here:
    https://github.com/jthornber/dmtest-python/blob/main/src/dmtest/bufio/bufio_tests.py

  In addition the full dmtest suite of tests (dm-thin, dm-cache, etc)
  has been run (~450 tests).

Performance::

  Most bufio operations have unchanged performance. But if multiple
  threads are attempting to get buffers concurrently, and these
  buffers are already in the cache then there's a big speed up. Eg,
  one test has 16 'hotspot' threads simulating btree lookups while
  another thread dirties the whole device. In this case the hotspot
  threads acquire the buffers about 25 times faster.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
2cd7a6d41f dm bufio: add dm_buffer_cache abstraction
The buffer cache is responsible for managing the holder count,
tracking clean/dirty state, and choosing buffers via predicates.
Higher level code is responsible for allocation of buffers, IO and
eviction/cache sizing.

The buffer cache has thread safe methods for acquiring a reference
to an existing buffer. All other methods in buffer cache are _not_
threadsafe, and only contain enough locking to guarantee the safe
methods.

Rather than a single mutex, sharded rw_semaphores are used to allow
concurrent threads to 'get' buffers. Each rw_semaphore protects its
own rbtree of buffer entries.

Code that uses this new dm_buffer_cache abstraction will be introduced
in a following commit.

This commit moves the dm_buffer struct in preparation for finer grained
dm_buffer changes, in the next commit, to be more easily seen. It also
introduces temporary dm_buffer struct members to allow compilation of
this intermediate commit (they will be elided in the next commit).

This commit will cause "defined but not used" compiler warnings that
will be resolved by the next commit.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
be845babda dm bufio: add LRU abstraction
A CLOCK algorithm is used in this LRU abstraction.  This avoids
relinking list nodes, which would require a write lock protecting it.

None of the LRU methods are threadsafe; locking must be done at a
higher level.

Code that uses this new LRU will be introduced in the next 2 commits.

As such, this commit will cause "defined but not used" compiler warnings
that will be resolved by the next 2 commits.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
b75a80f4f5 dm bufio: don't bug for clear developer oversight
Reasonable to relax to WARN_ON because these are easily avoided but do
offer some assurance future coding mistakes won't occur (if changes
tested properly).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
0511228752 dm bufio: never crash if dm_bufio_in_request()
All these instances are entirely avoidable given that they speak to
coding mistakes that result in inappropriate use. Proper testing during
development will catch any such coding bug so its best to relax all of
these from BUG_ON to WARN_ON_ONCE.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
555977dd68 dm bufio: use WARN_ON in dm_bufio_client_destroy and dm_bufio_exit
Using BUG_ON when tearing down is excessive. Relax these to WARN_ONs.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
96a2ff2a63 dm bufio: remove unused dm_bufio_release_move interface
Was used by multi-snapshot DM target that never went upstream.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
666eed4676 dm: fix __send_duplicate_bios() to always allow for splitting IO
Commit 7dd76d1fee ("dm: improve bio splitting and associated IO
accounting") only called setup_split_accounting() from
__send_duplicate_bios() if a single bio were being issued. But the case
where duplicate bios are issued must call it too.

Otherwise the bio won't be split and resubmitted (via recursion through
block core back to DM) to submit the later portions of a bio (which may
map to an entirely different target).

For example, when discarding an entire DM striped device with the
following DM table:
 vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
 vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048

Before (broken, discards the first striped target's devices twice):
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2049 len=22528
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=22528

After (works as expected):
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
 device-mapper: striped: target_stripe=0, bdev=7:2, start=2048 len=22528
 device-mapper: striped: target_stripe=1, bdev=7:3, start=2048 len=22528

Fixes: 7dd76d1fee ("dm: improve bio splitting and associated IO accounting")
Cc: stable@vger.kernel.org
Reported-by: Orange Kao <orange@aiven.io>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:54:32 -04:00
Mike Snitzer
f7b58a69fa dm: fix improper splitting for abnormal bios
"Abnormal" bios include discards, write zeroes and secure erase. By no
longer passing the calculated 'len' pointer, commit 7dd06a2548 ("dm:
allow dm_accept_partial_bio() for dm_io without duplicate bios") took a
senseless approach to disallowing dm_accept_partial_bio() from working
for duplicate bios processed using __send_duplicate_bios().

It inadvertently and incorrectly stopped the use of 'len' when
initializing a target's io (in alloc_tio). As such the resulting tio
could address more area of a device than it should.

For example, when discarding an entire DM striped device with the
following DM table:
 vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
 vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048

Before this fix:

 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=102400
 blkdiscard: attempt to access beyond end of device
 loop0: rw=2051, sector=2048, nr_sectors = 102400 limit=81920

 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=102400
 blkdiscard: attempt to access beyond end of device
 loop1: rw=2051, sector=2048, nr_sectors = 102400 limit=81920

After this fix;

 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872

Fixes: 7dd06a2548 ("dm: allow dm_accept_partial_bio() for dm_io without duplicate bios")
Cc: stable@vger.kernel.org
Reported-by: Orange Kao <orange@aiven.io>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:54:32 -04:00
Linus Torvalds
5ad4fe9613 - Fix DM thin to work as a swap device by using 'limit_swap_bios' DM
target flag (initially added to allow swap to dm-crypt) to throttle
   the amount of outstanding swap bios.
 
 - Fix DM crypt soft lockup warnings by calling cond_resched() from the
   cpu intensive loop in dmcrypt_write().
 
 - Fix DM crypt to not access an uninitialized tasklet. This fix allows
   for consistent handling of IO completion, by _not_ needlessly punting
   to a workqueue when tasklets are not needed.
 
 - Fix DM core's alloc_dev() initialization for DM stats to check for
   and propagate alloc_percpu() failure.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmQd2p0ACgkQxSPxCi2d
 A1qvXAf/WCMNXRbFhO35QqukBqS7sUOMfWl1hIEdABRu+3Ul1KHBWzXVYWuWgebw
 kr79V3LZG63cLvhreCy64X/0tXLZa0c0AGWZI6rJ/QAozSCs9R8BqOrJnB5GT1o9
 /lvmOL31MloMnIKArWseIQViNM97gEHmFpuj0saqitcvNTjjipzxq/wOyhmDQwnE
 8rxJpKSHBJXs9X/VyM9FTWxtijTQw3c8wxJJo7eV6TTuLyrErm46tyI1cBQ4vDoa
 ogMVWVrf51uTsqL6DqGenDc+kO7CH5lipIJij1bTtKgs3aBNlaiZQC1nPkMST9Ue
 hpH61ixAg+bsWi4/xLFafCl6QAGMlA==
 =71ya
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM thin to work as a swap device by using 'limit_swap_bios' DM
   target flag (initially added to allow swap to dm-crypt) to throttle
   the amount of outstanding swap bios.

 - Fix DM crypt soft lockup warnings by calling cond_resched() from the
   cpu intensive loop in dmcrypt_write().

 - Fix DM crypt to not access an uninitialized tasklet. This fix allows
   for consistent handling of IO completion, by _not_ needlessly punting
   to a workqueue when tasklets are not needed.

 - Fix DM core's alloc_dev() initialization for DM stats to check for
   and propagate alloc_percpu() failure.

* tag 'for-6.3/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm stats: check for and propagate alloc_percpu failure
  dm crypt: avoid accessing uninitialized tasklet
  dm crypt: add cond_resched() to dmcrypt_write()
  dm thin: fix deadlock when swapping to thin device
2023-03-24 14:20:48 -07:00
Jiasheng Jiang
d3aa3e060c dm stats: check for and propagate alloc_percpu failure
Check alloc_precpu()'s return value and return an error from
dm_stats_init() if it fails. Update alloc_dev() to fail if
dm_stats_init() does.

Otherwise, a NULL pointer dereference will occur in dm_stats_cleanup()
even if dm-stats isn't being actively used.

Fixes: fd2ed4d252 ("dm: add statistics support")
Cc: stable@vger.kernel.org
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-16 13:37:06 -04:00
Jens Axboe
23e5b9307e Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.3
Pull MD fixes from Song:

"This set contains two fixes for old issues (by Neil) and one fix
 for 6.3 (by Xiao)."

* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: select BLOCK_LEGACY_AUTOLOAD
  md: avoid signed overflow in slot_store()
  md: Free resources in __md_stop
2023-03-15 12:18:07 -06:00
NeilBrown
6c0f589883 md: select BLOCK_LEGACY_AUTOLOAD
When BLOCK_LEGACY_AUTOLOAD is not enable, mdadm is not able to
activate new arrays unless "CREATE names=yes" appears in
mdadm.conf

As this is a regression we need to always enable BLOCK_LEGACY_AUTOLOAD
for when MD is selected - at least until mdadm is updated and the
updates widely available.

Cc: stable@vger.kernel.org # v5.18+
Fixes: fbdee71bb5 ("block: deprecate autoloading based on dev_t")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
2023-03-15 11:12:14 -07:00
Yu Kuai
5f27571382 block: count 'ios' and 'sectors' when io is done for bio-based device
While using iostat for raid, I observed very strange 'await'
occasionally, and turns out it's due to that 'ios' and 'sectors' is
counted in bdev_start_io_acct(), while 'nsecs' is counted in
bdev_end_io_acct(). I'm not sure why they are ccounted like that
but I think this behaviour is obviously wrong because user will get
wrong disk stats.

Fix the problem by counting 'ios' and 'sectors' when io is done, like
what rq-based device does.

Fixes: 394ffa503b ("blk: introduce generic io stat accounting help function")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230223091226.1135678-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-15 09:25:04 -06:00
NeilBrown
3bc5729227 md: avoid signed overflow in slot_store()
slot_store() uses kstrtouint() to get a slot number, but stores the
result in an "int" variable (by casting a pointer).
This can result in a negative slot number if the unsigned int value is
very large.

A negative number means that the slot is empty, but setting a negative
slot number this way will not remove the device from the array.  I don't
think this is a serious problem, but it could cause confusion and it is
best to fix it.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
2023-03-13 12:50:54 -07:00
Xiao Ni
3e45352259 md: Free resources in __md_stop
If md_run() fails after ->active_io is initialized, then percpu_ref_exit
is called in error path. However, later md_free_disk will call
percpu_ref_exit again which leads to a panic because of null pointer
dereference. It can also trigger this bug when resources are initialized
but are freed in error path, then will be freed again in md_free_disk.

BUG: kernel NULL pointer dereference, address: 0000000000000038
Oops: 0000 [#1] PREEMPT SMP
Workqueue: md_misc mddev_delayed_delete
RIP: 0010:free_percpu+0x110/0x630
Call Trace:
 <TASK>
 __percpu_ref_exit+0x44/0x70
 percpu_ref_exit+0x16/0x90
 md_free_disk+0x2f/0x80
 disk_release+0x101/0x180
 device_release+0x84/0x110
 kobject_put+0x12a/0x380
 kobject_put+0x160/0x380
 mddev_delayed_delete+0x19/0x30
 process_one_work+0x269/0x680
 worker_thread+0x266/0x640
 kthread+0x151/0x1b0
 ret_from_fork+0x1f/0x30

For creating raid device, md raid calls do_md_run->md_run, dm raid calls
md_run. We alloc those memory in md_run. For stopping raid device, md raid
calls do_md_stop->__md_stop, dm raid calls md_stop->__md_stop. So we can
free those memory resources in __md_stop.

Fixes: 72adae23a7 ("md: Change active_io to percpu")
Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
2023-03-13 10:56:54 -07:00
Mike Snitzer
d9a02e016a dm crypt: avoid accessing uninitialized tasklet
When neither "no_read_workqueue" nor "no_write_workqueue" are enabled,
tasklet_trylock() in crypt_dec_pending() may still return false due to
an uninitialized state, and dm-crypt will unnecessarily do io completion
in io_queue workqueue instead of current context.

Fix this by adding an 'in_tasklet' flag to dm_crypt_io struct and
initialize it to false in crypt_io_init(). Set this flag to true in
kcryptd_queue_crypt() before calling tasklet_schedule(). If set
crypt_dec_pending() will punt io completion to a workqueue.

This also nicely avoids the tasklet_trylock/unlock hack when tasklets
aren't in use.

Fixes: 8e14f61015 ("dm crypt: do not call bio_endio() from the dm-crypt tasklet")
Cc: stable@vger.kernel.org
Reported-by: Hou Tao <houtao1@huawei.com>
Suggested-by: Ignat Korchagin <ignat@cloudflare.com>
Reviewed-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-09 10:04:38 -05:00
Mikulas Patocka
fb294b1c0b dm crypt: add cond_resched() to dmcrypt_write()
The loop in dmcrypt_write may be running for unbounded amount of time,
thus we need cond_resched() in it.

This commit fixes the following warning:

[ 3391.153255][   C12] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [dmcrypt_write/2:2897]
...
[ 3391.387210][   C12] Call trace:
[ 3391.390338][   C12]  blk_attempt_bio_merge.part.6+0x38/0x158
[ 3391.395970][   C12]  blk_attempt_plug_merge+0xc0/0x1b0
[ 3391.401085][   C12]  blk_mq_submit_bio+0x398/0x550
[ 3391.405856][   C12]  submit_bio_noacct+0x308/0x380
[ 3391.410630][   C12]  dmcrypt_write+0x1e4/0x208 [dm_crypt]
[ 3391.416005][   C12]  kthread+0x130/0x138
[ 3391.419911][   C12]  ret_from_fork+0x10/0x18

Reported-by: yangerkun <yangerkun@huawei.com>
Fixes: dc2676210c ("dm crypt: offload writes to thread")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-06 16:50:12 -05:00
Coly Li
9bbf5feecc dm thin: fix deadlock when swapping to thin device
This is an already known issue that dm-thin volume cannot be used as
swap, otherwise a deadlock may happen when dm-thin internal memory
demand triggers swap I/O on the dm-thin volume itself.

But thanks to commit a666e5c05e ("dm: fix deadlock when swapping to
encrypted device"), the limit_swap_bios target flag can also be used
for dm-thin to avoid the recursive I/O when it is used as swap.

Fix is to simply set ti->limit_swap_bios to true in both pool_ctr()
and thin_ctr().

In my test, I create a dm-thin volume /dev/vg/swap and use it as swap
device. Then I run fio on another dm-thin volume /dev/vg/main and use
large --blocksize to trigger swap I/O onto /dev/vg/swap.

The following fio command line is used in my test,
  fio --name recursive-swap-io --lockmem 1 --iodepth 128 \
     --ioengine libaio --filename /dev/vg/main --rw randrw \
    --blocksize 1M --numjobs 32 --time_based --runtime=12h

Without this fix, the whole system can be locked up within 15 seconds.

With this fix, there is no any deadlock or hung task observed after
2 hours of running fio.

Furthermore, if blocksize is changed from 1M to 128M, after around 30
seconds fio has no visible I/O, and the out-of-memory killer message
shows up in kernel message. After around 20 minutes all fio processes
are killed and the whole system is back to being alive.

This is exactly what is expected when recursive I/O happens on dm-thin
volume when it is used as swap.

Depends-on: a666e5c05e ("dm: fix deadlock when swapping to encrypted device")
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-06 16:37:06 -05:00
Linus Torvalds
472a2abb7a flexible-array transformations for 6.3-rc1
Hi Linus,
 
 Please, pull the following patches that transform zero-length arrays,
 in unions, into flexible arrays. These patches have been baking in
 linux-next for the whole development cycle.
 
 Thanks
 --
 Gustavo
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmP35ckACgkQRwW0y0cG
 2zGFrA/+Pdn6woNEEU8g+iIIK6tkyoqvbTjqLFr3x+EEGCPDenWn0o3exvZJjAp/
 sDjUK9//sBzFEEHMLIu0+C+3dGSlIpmanQlDf/5unOYDUMogSu9gC7Mj9B4jDWPh
 0ltEKXEezxLwun41LBmeFEk8Ot7QLnC2CMIC0KLYfQodoGYbCDKnaEVaXrxIPctS
 mhwt98CllFbEFvtQyxRm+CxfQT8UkDL8mxRn+x1BoEO/xIKt5MOC7g31kWD31pm1
 SNJ2Nbt++MthcJMRP33Q9dJxAtLW4ckooJJm62QmsZoZOpDGBdOO6QBfIAd+KvGm
 AOFAOeYJwYCcG6VWibkFHcxy95ZGfuDek3wYn/PcoGpXPeT+La+eH/KDq5Ll+vSo
 2pwaDdHb3uJItBsc7sxXROLJME/6cV+1pt3xcK3dqHlgb26MBVkut/B17Em/ig1K
 AjI35+pZwxoB/nc8dEPIVk/yGqa9sAGqqKCoP56mUJu3GiqcOJCByU8q0lkdN992
 4y8w+IKSmegUvQD/MZs6GQ5DseYbdQadwW/5vIbN4N9d+R/6tJvwXFafMYlXJidu
 qZ2ilGqZzNjj3XUo+xXiwTeV9vQFW0TALm/OMmW7tdwhpG76RPMDdmZ5axPHv94o
 3+gp7ENo4zOKVRjuf4R5Uqu/Ijto9k9eugNhD+Z1ekSofyriID0=
 =v4oD
 -----END PGP SIGNATURE-----

Merge tag 'flex-array-transformations-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull flexible-array updates from Gustavo Silva:
 "Transform zero-length arrays, in unions, into flexible arrays"

* tag 'flex-array-transformations-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  bcache: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helper
  mm/memremap: Replace zero-length array with DECLARE_FLEX_ARRAY() helper
  exportfs: Replace zero-length array with DECLARE_FLEX_ARRAY() helper
2023-02-25 12:53:42 -08:00
Linus Torvalds
f0b2769a01 - Fix DM cache target to free background tracker work items, otherwise
slab BUG will occur when kmem_cache_destroy() is called.
 
 - Improve 2 of DM's shrinker names to reflect their use.
 
 - Fix the DM flakey target to not corrupt the zero page. Fix dm-flakey
   on 32-bit hughmem systems by using  bvec_kmap_local instead of
   page_address. Also, fix logic used when imposing the
   "corrupt_bio_byte" feature.
 
 - Stop using WQ_UNBOUND for DM verity target's verify_wq because it
   causes significant Android latencies on ARM64 (and doesn't show real
   benefit on other architectures).
 
 - Add negative check to catch simple case of a DM table referencing
   itself. More complex scenarios that use intermediate devices to
   self-reference still need to be avoided/handled in userspace.
 
 - Fix DM core's resize to only send one uevent instead of two. This
   fixes a race with udev, that if udev wins, will cause udev to miss
   uevents (which caused premature unmount attempts by systemd).
 
 - Add cond_resched() to workqueue functions in DM core, dn-thin and
   dm-cache so that their loops aren't the cause of unintended cpu
   scheduling fairness issues.
 
 - Fix all of DM's checkpatch errors and warnings (famous last words).
   Various other small cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmPzrP4ACgkQxSPxCi2d
 A1quGQgArlqtlYTl3ese9Kxdpq5fta69v77IooF2gp7PJgRzQ624L7gTFaWZE38v
 9ib5FRgTe84Nm+H/x0TAJKgoWOhwen24w2G5KMXKOhIOJgXV6xBK0gXV7cQajr6e
 RPml8hL6e/1K1IbmGrPn1Mpg6tOlSUM273z8pL+E6IkzIFdU/pay3WN6fcjC5vsM
 a3y739KCeo2/fMTCSX5B4owSvwTm1rX/wF4QwdqhgcaHhEqddFmcvmHAn/p7kHxb
 WbAT58A5jP5SaRyWv1MLCb8pzOivI8WFxFw4l2Fs/opYTG9jLrmmTejJndWVEE1Q
 PFcjFv/L5sRhXGRfH8dqNEbhX9Lubw==
 =2o1v
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Fix DM cache target to free background tracker work items, otherwise
   slab BUG will occur when kmem_cache_destroy() is called.

 - Improve 2 of DM's shrinker names to reflect their use.

 - Fix the DM flakey target to not corrupt the zero page. Fix dm-flakey
   on 32-bit hughmem systems by using bvec_kmap_local instead of
   page_address. Also, fix logic used when imposing the
   "corrupt_bio_byte" feature.

 - Stop using WQ_UNBOUND for DM verity target's verify_wq because it
   causes significant Android latencies on ARM64 (and doesn't show real
   benefit on other architectures).

 - Add negative check to catch simple case of a DM table referencing
   itself. More complex scenarios that use intermediate devices to
   self-reference still need to be avoided/handled in userspace.

 - Fix DM core's resize to only send one uevent instead of two. This
   fixes a race with udev, that if udev wins, will cause udev to miss
   uevents (which caused premature unmount attempts by systemd).

 - Add cond_resched() to workqueue functions in DM core, dn-thin and
   dm-cache so that their loops aren't the cause of unintended cpu
   scheduling fairness issues.

 - Fix all of DM's checkpatch errors and warnings (famous last words).
   Various other small cleanups.

* tag 'for-6.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
  dm: remove unnecessary (void*) conversion in event_callback()
  dm ioctl: remove unnecessary check when using dm_get_mdptr()
  dm ioctl: assert _hash_lock is held in __hash_remove
  dm cache: add cond_resched() to various workqueue loops
  dm thin: add cond_resched() to various workqueue loops
  dm: add cond_resched() to dm_wq_requeue_work()
  dm: add cond_resched() to dm_wq_work()
  dm sysfs: make kobj_type structure constant
  dm: update targets using system workqueues to use a local workqueue
  dm: remove flush_scheduled_work() during local_exit()
  dm clone: prefer kvmalloc_array()
  dm: declare variables static when sensible
  dm: fix suspect indent whitespace
  dm ioctl: prefer strscpy() instead of strlcpy()
  dm: avoid void function return statements
  dm integrity: change macros min/max() -> min_t/max_t where appropriate
  dm: fix use of sizeof() macro
  dm: avoid 'do {} while(0)' loop in single statement macros
  dm log: avoid multiple line dereference
  dm log: avoid trailing semicolon in macro
  ...
2023-02-22 13:21:31 -08:00
Linus Torvalds
36289a03bc This update includes the following changes:
API:
 
 - Use kmap_local instead of kmap_atomic.
 - Change request callback to take void pointer.
 - Print FIPS status in /proc/crypto (when enabled).
 
 Algorithms:
 
 - Add rfc4106/gcm support on arm64.
 - Add ARIA AVX2/512 support on x86.
 
 Drivers:
 
 - Add TRNG driver for StarFive SoC.
 - Delete ux500/hash driver (subsumed by stm32/hash).
 - Add zlib support in qat.
 - Add RSA support in aspeed.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmPzAiwACgkQxycdCkmx
 i6et8xAAoO3w5MZFGXMzWsYhfSZFdceXBEQfDR7JOCdHxpMIQhw0FLlb0uttFk6m
 SeWrdP9wiifBDoCmw7qffFJml8ZftPL/XeXjob2d9v7jKbPyw3lDSIdsNfN/5EEL
 oIc9915zwrgawvahPAa+PQ4Ue03qRjUyOcV42dpd1W3NYhzDVHoK5OUU+mEFYDvx
 Sgw/YUugKf0VXkVDFzG5049+CPcheyRZqclAo9jyl2eZiXujgUyV33nxRCtqIA+t
 7jlHKwi+6QzFHY0CX5BvShR8xyEuH5MLoU3H/jYGXnRb3nEpRYAEO4VZchIHqF0F
 Y6pKIKc6Q8OyIVY8RsjQY3hioCqYnQFZ5Xtc1zGtOYEitVLbkmItMG0mVn0XOfyt
 gJDi6gkEw5uPUbEQdI4R1xEgJ8eCckMsOJ+uRxqTm+uLqNDxPbsB9bohKniMogXV
 lDlVXjU23AA9VeKtqU8FvWjfgqsN47X4aoq1j4/4aI7X9F7P9FOP21TZloP7+ssj
 PFrzNaRXUrMEsvyS1wqPegIh987lj6WkH4hyU0wjzaIq4IQELidHsSXFS12iWIPH
 kTEoC/trAVoYSr0zXKWUCs4h/x0FztVNbjs4KiDP2FLXX1RzeVZ0WlaXZhryHr+n
 1+8yCuS6tVofAbSX0wNkZdf0x5+3CIBw4kqSIvjKDPYYEfIDaT0=
 =dMYe
 -----END PGP SIGNATURE-----

Merge tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto update from Herbert Xu:
 "API:
   - Use kmap_local instead of kmap_atomic
   - Change request callback to take void pointer
   - Print FIPS status in /proc/crypto (when enabled)

  Algorithms:
   - Add rfc4106/gcm support on arm64
   - Add ARIA AVX2/512 support on x86

  Drivers:
   - Add TRNG driver for StarFive SoC
   - Delete ux500/hash driver (subsumed by stm32/hash)
   - Add zlib support in qat
   - Add RSA support in aspeed"

* tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (156 commits)
  crypto: x86/aria-avx - Do not use avx2 instructions
  crypto: aspeed - Fix modular aspeed-acry
  crypto: hisilicon/qm - fix coding style issues
  crypto: hisilicon/qm - update comments to match function
  crypto: hisilicon/qm - change function names
  crypto: hisilicon/qm - use min() instead of min_t()
  crypto: hisilicon/qm - remove some unused defines
  crypto: proc - Print fips status
  crypto: crypto4xx - Call dma_unmap_page when done
  crypto: octeontx2 - Fix objects shared between several modules
  crypto: nx - Fix sparse warnings
  crypto: ecc - Silence sparse warning
  tls: Pass rec instead of aead_req into tls_encrypt_done
  crypto: api - Remove completion function scaffolding
  tls: Remove completion function scaffolding
  tipc: Remove completion function scaffolding
  net: ipv6: Remove completion function scaffolding
  net: ipv4: Remove completion function scaffolding
  net: macsec: Remove completion function scaffolding
  dm: Remove completion function scaffolding
  ...
2023-02-21 18:10:50 -08:00
Linus Torvalds
8cc01d43f8 RCU pull request for v6.3
This pull request contains the following branches:
 
 doc.2023.01.05a: Documentation updates.
 
 fixes.2023.01.23a: Miscellaneous fixes, perhaps most notably:
 
 o	Throttling callback invocation based on the number of callbacks
 	that are now ready to invoke instead of on the total number
 	of callbacks.
 
 o	Several patches that suppress false-positive boot-time
 	diagnostics, for example, due to lockdep not yet being
 	initialized.
 
 o	Make expedited RCU CPU stall warnings dump stacks of any tasks
 	that are blocking the stalled grace period.  (Normal RCU CPU
 	stall warnings have doen this for mnay years.)
 
 o	Lazy-callback fixes to avoid delays during boot, suspend, and
 	resume.  (Note that lazy callbacks must be explicitly enabled,
 	so this should not (yet) affect production use cases.)
 
 kvfree.2023.01.03a: Cause kfree_rcu() and friends to take advantage of
 	polled grace periods, thus reducing memory footprint by almost
 	two orders of magnitude, admittedly on a microbenchmark.
 	This series also begins the transition from kfree_rcu(p) to
 	kfree_rcu_mightsleep(p).  This transition was motivated by bugs
 	where kfree_rcu(p), which can block, was typed instead of the
 	intended kfree_rcu(p, rh).
 
 srcu.2023.01.03a: SRCU updates, perhaps most notably fixing a bug that
 	causes SRCU to fail when booted on a system with a non-zero boot
 	CPU.  This surprising situation actually happens for kdump kernels
 	on the powerpc architecture.  It also adds an srcu_down_read()
 	and srcu_up_read(), which act like srcu_read_lock() and
 	srcu_read_unlock(), but allow an SRCU read-side critical section
 	to be handed off from one task to another.
 
 srcu-always.2023.02.02a: Cleans up the now-useless SRCU Kconfig option.
 	There are a few more commits that are not yet acked or pulled
 	into maintainer trees, and these will be in a pull request for
 	a later merge window.
 
 tasks.2023.01.03a: RCU-tasks updates, perhaps most notably these fixes:
 
 o	A strange interaction between PID-namespace unshare and the
 	RCU-tasks grace period that results in a low-probability but
 	very real hang.
 
 o	A race between an RCU tasks rude grace period on a single-CPU
 	system and CPU-hotplug addition of the second CPU that can result
 	in a too-short grace period.
 
 o	A race between shrinking RCU tasks down to a single callback list
 	and queuing a new callback to some other CPU, but where that
 	queuing is delayed for more than an RCU grace period.  This can
 	result in that callback being stranded on the non-boot CPU.
 
 torture.2023.01.05a: Torture-test updates and fixes.
 
 torturescript.2023.01.03a: Torture-test scripting updates and fixes.
 
 stall.2023.01.09a: Provide additional RCU CPU stall-warning information
 	in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and
 	restore the full five-minute timeout limit for expedited RCU
 	CPU stall warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPq29UTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jAhVEACEAKJY1VJ9IUqz7CwzAYkzgRJfiygh
 oDUXmlqtm6ew9pr2GdLUVCVsUSldzBc0K7Djb/G1niv4JPs+v7YwupIV33+UbStU
 Qxt6ztTdxc4lKospLm1+2vF9ZdzVEmiP4wVCc4iDarv5FM3FpWSTNc8+L7qmlC+X
 myjv+GqMTxkXZBvYJOgJGFjDwN8noTd7Fr3mCCVLFm3PXMDa7tcwD6HRP5AqD2N8
 qC5M6LEqepKVGmz0mYMLlSN1GPaqIsEcexIFEazRsPEivPh/iafyQCQ/cqxwhXmV
 vEt7u+dXGZT/oiDq9cJ+/XRDS2RyKIS6dUE14TiiHolDCn1ONESahfA/gXWKykC2
 BaGPfjWXrWv/hwbeZ+8xEdkAvTIV92tGpXir9Fby1Z5PjP3balvrnn6hs5AnQBJb
 NdhRPLzy/dCnEF+CweAYYm1qvTo8cd5nyiNwBZHn7rEAIu3Axrecag1rhFl3AJ07
 cpVMQXZtkQVa2X8aIRTUC+ijX6yIqNaHlu0HqNXgIUTDzL4nv5cMjOMzpNQP9/dZ
 FwAMZYNiOk9IlMiKJ8ZiVcxeiA8ouIBlkYM3k6vGrmiONZ7a/EV/mSHoJqI8bvqr
 AxUIJ2Ayhg3bxPboL5oKgCiLql0A7ZVvz6quX6McitWGMgaSvel1fDzT3TnZd41e
 4AFBFd/+VedUGg==
 =bBYK
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes, perhaps most notably:

      - Throttling callback invocation based on the number of callbacks
        that are now ready to invoke instead of on the total number of
        callbacks

      - Several patches that suppress false-positive boot-time
        diagnostics, for example, due to lockdep not yet being
        initialized

      - Make expedited RCU CPU stall warnings dump stacks of any tasks
        that are blocking the stalled grace period. (Normal RCU CPU
        stall warnings have done this for many years)

      - Lazy-callback fixes to avoid delays during boot, suspend, and
        resume. (Note that lazy callbacks must be explicitly enabled, so
        this should not (yet) affect production use cases)

 - Make kfree_rcu() and friends take advantage of polled grace periods,
   thus reducing memory footprint by almost two orders of magnitude,
   admittedly on a microbenchmark

   This also begins the transition from kfree_rcu(p) to
   kfree_rcu_mightsleep(p). This transition was motivated by bugs where
   kfree_rcu(p), which can block, was typed instead of the intended
   kfree_rcu(p, rh)

 - SRCU updates, perhaps most notably fixing a bug that causes SRCU to
   fail when booted on a system with a non-zero boot CPU. This
   surprising situation actually happens for kdump kernels on the
   powerpc architecture

   This also adds an srcu_down_read() and srcu_up_read(), which act like
   srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side
   critical section to be handed off from one task to another

 - Clean up the now-useless SRCU Kconfig option

   There are a few more commits that are not yet acked or pulled into
   maintainer trees, and these will be in a pull request for a later
   merge window

 - RCU-tasks updates, perhaps most notably these fixes:

      - A strange interaction between PID-namespace unshare and the
        RCU-tasks grace period that results in a low-probability but
        very real hang

      - A race between an RCU tasks rude grace period on a single-CPU
        system and CPU-hotplug addition of the second CPU that can
        result in a too-short grace period

      - A race between shrinking RCU tasks down to a single callback
        list and queuing a new callback to some other CPU, but where
        that queuing is delayed for more than an RCU grace period. This
        can result in that callback being stranded on the non-boot CPU

 - Torture-test updates and fixes

 - Torture-test scripting updates and fixes

 - Provide additional RCU CPU stall-warning information in kernels built
   with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute
   timeout limit for expedited RCU CPU stall warnings

* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
  kernel/notifier: Remove CONFIG_SRCU
  init: Remove "select SRCU"
  fs/quota: Remove "select SRCU"
  fs/notify: Remove "select SRCU"
  fs/btrfs: Remove "select SRCU"
  fs: Remove CONFIG_SRCU
  drivers/pci/controller: Remove "select SRCU"
  drivers/net: Remove "select SRCU"
  drivers/md: Remove "select SRCU"
  drivers/hwtracing/stm: Remove "select SRCU"
  drivers/dax: Remove "select SRCU"
  drivers/base: Remove CONFIG_SRCU
  rcu: Disable laziness if lazy-tracking says so
  rcu: Track laziness during boot and suspend
  rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
  rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
  rcu: Align the output of RCU CPU stall warning messages
  rcu: Add RCU stall diagnosis information
  sched: Add helper nr_context_switches_cpu()
  ...
2023-02-21 10:45:51 -08:00
XU pengfei
d695e44157 dm: remove unnecessary (void*) conversion in event_callback()
Pointer variables of void * type do not require type cast.

Signed-off-by: XU pengfei <xupengfei@nfschina.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-20 11:52:49 -05:00
Hou Tao
a2f998a78a dm ioctl: remove unnecessary check when using dm_get_mdptr()
__hash_remove() removes hash_cell with _hash_lock locked, so acquiring
_hash_lock can guarantee no-NULL hc returned from dm_get_mdptr() must
have not been removed and hc->md must still be md.

__hash_remove() also acquires dm_hash_cells_mutex before setting mdptr
as NULL. So in dm_copy_name_and_uuid(), after acquiring
dm_hash_cells_mutex and ensuring returned hc is not NULL, the returned
hc must still be alive and hc->md must still be md.

Remove the unnecessary hc->md != md checks when using dm_get_mdptr()
with _hash_lock or dm_hash_cells_mutex acquired.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-17 14:49:21 -05:00
Mike Snitzer
69868bebfe dm ioctl: assert _hash_lock is held in __hash_remove
Also update dm_early_create() to take _hash_lock when calling both
__get_name_cell and __hash_remove -- given dm_early_create()'s early
boot usecase this locking isn't about correctness but it allows
lockdep_assert_held() to be added to __hash_remove.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-17 14:49:21 -05:00
Mike Snitzer
76227f6dc8 dm cache: add cond_resched() to various workqueue loops
Otherwise on resource constrained systems these workqueues may be too
greedy.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-17 14:49:12 -05:00