Commit Graph

1105865 Commits

Author SHA1 Message Date
Mikulas Patocka
b7f362d641 dm writecache: fix smatch warning about invalid return from writecache_map
There's a smatch warning "inconsistent returns '&wc->lock'" in
dm-writecache. The reason for the warning is that writecache_map()
doesn't drop the lock on the impossible path.

Fix this warning by adding wc_unlock() after the BUG statement (so
that it will be compiled-away anyway).

Fixes: df699cc16e ("dm writecache: report invalid return from writecache_map helpers")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-09 19:20:23 -04:00
Mike Snitzer
f876df9f12 dm verity: fix verity_parse_opt_args parsing
Commit df326e7a06 ("dm verity: allow optional args to alter primary
args handling") introduced a bug where verity_parse_opt_args() wouldn't
properly shift past an optional argument's additional params (by
ignoring them).

Fix this by avoiding returning with error if an unknown argument is
encountered when @only_modifier_opts=true is passed to
verity_parse_opt_args().

In practice this regressed the cryptsetup testsuite's FEC testing
because unknown optional arguments were encountered, wherey
short-circuiting ever testing FEC mode. With this fix all of the
cryptsetup testsuite's verity FEC tests pass.

Fixes: df326e7a06 ("dm verity: allow optional args to alter primary args handling")
Reported-by: Milan Broz <gmazyland@gmail.com>>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-09 19:15:17 -04:00
Mike Snitzer
8c22816dbc dm verity: fix DM_VERITY_OPTS_MAX value yet again
Must account for the possibility that "try_verify_in_tasklet" is used.

This is the same issue that was fixed with commit 160f99db94 -- it
is far too easy to miss that additional a new argument(s) require
bumping DM_VERITY_OPTS_MAX accordingly.

Fixes: 5721d4e5a9 ("dm verity: Add optional "try_verify_in_tasklet" feature")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-09 19:12:14 -04:00
Mike Snitzer
b33b6fdca3 dm bufio: simplify DM_BUFIO_CLIENT_NO_SLEEP locking
Historically none of the bufio code runs in interrupt context but with
the use of DM_BUFIO_CLIENT_NO_SLEEP a bufio client can, see: commit
5721d4e5a9 ("dm verity: Add optional "try_verify_in_tasklet" feature")
That said, the new tasklet usecase still doesn't require interrupts be
disabled by bufio (let alone conditionally restore them).

Yet with PREEMPT_RT, and falling back from tasklet to workqueue, care
must be taken to properly synchronize between softirq and process
context, otherwise ABBA deadlock may occur. While it is unnecessary to
disable bottom-half preemption within a tasklet, we must consistently do
so in process context to ensure locking is in the proper order.

Fix these issues by switching from spin_lock_irq{save,restore} to using
spin_{lock,unlock}_bh instead. Also remove the 'spinlock_flags' member
in dm_bufio_client struct (that can be used unsafely if bufio must
recurse on behalf of some caller, e.g. block layer's submit_bio).

Fixes: 5721d4e5a9 ("dm verity: Add optional "try_verify_in_tasklet" feature")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2022-08-09 19:12:14 -04:00
Mike Snitzer
12907efde6 dm verity: have verify_wq use WQ_HIGHPRI if "try_verify_in_tasklet"
Allow verify_wq to preempt softirq since verification in tasklet will
fall-back to using it for error handling (or if the bufio cache
doesn't have required hashes).

Suggested-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-04 15:59:52 -04:00
Mike Snitzer
43fa47cb11 dm verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND
The documentation [1] says that WQ_CPU_INTENSIVE is "meaningless" for
unbound wq. So remove WQ_CPU_INTENSIVE from the verify_wq allocation.

1. https://www.kernel.org/doc/html/latest/core-api/workqueue.html#flags

Suggested-by: Maksym Planeta <mplaneta@os.inf.tu-dresden.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-04 14:53:07 -04:00
Mike Snitzer
e9307e3deb dm verity: only copy bvec_iter in verity_verify_io if in_tasklet
Avoid extra bvec_iter copy unless it is needed to allow retrying
verification, that failed from a tasklet, from a workqueue.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-04 14:33:42 -04:00
Mike Snitzer
0a36463f4c dm verity: optimize verity_verify_io if FEC not configured
Only declare and copy bvec_iter if CONFIG_DM_VERITY_FEC is defined and
FEC enabled for the verity device.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-04 14:02:20 -04:00
Mike Snitzer
ba2cce82ba dm verity: conditionally enable branching for "try_verify_in_tasklet"
Use jump_label to limit the need for branching unless the optional
"try_verify_in_tasklet" feature is used.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-04 13:50:43 -04:00
Mike Snitzer
3c1c875d05 dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP
Use jump_label to limit the need for branching unless the optional
DM_BUFIO_CLIENT_NO_SLEEP is used.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-04 13:50:43 -04:00
Mike Snitzer
df326e7a06 dm verity: allow optional args to alter primary args handling
The previous commit ("dm verity: Add optional "try_verify_in_tasklet"
feature") imposed that CRYPTO_ALG_ASYNC mask be used even if the
optional "try_verify_in_tasklet" feature was not specified. This was
because verity_parse_opt_args() was called after handling the primary
args (due to it having data dependencies on having first parsed all
primary args).

Enhance verity_ctr() so that simple optional args, that don't have a
data dependency on primary args parsing, can alter how the primary
args are handled. In practice this means verity_parse_opt_args() gets
called twice. First with the new 'only_modifier_opts' arg set to true,
then again with it set to false _after_ parsing all primary args.

This allows the v->use_tasklet flag to be properly set and then used
when verity_ctr() parses the primary args and then calls
crypto_alloc_ahash() with CRYPTO_ALG_ASYNC conditionally set.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-04 13:50:43 -04:00
Nathan Huckleberry
5721d4e5a9 dm verity: Add optional "try_verify_in_tasklet" feature
Using tasklets for disk verification can reduce IO latency. When there
are accelerated hash instructions it is often better to compute the
hash immediately using a tasklet rather than deferring verification to
a work-queue. This reduces time spent waiting to schedule work-queue
jobs, but requires spending slightly more time in interrupt context.

If the dm-bufio cache does not have the required hashes we fallback to
the work-queue implementation. FEC is only possible using work-queue
because code to support the FEC feature may sleep.

The following shows a speed comparison of random reads on a dm-verity
device. The dm-verity device uses a 1G ramdisk for data and a 1G
ramdisk for hashes. One test was run using tasklets and one test was
run using the existing work-queue solution. Both tests were run when
the dm-bufio cache was hot. The tasklet implementation performs
significantly better since there is no time spent waiting for
work-queue jobs to be scheduled.

   READ: bw=181MiB/s (190MB/s), 181MiB/s-181MiB/s (190MB/s-190MB/s),
   io=512MiB (537MB), run=2827-2827msec
   READ: bw=23.6MiB/s (24.8MB/s), 23.6MiB/s-23.6MiB/s (24.8MB/s-24.8MB/s),
   io=512MiB (537MB), run=21688-21688msec

Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-04 13:50:40 -04:00
Nathan Huckleberry
b32d45824a dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag
Add an optional flag that ensures dm_bufio_client does not sleep
(primary focus is to service dm_bufio_get without sleeping). This
allows the dm-bufio cache to be queried from interrupt context.

To ensure that dm-bufio does not sleep, dm-bufio must use a spinlock
instead of a mutex. Additionally, to avoid deadlocks, special care
must be taken so that dm-bufio does not sleep while holding the
spinlock.

But again: the scope of this no_sleep is initially confined to
dm_bufio_get, so __alloc_buffer_wait_no_callback is _not_ changed to
avoid sleeping because __bufio_new avoids allocation for NF_GET.

Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:46:14 -04:00
Nathan Huckleberry
0fcb100d50 dm bufio: Add flags argument to dm_bufio_client_create
Add a flags argument to dm_bufio_client_create and update all the
callers. This is in preparation to add the DM_BUFIO_NO_SLEEP flag.

Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:46:14 -04:00
Mike Snitzer
9dd1cd3220 dm: fix dm-raid crash if md_handle_request() splits bio
Commit ca522482e3 ("dm: pass NULL bdev to bio_alloc_clone")
introduced the optimization to _not_ perform bio_associate_blkg()'s
relatively costly work when DM core clones its bio. But in doing so it
exposed the possibility for DM's cloned bio to alter DM target
behavior (e.g. crash) if a target were to issue IO without first
calling bio_set_dev().

The DM raid target can trigger an MD crash due to its need to split
the DM bio that is passed to md_handle_request(). The split will
recurse to submit_bio_noacct() using a bio with an uninitialized
->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to
dereference a NULL blkg_to_tg(bio->bi_blkg).

Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that
will make alloc_tio() call bio_set_dev() on behalf of the target.
dm-raid is the only target that requires this flag. bio_set_dev()
initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg,
before passing the bio to md_handle_request().

Long-term fix would be to audit and refactor MD code to rely on DM to
split its bio, using dm_accept_partial_bio(), but there are MD raid
personalities (e.g. raid1 and raid10) whose implementation are tightly
coupled to handling the bio splitting inline.

Fixes: ca522482e3 ("dm: pass NULL bdev to bio_alloc_clone")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:36:30 -04:00
Mikulas Patocka
7dad24db59 dm raid: fix address sanitizer warning in raid_resume
There is a KASAN warning in raid_resume when running the lvm test
lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks
is greater than rs->raid_disks, so the loop touches one entry beyond
the allocated length.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mikulas Patocka
1fbeea217d dm raid: fix address sanitizer warning in raid_status
There is this warning when using a kernel with the address sanitizer
and running this testsuite:
https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid

==================================================================
BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid]
Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319
CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
 <TASK>
 dump_stack_lvl+0x6a/0x9c
 print_address_description.constprop.0+0x1f/0x1e0
 print_report.cold+0x55/0x244
 kasan_report+0xc9/0x100
 raid_status+0x1747/0x2820 [dm_raid]
 dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod]
 table_load+0x35c/0x630 [dm_mod]
 ctl_ioctl+0x411/0x630 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 __x64_sys_ioctl+0x12a/0x1a0
 do_syscall_64+0x5b/0x80

The warning is caused by reading conf->max_nr_stripes in raid_status. The
code in raid_status reads mddev->private, casts it to struct r5conf and
reads the entry max_nr_stripes.

However, if we have different raid type than 4/5/6, mddev->private
doesn't point to struct r5conf; it may point to struct r0conf, struct
r1conf, struct r10conf or struct mpconf. If we cast a pointer to one
of these structs to struct r5conf, we will be reading invalid memory
and KASAN warns about it.

Fix this bug by reading struct r5conf only if raid type is 4, 5 or 6.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Christie
c6adada5b7 dm: Start pr_preempt from the same starting path
pr_preempt has a similar issue as reserve where for all the
reservation types except the All Registrants ones the preempt can
create a reservation. And a follow up reservation or release needs to
go down the same path the preempt did. This has the pr_preempt work
like reserve and release where we always start from the first path in
the first group.

This commit has been tested with windows failover clustering's
validation test and libiscsi's PGR tests to check for regressions.
They both don't have tests to verify this case, so I tested it
manually.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Christie
08a3c338e0 dm: Fix PR release handling for non All Registrants
This commit fixes a bug where we are leaving the reservation in place
even though pr_release has run and returned success.

If we have a Write Exclusive, Exclusive Access, or Write/Exclusive
Registrants only reservation, the release must be sent down the path
that is the reservation holder. The problem is multipath_prepare_ioctl
most likely selected path N for the reservation, then later when we do
the release multipath_prepare_ioctl will select a completely different
path. The device will then return success becuase the nvme and scsi
specs say to return success if there is no reservation or if the
release is sent down from a path that is not the holder. We then think
we have released the reservation.

This commit has us loop over each path and send a release so we can
make sure the release is executed on the correct path. It has been
tested with windows failover clustering's validation test which checks
this case, and it has been tested manually (the libiscsi PGR tests
don't have a test case for this yet, but I will be adding one).

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Christie
7015108759 dm: Start pr_reserve from the same starting path
When an app does a pr_reserve it will go to whatever path we happen to
be using at the time. This can result in errors when the app does a
second pr_reserve call and expects success but gets a failure because
the reserve is not done on the holder's path. This commit has us
always start trying to do reserves from the first path in the first
group.

Windows failover clustering will produce the type of pattern above.
With this commit, we will now pass its validation test for this case.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Christie
8dd87f3c52 dm: Allow dm_call_pr to be used for path searches
The specs state that if you send a reserve down a path that is already
the holder success must be returned and if it goes down a path that
is not the holder reservation conflict must be returned. Windows
failover clustering will send a second reservation and expects that a
device returns success. The problem for multipathing is that for an
All Registrants reservation, we can send the reserve down any path but
for all other reservation types there is one path that is the holder.

To handle this we could add PR state to dm but that can get nasty.
Look at target_core_pr.c for an example of the type of things we'd
have to track. It will also get more complicated because other
initiators can change the state so we will have to add in async
event/sense handling.

This commit, and the 3 commits that follow, tries to keep dm simple
and keep just doing passthrough. This commit modifies dm_call_pr to be
able to find the first usable path that can execute our pr_op then
return. When dm_pr_reserve is converted to dm_call_pr in the next
commit for the normal case we will use the same path for every
reserve.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Snitzer
e120a5f1e7 dm: return early from dm_pr_call() if DM device is suspended
Otherwise PR ops may be issued while the broader DM device is being
reconfigured, etc.

Fixes: 9c72bad1f3 ("dm: call PR reserve/unreserve on each underlying device")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Luo Meng
3534e5a5ed dm thin: fix use-after-free crash in dm_sm_register_threshold_callback
Fault inject on pool metadata device reports:
  BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80
  Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950

  CPU: 7 PID: 950 Comm: dmsetup Tainted: G        W         5.19.0-rc6 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   print_address_description.constprop.0.cold+0xeb/0x3f4
   kasan_report.cold+0xe6/0x147
   dm_pool_register_metadata_threshold+0x40/0x80
   pool_ctr+0xa0a/0x1150
   dm_table_add_target+0x2c8/0x640
   table_load+0x1fd/0x430
   ctl_ioctl+0x2c4/0x5a0
   dm_ctl_ioctl+0xa/0x10
   __x64_sys_ioctl+0xb3/0xd0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

This can be easily reproduced using:
  echo offline > /sys/block/sda/device/state
  dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10
  dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0"

If a metadata commit fails, the transaction will be aborted and the
metadata space maps will be destroyed. If a DM table reload then
happens for this failed thin-pool, a use-after-free will occur in
dm_sm_register_threshold_callback (called from
dm_pool_register_metadata_threshold).

Fix this by in dm_pool_register_metadata_threshold() by returning the
-EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr()
with a new error message: "Error registering metadata threshold".

Fixes: ac8c3f3df6 ("dm thin: generate event when metadata threshold passed")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-15 18:09:14 -04:00
Mikulas Patocka
2ee73ef60d dm writecache: count number of blocks discarded, not number of discard bios
Change dm-writecache, so that it counts the number of blocks discarded
instead of the number of discard bios. Make it consistent with the
read and write statistics counters that were changed to count the
number of blocks instead of bios.

Fixes: e3a35d0340 ("dm writecache: add event counters")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:54:46 -04:00
Mikulas Patocka
b2676e1482 dm writecache: count number of blocks written, not number of write bios
Change dm-writecache, so that it counts the number of blocks written
instead of the number of write bios. Bios can be split and requeued
using the dm_accept_partial_bio function, so counting bios caused
inaccurate results.

Fixes: e3a35d0340 ("dm writecache: add event counters")
Reported-by: Yu Kuai <yukuai1@huaweicloud.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:53:25 -04:00
Mikulas Patocka
2c6e755b49 dm writecache: count number of blocks read, not number of read bios
Change dm-writecache, so that it counts the number of blocks read
instead of the number of read bios. Bios can be split and requeued
using the dm_accept_partial_bio function, so counting bios caused
inaccurate results.

Fixes: e3a35d0340 ("dm writecache: add event counters")
Reported-by: Yu Kuai <yukuai1@huaweicloud.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:52:32 -04:00
Mikulas Patocka
9bc0c92e4b dm writecache: return void from functions
The functions writecache_map_remap_origin and writecache_bio_copy_ssd
only return a single value, thus they can be made to return void.

This helps simplify the following IO accounting changes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:52:31 -04:00
Mikulas Patocka
949d49ec30 dm kcopyd: use __GFP_HIGHMEM when allocating pages
dm-kcopyd doesn't access the allocated pages directly, it only passes
them to dm-io which adds them to a bio list - thus, we can allocate
the pages from high memory. This will reduce pressure on the low
memory when there are a large number of kcopyd jobs in progress.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:52:31 -04:00
Mikulas Patocka
ca7dc242e3 dm writecache: set a default MAX_WRITEBACK_JOBS
dm-writecache has the capability to limit the number of writeback jobs
in progress. However, this feature was off by default. As such there
were some out-of-memory crashes observed when lowering the low
watermark while the cache is full.

This commit enables writeback limit by default. It is set to 256MiB or
1/16 of total system memory, whichever is smaller.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:52:31 -04:00
Bagas Sanjaya
11093e6f0d Documentation: dm writecache: Render status list as list
The status list isn't rendered as list, but rather as normal paragraph,
because there is missing blank line between "Status:" line and the list.

Fix the issue by adding the blank line separator.

Fixes: 48debafe4f ("dm: add writecache target")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 14:40:32 -04:00
Mauro Carvalho Chehab
8b301af4c6 Documentation: dm writecache: add blank line before optional parameters
Otherwise this warning occurs:
  Documentation/admin-guide/device-mapper/writecache.rst:23: WARNING: Unexpected indentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 14:37:58 -04:00
Zhang Jiaming
962c6296f0 dm snapshot: fix typo in snapshot_map() comment
Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:39 -04:00
Jiang Jian
ce92fc4b8b dm raid: remove redundant "the" in parse_raid_params() comment
Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:38 -04:00
Steven Lung
5c29e78473 dm cache: fix typo in 2 comment blocks
Replace neccessarily with necessarily.

Signed-off-by: Steven Lung <1030steven@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:37 -04:00
JeongHyeon Lee
20e6fc8562 dm verity: fix checkpatch close brace error
Resolves: ERROR: else should follow close brace '}'

Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:36 -04:00
Mike Snitzer
899ab445a4 dm table: rename dm_target variable in dm_table_add_target()
Rename from "tgt" to "ti" so that all of dm-table.c code uses the same
naming for dm_target variables.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:35 -04:00
Mike Snitzer
564b5c5476 dm table: audit all dm_table_get_target() callers
All callers of dm_table_get_target() are expected to do proper bounds
checking on the index they pass.

Move dm_table_get_target() to dm-core.h to make it extra clear that only
DM core code should be using it. Switch it to be inlined while at it.

Standardize all DM core callers to use the same for loop pattern and
make associated variables as local as possible. Rename some variables
(e.g. s/table/t/ and s/tgt/ti/) along the way.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:34 -04:00
Mike Snitzer
2aec377a29 dm table: remove dm_table_get_num_targets() wrapper
More efficient and readable to just access table->num_targets directly.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:33 -04:00
Ming Lei
8b211aaccb dm: add two stage requeue mechanism
Commit 61b6e2e532 ("dm: fix BLK_STS_DM_REQUEUE handling when dm_io
represents split bio") reverted DM core's bio splitting back to using
bio_split()+bio_chain() because it was found that otherwise DM's
BLK_STS_DM_REQUEUE would trigger a live-lock waiting for bio
completion that would never occur.

Restore using bio_trim()+bio_inc_remaining(), like was done in commit
7dd76d1fee ("dm: improve bio splitting and associated IO
accounting"), but this time with proper handling for the above
scenario that is covered in more detail in the commit header for
61b6e2e532.

Solve this issue by adding a two staged dm_io requeue mechanism that
uses the new dm_bio_rewind() via dm_io_rewind():

1) requeue the dm_io into the requeue_list added to struct
   mapped_device, and schedule it via new added requeue work. This
   workqueue just clones the dm_io->orig_bio (which DM saves and
   ensures its end sector isn't modified). dm_io_rewind() uses the
   sectors and sectors_offset members of the dm_io that are recorded
   relative to the end of orig_bio: dm_bio_rewind()+bio_trim() are
   then used to make that cloned bio reflect the subset of the
   original bio that is represented by the dm_io that is being
   requeued.

2) the 2nd stage requeue is same with original requeue, but
   io->orig_bio points to new cloned bio (which matches the requeued
   dm_io as described above).

This allows DM core to shift the need for bio cloning from bio-split
time (during IO submission) to the less likely BLK_STS_DM_REQUEUE
handling (after IO completes with that error).

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:32 -04:00
Ming Lei
61cbe7888d dm: add dm_bio_rewind() API to DM core
Commit 7759eb23fd ("block: remove bio_rewind_iter()") removed
a similar API for the following reasons:
    ```
    It is pointed that bio_rewind_iter() is one very bad API[1]:

    1) bio size may not be restored after rewinding

    2) it causes some bogus change, such as 5151842b9d (block: reset
    bi_iter.bi_done after splitting bio)

    3) rewinding really makes things complicated wrt. bio splitting

    4) unnecessary updating of .bi_done in fast path

    [1] https://marc.info/?t=153549924200005&r=1&w=2

    So this patch takes Kent's suggestion to restore one bio into its original
    state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
    given now bio_rewind_iter() is only used by bio integrity code.
    ```
However, saving off a copy of the 32 bytes bio->bi_iter in case rewind
needed isn't efficient because it bloats per-bio-data for what is an
unlikely case. That suggestion also ignores the need to restore
crypto and integrity info.

Add dm_bio_rewind() API for a specific use-case that is much more narrow
than the previous more generic rewind code that was reverted:

1) most bios have a fixed end sector since bio split is done from front
   of the bio, if driver just records how many sectors between current
   bio's start sector and the original bio's end sector, the original
   position can be restored. Keeping the original bio's end sector
   fixed is a _hard_ requirement for this interface!

2) if a bio's end sector won't change (usually bio_trim() isn't
   called, or in the case of DM it preserves original bio), user can
   restore the original position by storing sector offset from the
   current ->bi_iter.bi_sector to bio's end sector; together with
   saving bio size, only 8 bytes is needed to restore to original
   bio.

3) DM's requeue use case: when BLK_STS_DM_REQUEUE happens, DM core
   needs to restore to an "original bio" which represents the current
   dm_io to be requeued (which may be a subset of the original bio).
   By storing the sector offset from the original bio's end sector and
   dm_io's size, dm_bio_rewind() can restore such original bio. See
   commit 7dd76d1fee ("dm: improve bio splitting and associated IO
   accounting") for more details on how DM does this. Leveraging this,
   allows DM core to shift the need for bio cloning from bio-split
   time (during IO submission) to the less likely BLK_STS_DM_REQUEUE
   handling (after IO completes with that error).

4) Unlike the original rewind API, dm_bio_rewind() doesn't add .bi_done
   to bvec_iter and there is no effect on the fast path.

Implement dm_bio_rewind() by factoring out clear helpers that it calls:
dm_bio_integrity_rewind, dm_bio_crypt_rewind and dm_bio_rewind_iter.

DM is able to ensure that dm_bio_rewind() is used safely but, given
the constraint that the bio's end must never change, other
hypothetical future callers may not take the same care. So make
dm_bio_rewind() and all supporting code local to DM to avoid risk of
hypothetical abuse. A "dm_" prefix was added to all functions to avoid
any namespace collisions.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:26:55 -04:00
Ming Lei
444fe04f7a dm: improve BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling
If either BLK_STS_DM_REQUEUE or BLK_STS_AGAIN is returned for POLLED
io, we requeue the original bio into deferred list and kick md->wq to
re-submit it to block layer.

Improve the handling in the following way:

1) Factor out dm_handle_requeue() for handling dm_io requeue.

2) Unify handling for BLK_STS_DM_REQUEUE and BLK_STS_AGAIN: clear
   REQ_POLLED for BLK_STS_DM_REQUEUE too, for the sake of simplicity,
   given BLK_STS_DM_REQUEUE is very unusual.

3) Queue md->wq explicitly in dm_handle_requeue(), so requeue handling
   becomes more robust.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-29 14:17:16 -04:00
Christoph Hellwig
e810cb78bc dm: refactor dm_md_mempool allocation
The current split between dm_table_alloc_md_mempools and
dm_alloc_md_mempools is rather arbitrary, so merge the two
into one easy to follow function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-29 12:46:06 -04:00
Christoph Hellwig
4ed045d875 dm: unexport dm_get_reserved_rq_based_ios
dm_get_reserved_rq_based_ios is only used in the core dm code, so
remove the export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-29 12:46:05 -04:00
Christoph Hellwig
22d0c4080f block: simplify disk_set_independent_access_ranges
Lift setting disk->ia_ranges from disk_register_independent_access_ranges
into disk_set_independent_access_ranges, and make the behavior the same
for the registered vs non-registered queue cases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220629062013.1331068-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29 08:36:46 -06:00
Christoph Hellwig
6a27d28c81 block: move ->ia_ranges from the request_queue to the gendisk
Independent access ranges only matter for file system I/O and are only
valid with a registered gendisk, so move them there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220629062013.1331068-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29 08:36:46 -06:00
Ying Sun
b9a1c179bd block: remove "select BLK_RQ_IO_DATA_LEN" from BLK_CGROUP_IOCOST dependency
The configuration item BLK_RQ_IO_DATA_LEN is not declared in the kernel.
Select BLK_RQ_IO_DATA_LEN is meaningless which could be removed.

Signed-off-by: Ying Sun <sunying@nj.iscas.ac.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220629062409.19458-1-sunying@nj.iscas.ac.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29 08:35:57 -06:00
Christoph Hellwig
8682b92e5a blk-mq: cleanup disk sysfs registration
Pass a gendisk to the sysfs register/unregister functions and give
them descriptive names.  Also move the unregistration helper next
to the one doing the registration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
eaa870f975 blk-mq: rename blk_mq_sysfs_{,un}register
Add a _hctx postfix to better describe what the functions do, match
the debugfs equivalents and release the old names for functions that
should be using them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
81f0c2ef41 block: remove the extra gendisk reference in __blk_mq_register_dev
kobject_add already grabs a reference to the parent, no need to have
another one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00
Christoph Hellwig
4a8d14bba4 block: use default groups to register the queue attributes
Set up the default_groups for blk_queue_ktype instead of manually calling
sysfs_create_group.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220628171850.1313069-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 11:32:42 -06:00