If a device has no NUMA node information associated with it, the driver
puts the device in node first_memory_node (say node 0). Not having a
NUMA node and being associated with node 0 are completely different
things and it makes little sense to mix the two.
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
The nvme_fc_fcp_op structure describing an AEN operation is initialized with a
null request structure pointer. An FC LLDD may make a call to
nvme_fc_io_getuuid passing a pointer to an nvmefc_fcp_req for an AEN operation.
Add validation of the request structure pointer before dereference.
Signed-off-by: Nigel Kirkland <nkirkland2304@gmail.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Statically allocated array of pointed to hwmon_channel_info can be made
const for safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
iov_len is the valid data length, so pass iov_len instead of sg->length to
bvec_set_page().
Fixes: 5bfaba275a ("nvmet-tcp: don't map pages which can't come from HIGHMEM")
Signed-off-by: Rakshana Sridhar <rakshanas@chelsio.com>
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
There isn't any reason to not support REQ_OP_ZONE_RESET_ALL given everything
is actually handled in userspace, not mention it is pretty easy to support
RESET_ALL.
So enable REQ_OP_ZONE_RESET_ALL and let userspace handle it.
Verified by 'tools/zbc_reset_zone -all /dev/ublkb0' in libzbc[1] with
libublk-rs based ublk-zoned target prototype[2], follows command line
for creating ublk-zoned:
cargo run --example zoned -- add -1 1024 # add $dev_id $DEV_SIZE
[1] https://github.com/westerndigitalcorporation/libzbc
[2] https://github.com/ming1/libublk-rs/tree/zoned.v2
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230810124326.321472-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD changes from Song:
"1. Fix perf regression for raid0 large sequential writes, by Jan Kara.
2. Fix split bio iostat for raid0, by David Jeffery.
3. Various raid1 fixes, by Heinz Mauelshagen and Xueshi Hu."
* tag 'md-next-20230817' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: raid0: account for split bio in iostat accounting
md/raid0: Fix performance regression for large sequential writes
md/raid0: Factor out helper for mapping and submitting a bio
md raid1: allow writebehind to work on any leg device set WriteMostly
md/raid1: hold the barrier until handle_read_error() finishes
md/raid1: free the r1bio before waiting for blocked rdev
md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
When a bio is split by md raid0, the newly created bio will not be tracked
by md for I/O accounting. Only the portion of I/O still assigned to the
original bio which was reduced by the split will be accounted for. This
results in md iostat data sometimes showing I/O values far below the actual
amount of data being sent through md.
md_account_bio() needs to be called for all bio generated by the bio split.
A simple example of the issue was generated using a raid0 device on partitions
to the same device. Since all raid0 I/O then goes to one device, it makes it
easy to see a gap between the md device and its sd storage. Reading an lvm
device on top of the md device, the iostat output (some 0 columns and extra
devices removed to make the data more compact) was:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read
md2 0.00 0.00 0.00 0.00 0
sde 0.00 0.00 0.00 0.00 0
md2 1364.00 411496.00 0.00 0.00 411496
sde 1734.00 646144.00 0.00 0.00 646144
md2 1699.00 510680.00 0.00 0.00 510680
sde 2155.00 802784.00 0.00 0.00 802784
md2 803.00 241480.00 0.00 0.00 241480
sde 1016.00 377888.00 0.00 0.00 377888
md2 0.00 0.00 0.00 0.00 0
sde 0.00 0.00 0.00 0.00 0
I/O was generated doing large direct I/O reads (12M) with dd to a linear
lvm volume on top of the 4 leg raid0 device.
The md2 reads were showing as roughly 2/3 of the reads to the sde device
containing all of md2's raid partitions. The sum of reads to sde was
1826816 kB, which was the expected amount as it was the amount read by
dd. With the patch, the total reads from md will match the reads from
sde and be consistent with the amount of I/O generated.
Fixes: 10764815ff ("md: add io accounting for raid0 and raid5")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com
Commit f00d7c85be ("md/raid0: fix up bio splitting.") among other
things changed how bio that needs to be split is submitted. Before this
commit, we have split the bio, mapped and submitted each part. After
this commit, we map only the first part of the split bio and submit the
second part unmapped. Due to bio sorting in __submit_bio_noacct() this
results in the following request ordering:
9,0 18 1181 0.525037895 15995 Q WS 1479315464 + 63392
Split off chunk-sized (1024 sectors) request:
9,0 18 1182 0.629019647 15995 X WS 1479315464 / 1479316488
Request is unaligned to the chunk so it's split in
raid0_make_request(). This is the first part mapped and punted to
bio_list:
8,0 18 7053 0.629020455 15995 A WS 739921928 + 1016 <- (9,0) 1479315464
Now raid0_make_request() returns, second part is postponed on
bio_list. __submit_bio_noacct() resorts the bio_list, mapped request
is submitted to the underlying device:
8,0 18 7054 0.629022782 15995 G WS 739921928 + 1016
Now we take another request from the bio_list which is the remainder
of the original huge request. Split off another chunk-sized bit from
it and the situation repeats:
9,0 18 1183 0.629024499 15995 X WS 1479316488 / 1479317512
8,16 18 6998 0.629025110 15995 A WS 739921928 + 1016 <- (9,0) 1479316488
8,16 18 6999 0.629026728 15995 G WS 739921928 + 1016
...
9,0 18 1184 0.629032940 15995 X WS 1479317512 / 1479318536 [libnetacq-write]
8,0 18 7059 0.629033294 15995 A WS 739922952 + 1016 <- (9,0) 1479317512
8,0 18 7060 0.629033902 15995 G WS 739922952 + 1016
...
This repeats until we consume the whole original huge request. Now we
finally get to processing the second parts of the split off requests
(in reverse order):
8,16 18 7181 0.629161384 15995 A WS 739952640 + 8 <- (9,0) 1479377920
8,0 18 7239 0.629162140 15995 A WS 739952640 + 8 <- (9,0) 1479376896
8,16 18 7186 0.629163881 15995 A WS 739951616 + 8 <- (9,0) 1479375872
8,0 18 7242 0.629164421 15995 A WS 739951616 + 8 <- (9,0) 1479374848
...
I guess it is obvious that this IO pattern is extremely inefficient way
to perform sequential IO. It also makes bio_list to grow to rather long
lengths.
Change raid0_make_request() to map both parts of the split bio. Since we
know we are provided with at most chunk-sized bios, we will always need
to split the incoming bio at most once.
Fixes: f00d7c85be ("md/raid0: fix up bio splitting.")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz
Signed-off-by: Song Liu <song@kernel.org>
Factor out helper function for mapping and submitting a bio out of
raid0_make_request(). We will use it later for submitting both parts of
a split bio.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230814092720.3931-1-jack@suse.cz
Signed-off-by: Song Liu <song@kernel.org>
handle_read_error() will call allow_barrier() to match the former barrier
raising. However, it should put the allow_barrier() at the end to avoid a
concurrent raid reshape.
Fixes: 689389a06c ("md/raid1: simplify handle_read_error().")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-4-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
Raid1 reshape will change mempool and r1conf::raid_disks which are
needed to free r1bio. allow_barrier() make a concurrent raid1_reshape()
possible. So, free the in-flight r1bio before waiting blocked rdev.
Fixes: 6bfe0b4990 ("md: support blocking writes to an array on device failure")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-3-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
After allow_barrier, a concurrent raid1_reshape() will replace old mempool
and r1conf::raid_disks. Move allow_barrier() to the end of raid_end_bio_io(),
so that r1bio can be freed safely.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-2-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
blk-iocost sometimes causes the following crash:
BUG: kernel NULL pointer dereference, address: 00000000000000e0
...
RIP: 0010:_raw_spin_lock+0x17/0x30
Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00
RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001
RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0
RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003
R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000
R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600
FS: 00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0
Call Trace:
<TASK>
ioc_weight_write+0x13d/0x410
cgroup_file_write+0x7a/0x130
kernfs_fop_write_iter+0xf5/0x170
vfs_write+0x298/0x370
ksys_write+0x5f/0xb0
__x64_sys_write+0x1b/0x20
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This happens because iocg->ioc is NULL. The field is initialized by
ioc_pd_init() and never cleared. The NULL deref is caused by
blkcg_activate_policy() installing blkg_policy_data before initializing it.
blkcg_activate_policy() was doing the following:
1. Allocate pd's for all existing blkg's and install them in blkg->pd[].
2. Initialize all pd's.
3. Online all pd's.
blkcg_activate_policy() only grabs the queue_lock and may release and
re-acquire the lock as allocation may need to sleep. ioc_weight_write()
grabs blkcg->lock and iterates all its blkg's. The two can race and if
ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a
pd which is not initialized yet, leading to crash.
The crash can be reproduced with the following script:
#!/bin/bash
echo +io > /sys/fs/cgroup/cgroup.subtree_control
systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct
echo 100 > /sys/fs/cgroup/system.slice/io.weight
bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" &
sleep .2
echo 100 > /sys/fs/cgroup/system.slice/io.weight
with the following patch applied:
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index fc49be622e05..38d671d5e10c 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
> pd->online = false;
> }
>
> + if (system_state == SYSTEM_RUNNING) {
> + spin_unlock_irq(&q->queue_lock);
> + ssleep(1);
> + spin_lock_irq(&q->queue_lock);
> + }
> +
> /* all allocated, init in the same order */
> if (pol->pd_init_fn)
> list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)
I don't see a reason why all pd's should be allocated, initialized and
onlined together. The only ordering requirement is that parent blkgs to be
initialized and onlined before children, which is guaranteed from the
walking order. Let's fix the bug by allocating, initializing and onlining pd
for each blkg and holding blkcg->lock over initialization and onlining. This
ensures that an installed blkg is always fully initialized and onlined
removing the the race window.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Breno Leitao <leitao@debian.org>
Fixes: 9d179b8654 ("blkcg: Fix multiple bugs in blkcg_activate_policy()")
Link: https://lore.kernel.org/r/ZN0p5_W-Q9mAHBVY@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 137380c0ec renamed 'rnbd-client' to 'rnbd_client', this changed
sysfs interface to /sys/devices/virtual/rnbd_client/ctl/map_device
from /sys/devices/virtual/rnbd-client/ctl/map_device.
CC: Ivan Orlov <ivan.orlov0322@gmail.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
CC: Jack Wang <jinpu.wang@ionos.com>
Fixes: 137380c0ec ("block/rnbd: make all 'class' structures const")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230816022210.2501228-1-lizhijian@fujitsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD fixes from Song:
"1. raid6test build fixes, by WANG Xuerui
2. Various non-urgent fixes."
* tag 'md-next-20230814-resend' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()
raid6: test: only check for Altivec if building on powerpc hosts
raid6: test: make sure all intermediate and artifact files are .gitignored
raid6: test: cosmetic cleanups for the test Makefile
raid6: guard the tables.c include of <linux/export.h> with __KERNEL__
raid6: remove the <linux/export.h> include from recov.c
md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread
md/raid10: fix a 'conf->barrier' leakage in raid10_takeover()
md: raid1: fix potential OOB in raid1_remove_disk()
md/raid5-cache: fix a deadlock in r5l_exit_log()
r5l_flush_stripe_to_raid() will check if the list 'flushing_ios' is
empty, and then submit 'flush_bio', however, r5l_log_flush_endio()
is clearing the list first and then clear the bio, which will cause
null-ptr-deref:
T1: submit flush io
raid5d
handle_active_stripes
r5l_flush_stripe_to_raid
// list is empty
// add 'io_end_ios' to the list
bio_init
submit_bio
// io1
T2: io1 is done
r5l_log_flush_endio
list_splice_tail_init
// clear the list
T3: submit new flush io
...
r5l_flush_stripe_to_raid
// list is empty
// add 'io_end_ios' to the list
bio_init
bio_uninit
// clear bio->bi_blkg
submit_bio
// null-ptr-deref
Fix this problem by clearing bio before clearing the list in
r5l_log_flush_endio().
Fixes: 0dd00cba99 ("raid5-cache: fully initialize flush_bio when needed")
Reported-and-tested-by: Corey Hickey <bugfood-ml@fatooh.org>
Closes: https://lore.kernel.org/all/cddd7213-3dfd-4ab7-a3ac-edd54d74a626@fatooh.org/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Altivec is only available for powerpc hosts, so only check for its
availability when the host is powerpc, to avoid error messages being
shown on architectures other than x86, arm or powerpc.
Signed-off-by: WANG Xuerui <git@xen0n.name>
Link: https://lore.kernel.org/r/20230731104911.411964-6-kernel@xen0n.name
Signed-off-by: Song Liu <song@kernel.org>
Currently when the raid6test utility is built, the resulting binary and
an int.uc file are not being ignored, which can get inadvertently
committed as a result when one works on the raid6 code. Ignore them to
make `git status` clean at all times.
Signed-off-by: WANG Xuerui <git@xen0n.name>
Link: https://lore.kernel.org/r/20230731104911.411964-5-kernel@xen0n.name
Signed-off-by: Song Liu <song@kernel.org>
Use tabs/spaces consistently: hard tabs for marking recipe lines only,
spaces for everything else.
Also, the OPTFLAGS declaration actually included the tabs preceding the
line comment, making compiler invocation lines unnecessarily long. As
the entire block of declarations are meant for ad-hoc customization
(otherwise they would probably make use of `?=` instead of `=`), move
the "Adjust as desired" comment above the block too to fix the long
invocation lines.
Signed-off-by: WANG Xuerui <git@xen0n.name>
Link: https://lore.kernel.org/r/20230731104911.411964-4-kernel@xen0n.name
Signed-off-by: Song Liu <song@kernel.org>
The export directives for the tables are already emitted with __KERNEL__
guards, but the <linux/export.h> include is not, causing errors when
building the raid6test program. Guard this include too to fix the
raid6test build.
Signed-off-by: WANG Xuerui <git@xen0n.name>
Link: https://lore.kernel.org/r/20230731104911.411964-3-kernel@xen0n.name
Signed-off-by: Song Liu <song@kernel.org>
There is no exported symbol left in recov.c, so the include is now
unnecessary, and breaks the raid6test build. Remove it.
Signed-off-by: WANG Xuerui <git@xen0n.name>
Link: https://lore.kernel.org/r/20230731104911.411964-2-kernel@xen0n.name
Signed-off-by: Song Liu <song@kernel.org>
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in
action_store"") removed the scenario of calling md_unregister_thread()
without holding mddev->reconfig_mutex, so add a lock holding check before
acquiring mddev->sync_thread by passing mdev to md_unregister_thread().
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
After commit b39f35ebe8 ("md: don't quiesce in mddev_suspend()"),
'conf->barrier' will be leaked in the case that raid10 takeover raid0:
level_store
pers->takeover -> raid10_takeover
raid10_takeover_raid0
WRITE_ONCE(conf->barrier, 1)
mddev_suspend
// still raid0
mddev->pers = pers
// switch to raid10
mddev_resume
// resume without suspend
After the above commit, mddev_resume() will not decrease 'conf->barrier'
that is set in raid10_takeover_raid0().
Fix this problem by not setting 'conf->barrier' in raid10_takeover_raid0().
By the way, this problem is found while I'm trying to make
mddev_suspend/resume() to be independent from raid personalities. raid10
is the only personality to use reference count in the quiesce() callback
and this problem is only related to raid10.
Fixes: b39f35ebe8 ("md: don't quiesce in mddev_suspend()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/r/20230731022800.1424902-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
If rddev->raid_disk is greater than mddev->raid_disks, there will be
an out-of-bounds in raid1_remove_disk(). We have already found
similar reports as follows:
1) commit d17f744e88 ("md-raid10: fix KASAN warning")
2) commit 1ebc2cec0b ("dm raid: fix KASAN warning in raid5_remove_disk")
Fix this bug by checking whether the "number" variable is
valid.
Signed-off-by: Zhang Shurong <zhang_shurong@foxmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/tencent_0D24426FAC6A21B69AC0C03CE4143A508F09@qq.com
Signed-off-by: Song Liu <song@kernel.org>
Commit b13015af94 ("md/raid5-cache: Clear conf->log after finishing
work") introduce a new problem:
// caller hold reconfig_mutex
r5l_exit_log
flush_work(&log->disable_writeback_work)
r5c_disable_writeback_async
wait_event
/*
* conf->log is not NULL, and mddev_trylock()
* will fail, wait_event() can never pass.
*/
conf->log = NULL
Fix this problem by setting 'config->log' to NULL before wake_up() as it
used to be, so that wait_event() from r5c_disable_writeback_async() can
exist. In the meantime, move forward md_unregister_thread() so that
null-ptr-deref this commit fixed can still be fixed.
Fixes: b13015af94 ("md/raid5-cache: Clear conf->log after finishing work")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230708091727.1417894-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Use memdup_user_nul() helper instead of open-coding
to simplify the code.
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230815114815.1551171-1-ruanjinjie@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The use of the "class" argument name in the ioprio_value() inline
function in include/uapi/linux/ioprio.h confuses C++ compilers
resulting in compilation errors such as:
/usr/include/linux/ioprio.h:110:43: error: expected primary-expression before ‘int’
110 | static __always_inline __u16 ioprio_value(int class, int level, int hint)
| ^~~
for user C++ programs including linux/ioprio.h.
Avoid these errors by renaming the arguments of the ioprio_value()
function to prioclass, priolevel and priohint. For consistency, the
arguments of the IOPRIO_PRIO_VALUE() and IOPRIO_PRIO_VALUE_HINT() macros
are also renamed in the same manner.
Reported-by: Igor Pylypiv <ipylypiv@google.com>
Fixes: 01584c1e23 ("scsi: block: Improve ioprio value validity checks")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Igor Pylypiv <ipylypiv@google.com>
Link: https://lore.kernel.org/r/20230814215833.259286-1-dlemoal@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_iov_iter_get_pages() trims the IO based on the block size of the
block device the IO will be issued to.
However, bcachefs is a multi device filesystem; when we're creating the
bio we don't yet know which block device the bio will be submitted to -
we have to handle the alignment checks elsewhere.
Thus this is needed to avoid a null ptr deref.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20230813182636.2966159-3-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
- blk_status_to_str() - error messages
- bio_add_folio() - this should definitely be exported for everyone,
it's the modern version of bio_add_page()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-block@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20230813182636.2966159-2-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The added check of 'req_op(req) == REQ_OP_ZONE_APPEND' should have been
done after the request is confirmed as valid.
Actually here, the request should always been true, so add one
WARN_ON_ONCE(!req), meantime move the zone_append check after
checking the request.
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 29802d7ca3 ("ublk: enable zoned storage support")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230811135216.420404-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit added a lockdep annotation, but botched it. Use the
right type.
Fixes: 4eb44d1076 ("block: remove init_mutex and open-code blk_iolatency_try_init")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the module init function, which by definition is used only
locally, so mark it static to avoid a warning:
drivers/block/swim3.c:1280:5: error: no previous prototype for 'swim3_init' [-Werror=missing-prototypes]
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit a13696b83d ("blk-iolatency: Make initialization lazy") adds
a mutex named "init_mutex" in blk_iolatency_try_init for the race
condition of initializing RQ_QOS_LATENCY.
Now a new lock has been add to struct request_queue by commit a13bd91be2
("block/rq_qos: protect rq_qos apis with a new lock"). And it has been
held in blkg_conf_open_bdev before calling blk_iolatency_init.
So it's not necessary to keep init_mutex in blk_iolatency_try_init, just
remove it.
Since init_mutex has been removed, blk_iolatency_try_init can be
open-coded back to iolatency_set_limit() like ioc_qos_write().
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Link: https://lore.kernel.org/r/20230810035111.2236335-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two warnings reported by smatch:
drivers/block/ublk_drv.c:445 ublk_setup_iod_zoned() warn:
signedness bug returning '(-95)'
drivers/block/ublk_drv.c:963 ublk_setup_iod() warn:
signedness bug returning '(-5)'
The type of "blk_status_t" is either be a u32 or u8, but this two
functions return a negative value when not supported or failed. Use
the error code of the blk module to fix these warnings.
Fixes: 29802d7ca3 ("ublk: enable zoned storage support")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202308100201.TCRhgdvN-lkp@intel.com/
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230810084836.3535322-1-lizetao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In general, the bvec data structure consists of one for physically
continuous pages. But, in the bvec configuration for bip, physically
continuous integrity pages are composed of each bvec.
Allow bio_integrity_add_page() to create multi-page bvecs, just like
the bio payloads. This simplifies adding larger payloads, and fixes
support for non-tiny workloads with nvme, which stopped using
scatterlist for metadata a while ago.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Fixes: 783b94bd92 ("nvme-pci: do not build a scatterlist to map metadata")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803025202epcms2p82f57cbfe32195da38c776377b55aed59@epcms2p8
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_integrity_add_page() returns the add length if successful, else 0,
just as bio_add_page. Simply check return value checking in
bio_integrity_prep to not deal with a > 0 but < len case that can't
happen.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803025058epcms2p5a4d0db5da2ad967668932d463661c633@epcms2p5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Previously, the bip's bi_size has been set before an integrity pages
were added. If a problem occurs in the process of adding pages for
bip, the bi_size mismatch problem must be dealt with.
When the page is successfully added to bvec, the bi_size is updated.
The parts affected by the change were also contained in this commit.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will be used for multi-page configuration for integrity payload.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803024827epcms2p838d9e9131492c86a159fff25d195658f@epcms2p8
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The original formula was inaccurate:
dd->async_depth = max(1UL, 3 * q->nr_requests / 4);
For write requests, when we assign a tags from sched_tags,
data->shallow_depth will be passed to sbitmap_find_bit,
see the following code:
nr = sbitmap_find_bit_in_word(&sb->map[index],
min_t (unsigned int,
__map_depth(sb, index),
depth),
alloc_hint, wrap);
The smaller of data->shallow_depth and __map_depth(sb, index)
will be used as the maximum range when allocating bits.
For a mmc device (one hw queue, deadline I/O scheduler):
q->nr_requests = sched_tags = 128, so according to the previous
calculation method, dd->async_depth = data->shallow_depth = 96,
and the platform is 64bits with 8 cpus, sched_tags.bitmap_tags.sb.shift=5,
sb.maps[]=32/32/32/32, 32 is smaller than 96, whether it is a read or
a write I/O, tags can be allocated to the maximum range each time,
which has not throttling effect.
In addition, refer to the methods of bfg/kyber I/O scheduler,
limit ratiois are calculated base on sched_tags.bitmap_tags.sb.shift.
This patch can throttle write requests really.
Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/1691061162-22898-1-git-send-email-zhiguo.niu@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add zoned storage support to ublk: report_zones and operations:
- REQ_OP_ZONE_OPEN
- REQ_OP_ZONE_CLOSE
- REQ_OP_ZONE_FINISH
- REQ_OP_ZONE_RESET
- REQ_OP_ZONE_APPEND
The zone append feature uses the `addr` field of `struct ublksrv_io_cmd` to
communicate ALBA back to the kernel. Therefore ublk must be used with the
user copy feature (UBLK_F_USER_COPY) for zoned storage support to be
available. Without this feature, ublk will not allow zoned storage support.
Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230804114610.179530-4-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for zoned storage support, move the check for empty `addr`
field into the command handler case statement. Note that the check makes no
sense for `UBLK_IO_NEED_GET_DATA` because the `addr` field must always be
set for this command.
Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230804114610.179530-3-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will be used by ublk zoned storage support.
Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230804114610.179530-2-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The iocg can have three throttled metrics: wait, debt, delay. This patch
add missing wait_ms to IocgStat to show the latest wait_ms of iocg.
As we are here, group iocg usage percents "inflt%" and "usage%" together,
and group iocg throttled metrics "wait", "debt" and "delay" together.
Effect after changes:
nvme0n1 RUN per=50.0ms cur_per=177105.713:v1053528.587 busy= +0 vrate=135.00%:270.00% params=ssd_dfl(CQ)
active weight hweight% inflt% usage% wait debt delay
InterfererGroup0 * 100/ 100 54.28/ 9.09 0.34 24.07 0.00 0.00 0.00
interfered * 84/ 1000 45.72/ 90.91 0.48 41.09 0.00 0.00 0.00
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230804065039.8885-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The real vrate iocost inuse is not base_vrate, but the atomic vtime_rate.
We need iocost_monitor tool to display this real vrate that iocost use,
to check if the boosted compensated vrate is normal.
Effect after change:
nvme0n1 RUN per=50.0ms cur_per=172116.580:v1040587.433 busy= +0 \
vrate=135.00%:270.00% params=ssd_dfl(CQ)
^
|
this is real vrate inuse
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230804065039.8885-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When I use iocost_monitor on nvme0n1, this error shows up:
"Could not find ioc for nvme0n1"
There is no kobj in struct queue in recent kernel, it seems that the commit
2bd85221a6 ("block: untangle request_queue refcounting from sysfs")
move the queue kobj to struct gendisk.
Fix it by using mq_kobj which is at the same level with queue kobj.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230804065039.8885-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are some compile errors reported by kernel test robot:
arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_read':
storage.c:(.text+0x64): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x9c): undefined reference to `__bread_gfp'
arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_strnlen':
storage.c:(.text+0x128): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x16c): undefined reference to `__bread_gfp'
arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_strcmp':
storage.c:(.text+0x22c): undefined reference to `__bread_gfp'
arm-linux-gnueabi-ld: storage.c:(.text+0x27c): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x2a8): undefined reference to `__bread_gfp'
arm-linux-gnueabi-ld: storage.c:(.text+0x2bc): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x2d4): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x2f4): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x304): undefined reference to `__brelse'
The reason for the problem is that the commit
"925c86a19bac" ("fs: add CONFIG_BUFFER_HEAD") has added a new config
"CONFIG_BUFFER_HEAD" that controls building the buffer_head code, and
romfs needs to use the buffer_head API, but no corresponding config has
beed added. Select the config "CONFIG_BUFFER_HEAD" in romfs Kconfig to
resolve the problem.
Fixes: 925c86a19b ("fs: add CONFIG_BUFFER_HEAD")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308031810.pQzGmR1v-lkp@intel.com/
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Li Zetao <lizetao1@huawei.com>
[axboe: fold in Christoph's incremental]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new config option that controls building the buffer_head code, and
select it from all file systems and stacking drivers that need it.
For the block device nodes and alternative iomap based buffered I/O path
is provided when buffer_head support is not enabled, and iomap needs a
a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call
into the buffer_head code when it doesn't exist.
Otherwise this is just Kconfig and ifdef changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use iomap in buffer_head compat mode to write to block devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>