Add Hannes as the nvme-auth maintainer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
The code to set the result field for the admin and I/O connect commands
is not only verbose and duplicated, but also violates the aliasing
rules as it accesses both the u16 and u32 members in the union.
Add a little helper to sort all that out.
Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Mark them as unsigned so that we don't need extra casts, and define
them relative to cdword0 instead of requiring extra shifts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Currently blktests nvme/002 trips up debugobjects if CONFIG_NVME_AUTH is
enabled, but authentication is not on a queue. This is because
nvmet_auth_sq_free cancels sq->auth_expired_work unconditionaly, while
auth_expired_work is only ever initialized if authentication is enabled
for a given controller.
Fix this by calling most of what is nvmet_init_auth unconditionally
when initializing the SQ, and just do the setting of the result
field in the connect command handler.
Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
There is only a single call-site of nvmet_tcp_finish_cmd(), this
becomes redundant. Remove nvmet_tcp_finish_cmd() and use the original
function body instead.
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
ttag is used as an index to get cmd in nvmet_tcp_handle_h2c_data_pdu(),
add a bounds check to avoid out-of-bounds access.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
As per NVMe/TCP transport specification ICReq PDU is the first PDU received
by the controller and controller should receive only one ICReq PDU.
If controller receives more than one ICReq PDU then this can be considered
as fatal error.
nvmet-tcp driver does not check for ICReq PDU opcode if queue state is
NVMET_TCP_Q_LIVE. In LIVE state ICReq PDU is treated as CapsuleCmd PDU,
this can result in abnormal behavior.
Add a check for ICReq PDU in nvmet_tcp_done_recv_pdu() to fix this issue.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We've been reporting 2 maps regardless of whether the module parameter
asked for anything beyond the default queues. A consequence of this
means that blk-mq will reinitialize the all the hardware contexts and io
schedulers on every controller reset when the mapping is exactly the
same as before. This unnecessary overhead is adding several milliseconds
on a reset for environments that don't need it. Report the actual number
of mappings in use.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If swiotlb is force enabled dma_max_mapping_size ends up calling
swiotlb_max_mapping_size which takes into account the min align mask for
the device. Set the min align mask for nvme driver before calling
dma_max_mapping_size while calculating max hw sectors.
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When a discovery controller is disconnected, no AENs will arrive to
notify the host about discovery log change events.
In order to solve this, send a uevent notification when a
persistent discovery controller reconnects. We add a new ctrl
flag NVME_CTRL_STARTED_ONCE that will be set on the first
start, and consecutive calls will find it set, and send the
event to userspace if the controller is a discovery controller.
Upon the event reception, userspace will re-read the discovery
log page and will act upon changes as it sees fit.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We expect to grow a few of these flags for various purposes
so make them a proper enumeration.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The subsystem reset writes to a register, so we have to ensure the
device state is capable of handling that otherwise the driver may access
unmapped registers. Use the state machine to ensure the subsystem reset
doesn't try to write registers on a device already undergoing this type
of reset.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214771
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The passthrough commands already have this restriction, but the other
operations do not. Require the same capabilities for all users as all of
these operations, which include resets and rescans, can be disruptive.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The firmware revision can change on after a reset so copy the most
recent info each time instead of just the first time, otherwise the
sysfs firmware_rev entry may contain stale data.
Reported-by: Jeff Lien <jeff.lien@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If a reset occurs after the scan work attempts to issue a command, the
reset may quisce the admin queue, which blocks the scan work's command
from dispatching. The scan work will not be able to complete while the
queue is quiesced.
Meanwhile, the reset work will cancel all outstanding admin tags and
wait until all requests have transitioned to idle, which includes the
passthrough request. But the passthrough request won't be set to idle
until after the scan_work flushes, so we're deadlocked.
Fix this by handling the end effects after the request has been freed.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216354
Reported-by: Jonathan Derrick <Jonathan.Derrick@solidigm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Prepare for storing the blkcg information in the gendisk instead of
the request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blkcg_schedule_throttle as part of moving the
blk-cgroup infrastructure to be gendisk based. Remove the unused
!BLK_CGROUP stub while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blkg_destroy_all as part of moving the blk-cgroup
infrastructure to be gendisk based.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blk_throtl_cancel_bios as part of moving the
blk-cgroup infrastructure to be gendisk based.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blk_throtl_register_queue as part of moving the
blk-cgroup infrastructure to be gendisk based.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blk_throtl_init and blk_throtl_exit as part of moving
the blk-cgroup infrastructure to be gendisk based.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a local disk variable instead of retrieving the disk and
request_queue over and over by various means.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blk_iocost_init as part of moving the blk-cgroup
infrastructure to be gendisk based.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just directly dereference the disk name instead of going through multiple
hoops to find the same value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blk_iolatency_init as part of moving the blk-cgroup
infrastructure to be gendisk based.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-9-hch@lst.de
[axboe: missed inline for blk_iolatency_init() and !CONFIG_BLK_CGROUP_IOLATENCY]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blk_ioprio_init and blk_ioprio_exit as part of moving
the blk-cgroup infrastructure to be gendisk based.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving
the blk-cgroup infrastructure to be gendisk based. Also remove the
rather pointless kerneldoc comments for these internal functions with a
single caller each.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The combinations of an error check with an ERR_PTR return and a lookup
with a NULL return leads to ugly handling of the return values in the
callers. Just open coding the check and the lookup is much simpler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a fully inlined blkg_lookup as the extra two checks aren't going
to generated a lot more code vs the call to the slowpath routine, and
open code the hint update in the two callers that care.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use blkg_lookup instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just open code it in the only caller and drop the unused !BLK_CGROUP
stub.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_throtl_init fails, we need to call blk_ioprio_exit. Switch to
proper goto based unwinding to fix this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
High-performance NVMe devices usually support a large hw queues, which
ensures a 1:1 mapping of hctx and ctx. In this case there will be no
remote request, so we don't need to care about it.
Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1663731123-81536-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
"tg->has_rules" is extended to "tg->has_rules_iops/bps", thus bios that
don't need to be throttled can be checked accurately.
With this patch, bio will be throttled if:
1) Bio is read/write, and corresponding read/write iops limit exist.
2) If corresponding doesn't exist, corresponding bps limit exist and
bio is not throttled before.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921095309.1481289-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, "tg->has_rules" and "tg->flags & THROTL_TG_HAS_IOPS_LIMIT"
both try to bypass bios that don't need to be throttled, however, they are
a little redundant and both not perfect:
1) "tg->has_rules" only distinguish read and write, but not iops and bps
limit.
2) "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" only check if iops limit
exist, read and write is not distinguished, and bps limit is not
checked.
tg->has_rules will extended to distinguish bps and iops in the following
patch. There is no need to keep the flag.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921095309.1481289-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
START_USER_RECOVERY and END_USER_RECOVERY are two new control commands
to support user recovery feature.
After a crash, user should send START_USER_RECOVERY, it will:
(1) check if (a)current ublk_device is UBLK_S_DEV_QUIESCED which was
set by quiesce_work and (b)chardev is released
(2) reinit all ubqs, including:
(a) put the task_struct and reset ->ubq_daemon to NULL.
(b) reset all ublk_io.
(3) reset ub->mm to NULL.
Then, user should start a new process and send FETCH_REQ on each
ubq_daemon.
Finally, user should send END_USER_RECOVERY, it will:
(1) wait for all new ubq_daemons getting ready.
(2) update ublksrv_pid
(3) unquiesce the request queue and expect incoming ublk_queue_rq()
(4) convert ub's state to UBLK_S_DEV_LIVE
Note: we can handle STOP_DEV between START_USER_RECOVERY and
END_USER_RECOVERY. This is helpful to users who cannot start new process
after sending START_USER_RECOVERY ctrl-cmd.
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-7-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
UBLK_F_USER_RECOVERY_REISSUE implies that:
With a dying ubq_daemon, ublk_drv let monitor_work requeues rq issued to
userspace(ublksrv) before the ubq_daemon is dying.
UBLK_F_USER_RECOVERY_REISSUE is designed for backends which:
(1) tolerate double-write since ublk_drv may issue the same rq
twice.
(2) does not let frontend users get I/O error, such as read-only FS
and VM backend.
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-6-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With USER_RECOVERY feature enabled, the monitor_work schedules
quiesce_work after finding a dying ubq_daemon. The monitor_work
should also abort all rqs issued to userspace before the ubq_daemon is
dying. The quiesce_work's job is to:
(1) quiesce request queue.
(2) check if there is any INFLIGHT rq. If so, we retry until all these
rqs are requeued and become IDLE. These rqs should be requeued by
ublk_queue_rq(), task work, io_uring fallback wq or monitor_work.
(3) complete all ioucmds by calling io_uring_cmd_done(). We are safe to
do so because no ioucmd can be referenced now.
(5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for
recovery. This state is exposed to userspace by GET_DEV_INFO.
The driver can always handle STOP_DEV and cleanup everything no matter
ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED,
user can recover with new process.
Note: we do not change the default behavior with reocvery feature
disabled. monitor_work still schedules stop_work and abort inflight
rqs. And finally ublk_device is released.
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-5-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With recovery feature enabled, in ublk_queue_rq or task work
(in exit_task_work or fallback wq), we requeue rqs instead of
ending(aborting) them. Besides, No matter recovery feature is enabled
or disabled, we schedule monitor_work immediately.
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-4-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Define some macros for recovery feature.
UBLK_S_DEV_QUIESCED implies that ublk_device is quiesced
and is ready for recovery. This state can be observed by userspace.
UBLK_F_USER_RECOVERY implies that:
(1) ublk_drv enables recovery feature. It won't let monitor_work to
automatically abort rqs and release the device.
(2) With a dying ubq_daemon, ublk_drv ends(aborts) rqs issued to
userspace(ublksrv) before crash.
(3) With a dying ubq_daemon, in task work and ublk_queue_rq(),
ublk_drv requeues rqs.
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-3-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This check is not atomic. So with recovery feature, ubq_daemon may be
modified simultaneously by recovery task. Instead, check 'current' is
safe here because 'current' never changes.
Also add comment explaining this check, which is really important for
understanding recovery feature.
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-2-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD updates and fixes from Song:
"1. Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan.
2. Raid10 performance optimization, by Yu Kuai."
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: Fix spelling mistake in comments of r5l_log
md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d
md/raid10: convert resync_lock to use seqlock
md/raid10: fix improper BUG_ON() in raise_barrier()
md/raid10: prevent unnecessary calls to wake_up() in fast path
md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowait
md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()
md: Remove extra mddev_get() in md_seq_start()
md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()
md/raid5: Ensure stripe_fill happens on non-read IO with journal
md/raid5: Don't read ->active_stripes if it's not needed
md/raid5: Cleanup prototype of raid5_get_active_stripe()
md/raid5: Drop extern on function declarations in raid5.h
md/raid5: Refactor raid5_get_active_stripe()
md: Replace snprintf with scnprintf
md/raid10: fix compile warning
md/raid5: Fix spelling mistakes in comments
A complicated deadlock exists when using the journal and an elevated
group_thrtead_cnt. It was found with loop devices, but its not clear
whether it can be seen with real disks. The deadlock can occur simply
by writing data with an fio script.
When the deadlock occurs, multiple threads will hang in different ways:
1) The group threads will hang in the blk-wbt code with bios waiting to
be submitted to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
ops_run_io+0x46b/0x1a30
handle_stripe+0xcd3/0x36b0
handle_active_stripes.constprop.0+0x6f6/0xa60
raid5_do_work+0x177/0x330
Or:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
flush_deferred_bios+0x136/0x170
raid5_do_work+0x262/0x330
2) The r5l_reclaim thread will hang in the same way, submitting a
bio to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
submit_bio+0x3f/0xf0
md_super_write+0x12f/0x1b0
md_update_sb.part.0+0x7c6/0xff0
md_update_sb+0x30/0x60
r5l_do_reclaim+0x4f9/0x5e0
r5l_reclaim_thread+0x69/0x30b
However, before hanging, the MD_SB_CHANGE_PENDING flag will be
set for sb_flags in r5l_write_super_and_discard_space(). This
flag will never be cleared because the submit_bio() call never
returns.
3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
will do no processing on any pending stripes and re-set
STRIPE_HANDLE. This will cause the raid5d thread to enter an
infinite loop, constantly trying to handle the same stripes
stuck in the queue.
The raid5d thread has a blk_plug that holds a number of bios
that are also stuck waiting seeing the thread is in a loop
that never schedules. These bios have been accounted for by
blk-wbt thus preventing the other threads above from
continuing when they try to submit bios. --Deadlock.
To fix this, add the same wait_event() that is used in raid5_do_work()
to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
schedule and wait until the flag is cleared. The schedule action will
flush the plug which will allow the r5l_reclaim thread to continue,
thus preventing the deadlock.
However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING
from the same thread and can thus deadlock if the thread is put to
sleep. So avoid waiting if md_check_recovery() is being called in the
loop.
It's not clear when the deadlock was introduced, but the similar
wait_event() call in raid5_do_work() was added in 2017 by this
commit:
16d997b78b ("md/raid5: simplfy delaying of writes while metadata
is updated.")
Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
This patchset tries to avoid that two locks are held unconditionally
in hot path.
Test environment:
Architecture:
aarch64 Huawei KUNPENG 920
x86 Intel(R) Xeon(R) Platinum 8380
Raid10 initialize:
mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 \
/dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
Test cmd:
(task set -c 0-15) fio -name=0 -ioengine=libaio -direct=1 -\
group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers \
-filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 \
-rw=randread
Test result:
aarch64:
before this patchset: 3.2 GiB/s
bind node before this patchset: 6.9 Gib/s
after this patchset: 7.9 Gib/s
bind node after this patchset: 8.0 Gib/s
x86:(bind node is not tested yet)
before this patchset: 7.0 GiB/s
after this patchset : 9.3 GiB/s
Please noted that in the test machine, memory access latency is very bad
across nodes compare to local node in aarch64, which is why bandwidth
while bind node is much better.
Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier',
and io can't be dispatched until 'barrier' is dropped.
Since holding the 'barrier' is not common, convert 'resync_lock' to use
seqlock so that holding lock can be avoided in fast path.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-and-Tested-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
'conf->barrier' is protected by 'conf->resync_lock', reading
'conf->barrier' without holding the lock is wrong.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Currently, wake_up() is called unconditionally in fast path such as
raid10_make_request(), which will cause lock contention under high
concurrency:
raid10_make_request
wake_up
__wake_up_common_lock
spin_lock_irqsave
Improve performance by only call wake_up() if waitqueue is not empty
in allow_barrier() and raid10_make_request().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
For the case nowait in wait_barrier(), there is no point to increase
nr_waiting and then decrease it.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>