// SPDX-License-Identifier: GPL-2.0
/*
 * Block multiqueue core code
 *
 * Copyright (C) 2013-2014 Jens Axboe
 * Copyright (C) 2013-2014 Christoph Hellwig
 */
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
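The funnelling described above can be pictured with a small userspace sketch. It is not the kernel's mapping code: `toy_map_cpu_to_hwq` and its plain modulo policy are hypothetical, chosen only to show how a 1:1 mapping falls out when there are as many hardware queues as CPUs, and an N:M mapping otherwise.

```c
#include <assert.h>

/*
 * Hypothetical helper: pick a hardware submission queue for the software
 * queue owned by `cpu`. The modulo spread is an assumption made for
 * illustration; a real mapping may also take CPU topology into account.
 */
unsigned int toy_map_cpu_to_hwq(unsigned int cpu, unsigned int nr_hw_queues)
{
	assert(nr_hw_queues > 0);	/* at least one hardware queue */
	return cpu % nr_hw_queues;
}
```

With 8 CPUs and 8 hardware queues every CPU gets its own queue; with 8 CPUs and 4 queues, CPUs 1 and 5 share queue 1; with a single hardware queue, every CPU funnels into queue 0.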
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
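The per-request payload item in the list above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel implementation: `struct toy_request`, `toy_alloc_request` and `toy_rq_to_pdu` are hypothetical names, and the layout (driver data placed directly behind the request in one allocation) is one simple way to associate the memory with the request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a request; not the kernel's struct request. */
struct toy_request {
	_Alignas(max_align_t) int tag;	/* keeps the trailing payload aligned */
	size_t cmd_size;		/* payload size the driver asked for */
};

/* Allocate a zeroed request with cmd_size bytes of driver payload behind it. */
struct toy_request *toy_alloc_request(int tag, size_t cmd_size)
{
	struct toy_request *rq = calloc(1, sizeof(*rq) + cmd_size);

	if (!rq)
		return NULL;
	rq->tag = tag;
	rq->cmd_size = cmd_size;
	return rq;
}

/* The driver-private area starts right after the request itself. */
void *toy_rq_to_pdu(struct toy_request *rq)
{
	return rq + 1;
}
```

A driver in this sketch would pass the size of its command structure as `cmd_size` once, then cast the pointer returned by `toy_rq_to_pdu()` to that structure for every request it is handed. A single `free(rq)` releases both the request and its payload.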
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-integrity.h>
#include <linux/kmemleak.h>
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <linux/smp.h>
#include <linux/interrupt.h>
#include <linux/llist.h>
#include <linux/cpu.h>
#include <linux/cache.h>
#include <linux/sched/topology.h>
#include <linux/sched/signal.h>
#include <linux/delay.h>
#include <linux/crash_dump.h>
#include <linux/prefetch.h>
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
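The contiguity requirement described above can be illustrated with a simplified userspace check. It is a sketch under assumptions, not the blk-crypto code: `struct toy_bio` and `toy_dun_is_contiguous` are hypothetical, the data unit number is modelled as a single 64-bit value, and wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy bio: only the fields this sketch needs (all names hypothetical). */
struct toy_bio {
	uint64_t dun;		/* data unit number of the first data unit */
	uint32_t bytes;		/* payload length in bytes */
};

/*
 * Return true if `next` may follow `prev` in one request: its first data
 * unit number must be the one right after prev's last data unit.
 */
bool toy_dun_is_contiguous(const struct toy_bio *prev,
			   const struct toy_bio *next,
			   uint32_t data_unit_size)
{
	assert(data_unit_size > 0);
	assert(prev->bytes % data_unit_size == 0);
	return next->dun == prev->dun + prev->bytes / data_unit_size;
}
```

Because merged bios satisfy this check, a request only needs to remember the data unit number of its first bio; the number for any later data block follows from its offset within the request.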
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
#include <linux/blk-crypto.h>
#include <linux/part_stat.h>
#include <linux/sched/isolation.h>
#include <trace/events/block.h>

#include <linux/t10-pi.h>
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-debugfs.h"
#include "blk-pm.h"
#include "blk-stat.h"
#include "blk-mq-sched.h"
#include "blk-rq-qos.h"
static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
static DEFINE_PER_CPU(call_single_data_t, blk_cpu_csd);

static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
static void blk_mq_request_bypass_insert(struct request *rq,
		blk_insert_t flags);
static void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
		struct list_head *list);
static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
			 struct io_comp_batch *iob, unsigned int flags);

/*
 * Check if any of the ctx, dispatch list or elevator
 * have pending work in this hardware queue.
*/
|
2017-11-11 00:13:21 +08:00
|
|
|
static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2017-11-11 00:13:21 +08:00
|
|
|
return !list_empty_careful(&hctx->dispatch) ||
|
|
|
|
sbitmap_any_bit_set(&hctx->ctx_map) ||
|
2017-01-17 21:03:22 +08:00
|
|
|
blk_mq_sched_has_work(hctx);
|
2014-05-19 23:23:55 +08:00
|
|
|
}
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* Mark this ctx as having pending work in this hardware queue
|
|
|
|
*/
|
|
|
|
static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *ctx)
|
|
|
|
{
|
2018-10-30 03:13:29 +08:00
|
|
|
const int bit = ctx->index_hw[hctx->type];
|
|
|
|
|
|
|
|
if (!sbitmap_test_bit(&hctx->ctx_map, bit))
|
|
|
|
sbitmap_set_bit(&hctx->ctx_map, bit);
|
2014-05-19 23:23:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *ctx)
|
|
|
|
{
|
2018-10-30 03:13:29 +08:00
|
|
|
const int bit = ctx->index_hw[hctx->type];
|
|
|
|
|
|
|
|
sbitmap_clear_bit(&hctx->ctx_map, bit);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
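The three helpers above keep one bit per software queue in `hctx->ctx_map` and treat the hardware queue as busy when any bit is set or the dispatch list is non-empty. A minimal userspace sketch of that bookkeeping follows. It assumes a single `unsigned long` in place of the kernel's `sbitmap`, leaves out the elevator check, and every `toy_*` name is hypothetical rather than part of the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one hardware queue; not the kernel struct. */
struct toy_hctx {
	unsigned long ctx_map;	/* one bit per software queue mapped here */
	size_t nr_dispatch;	/* requests parked on the dispatch list */
};

#define TOY_MAX_CTX (sizeof(unsigned long) * 8)

/* Set the bit only if it is still clear, mirroring the test-then-set above. */
static void toy_mark_pending(struct toy_hctx *h, unsigned int bit)
{
	assert(bit < TOY_MAX_CTX);
	if (!(h->ctx_map & (1UL << bit)))
		h->ctx_map |= 1UL << bit;
}

/* Clear the bit unconditionally. */
static void toy_clear_pending(struct toy_hctx *h, unsigned int bit)
{
	assert(bit < TOY_MAX_CTX);
	h->ctx_map &= ~(1UL << bit);
}

/* Pending work exists if the dispatch list or any software queue has some. */
static bool toy_has_pending(const struct toy_hctx *h)
{
	return h->nr_dispatch != 0 || h->ctx_map != 0;
}
```

Marking the same bit twice is harmless, and a queue with an empty bitmap can still report pending work through its dispatch list.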
|
|
|
|
|
2017-08-09 07:51:45 +08:00
|
|
|
struct mq_inflight {
|
2020-11-24 16:36:54 +08:00
|
|
|
struct block_device *part;
|
2019-10-01 02:55:34 +08:00
|
|
|
unsigned int inflight[2];
|
2017-08-09 07:51:45 +08:00
|
|
|
};
|
|
|
|
|
2022-07-06 20:03:53 +08:00
|
|
|
static bool blk_mq_check_inflight(struct request *rq, void *priv)
|
2017-08-09 07:51:45 +08:00
|
|
|
{
|
|
|
|
struct mq_inflight *mi = priv;
|
|
|
|
|
2022-05-30 14:40:59 +08:00
|
|
|
if (rq->part && blk_do_io_stat(rq) &&
|
|
|
|
(!mi->part->bd_partno || rq->part == mi->part) &&
|
2020-12-02 19:11:45 +08:00
|
|
|
blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
|
2019-10-01 02:55:33 +08:00
|
|
|
mi->inflight[rq_data_dir(rq)]++;
|
2018-11-09 01:24:07 +08:00
|
|
|
|
|
|
|
return true;
|
2017-08-09 07:51:45 +08:00
|
|
|
}
|
|
|
|
|
2020-11-24 16:36:54 +08:00
|
|
|
unsigned int blk_mq_in_flight(struct request_queue *q,
|
|
|
|
struct block_device *part)
|
2017-08-09 07:51:45 +08:00
|
|
|
{
|
2019-10-01 02:55:34 +08:00
|
|
|
struct mq_inflight mi = { .part = part };
|
2017-08-09 07:51:45 +08:00
|
|
|
|
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
|
2018-12-07 00:41:21 +08:00
|
|
|
|
2019-10-01 02:55:34 +08:00
|
|
|
return mi.inflight[0] + mi.inflight[1];
|
2018-04-26 15:21:59 +08:00
|
|
|
}
|
|
|
|
|
2020-11-24 16:36:54 +08:00
|
|
|
void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
|
|
|
|
unsigned int inflight[2])
|
2018-04-26 15:21:59 +08:00
|
|
|
{
|
2019-10-01 02:55:34 +08:00
|
|
|
struct mq_inflight mi = { .part = part };
|
2018-04-26 15:21:59 +08:00
|
|
|
|
2019-10-01 02:55:33 +08:00
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
|
2019-10-01 02:55:34 +08:00
|
|
|
inflight[0] = mi.inflight[0];
|
|
|
|
inflight[1] = mi.inflight[1];
|
2018-04-26 15:21:59 +08:00
|
|
|
}
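The in-flight helpers above walk the busy tags with a callback and count matching requests in a two-slot array indexed by data direction. The sketch below models that pattern in plain C under simplifying assumptions: requests live in an ordinary array, a partition is an integer where 0 means the whole device, and the `rq->part` and `blk_do_io_stat()` checks are omitted. All `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_IDLE, TOY_IN_FLIGHT, TOY_COMPLETE };
enum toy_dir { TOY_READ = 0, TOY_WRITE = 1 };

/* Hypothetical request: just enough fields for the counting logic. */
struct toy_request {
	int partno;
	enum toy_dir dir;
	enum toy_state state;
};

struct toy_inflight {
	int partno;			/* 0 selects the whole device */
	unsigned int inflight[2];	/* [TOY_READ], [TOY_WRITE] */
};

typedef bool (*toy_iter_fn)(const struct toy_request *rq, void *priv);

/* Visit each request until the callback asks to stop by returning false. */
static void toy_busy_iter(const struct toy_request *rqs, size_t n,
			  toy_iter_fn fn, void *priv)
{
	for (size_t i = 0; i < n; i++)
		if (!fn(&rqs[i], priv))
			break;
}

/* Count a request if it matches the partition filter and is in flight. */
static bool toy_check_inflight(const struct toy_request *rq, void *priv)
{
	struct toy_inflight *mi = priv;

	if ((mi->partno == 0 || rq->partno == mi->partno) &&
	    rq->state == TOY_IN_FLIGHT)
		mi->inflight[rq->dir]++;
	return true;	/* always keep iterating */
}

/* Sum of in-flight reads and writes for the selected partition. */
static unsigned int toy_in_flight(const struct toy_request *rqs, size_t n,
				  int partno)
{
	struct toy_inflight mi = { .partno = partno };

	toy_busy_iter(rqs, n, toy_check_inflight, &mi);
	return mi.inflight[TOY_READ] + mi.inflight[TOY_WRITE];
}
```

Passing the accumulator through the opaque `priv` pointer keeps the iterator generic, so one walker can serve both the summed and the per-direction variant.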
|
|
|
|
|
2017-03-27 20:06:57 +08:00
|
|
|
void blk_freeze_queue_start(struct request_queue *q)
|
2013-12-26 21:31:35 +08:00
|
|
|
{
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_lock(&q->mq_freeze_lock);
|
|
|
|
if (++q->mq_freeze_depth == 1) {
|
2015-10-22 01:20:12 +08:00
|
|
|
percpu_ref_kill(&q->q_usage_counter);
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
2018-11-16 03:22:51 +08:00
|
|
|
if (queue_is_mq(q))
|
2017-11-10 02:49:53 +08:00
|
|
|
blk_mq_run_hw_queues(q, false);
|
2019-05-21 11:25:55 +08:00
|
|
|
} else {
|
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
2014-08-16 20:02:24 +08:00
|
|
|
}
|
2014-11-05 02:52:27 +08:00
|
|
|
}
|
2017-03-27 20:06:57 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
|
2014-11-05 02:52:27 +08:00
|
|
|
|
2017-03-02 03:22:10 +08:00
|
|
|
void blk_mq_freeze_queue_wait(struct request_queue *q)
|
2014-11-05 02:52:27 +08:00
|
|
|
{
|
2015-10-22 01:20:12 +08:00
|
|
|
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
|
2013-12-26 21:31:35 +08:00
|
|
|
}
|
2017-03-02 03:22:10 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
|
2013-12-26 21:31:35 +08:00
|
|
|
|
2017-03-02 03:22:11 +08:00
|
|
|
int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
|
|
|
|
unsigned long timeout)
|
|
|
|
{
|
|
|
|
return wait_event_timeout(q->mq_freeze_wq,
|
|
|
|
percpu_ref_is_zero(&q->q_usage_counter),
|
|
|
|
timeout);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
|
2013-12-26 21:31:35 +08:00
|
|
|
|
2014-11-05 02:52:27 +08:00
|
|
|
/*
|
|
|
|
* Guarantee no request is in use, so we can change any data structure of
|
|
|
|
* the queue afterward.
|
|
|
|
*/
|
2015-10-22 01:20:12 +08:00
|
|
|
void blk_freeze_queue(struct request_queue *q)
|
2014-11-05 02:52:27 +08:00
|
|
|
{
|
2015-10-22 01:20:12 +08:00
|
|
|
/*
|
|
|
|
* In the !blk_mq case we are only calling this to kill the
|
|
|
|
* q_usage_counter, otherwise this increases the freeze depth
|
|
|
|
* and waits for it to return to zero. For this reason there is
|
|
|
|
* no blk_unfreeze_queue(), and blk_freeze_queue() is not
|
|
|
|
* exported to drivers as the only user for unfreeze is blk_mq.
|
|
|
|
*/
|
2017-03-27 20:06:57 +08:00
|
|
|
blk_freeze_queue_start(q);
|
2014-11-05 02:52:27 +08:00
|
|
|
blk_mq_freeze_queue_wait(q);
|
|
|
|
}
|
2015-10-22 01:20:12 +08:00
|
|
|
|
|
|
|
void blk_mq_freeze_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* ...just an alias to keep freeze and unfreeze actions balanced
|
|
|
|
* in the blk_mq_* namespace
|
|
|
|
*/
|
|
|
|
blk_freeze_queue(q);
|
|
|
|
}
|
2015-01-03 06:05:12 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
|
2014-11-05 02:52:27 +08:00
|
|
|
|
2021-09-29 15:12:41 +08:00
|
|
|
void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_lock(&q->mq_freeze_lock);
|
2021-09-29 15:12:41 +08:00
|
|
|
if (force_atomic)
|
|
|
|
q->q_usage_counter.data->force_atomic = true;
|
2019-05-21 11:25:55 +08:00
|
|
|
q->mq_freeze_depth--;
|
|
|
|
WARN_ON_ONCE(q->mq_freeze_depth < 0);
|
|
|
|
if (!q->mq_freeze_depth) {
|
2018-09-27 05:01:08 +08:00
|
|
|
percpu_ref_resurrect(&q->q_usage_counter);
|
2013-10-24 16:20:05 +08:00
|
|
|
wake_up_all(&q->mq_freeze_wq);
|
2014-07-02 00:34:38 +08:00
|
|
|
}
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2021-09-29 15:12:41 +08:00
|
|
|
|
|
|
|
void blk_mq_unfreeze_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
__blk_mq_unfreeze_queue(q, false);
|
|
|
|
}
|
2014-12-20 08:54:14 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
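Freeze and unfreeze nest: only the first freeze kills `q_usage_counter`, and only the last unfreeze resurrects it and wakes waiters. The sketch below shows just that depth accounting. It is single-threaded (the kernel serializes with `mq_freeze_lock`), a boolean stands in for the per-cpu reference, and the `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue: a depth counter plus a flag for "accepting new I/O". */
struct toy_queue {
	int freeze_depth;
	bool accepting;		/* stands in for a live q_usage_counter */
};

/* Returns true only when this call actually started the freeze. */
static bool toy_freeze_start(struct toy_queue *q)
{
	if (++q->freeze_depth == 1) {
		q->accepting = false;	/* analogous to percpu_ref_kill() */
		return true;
	}
	return false;
}

/* Returns true only when this call ended the outermost freeze. */
static bool toy_unfreeze(struct toy_queue *q)
{
	assert(q->freeze_depth > 0);	/* unbalanced unfreeze is a caller bug */
	if (--q->freeze_depth == 0) {
		q->accepting = true;	/* analogous to percpu_ref_resurrect() */
		return true;
	}
	return false;
}
```

With two nested freezes, the queue stays frozen after the first unfreeze and resumes only after the second.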
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-06-22 01:55:47 +08:00
|
|
|
/*
|
|
|
|
* FIXME: replace the scsi_internal_device_*block_nowait() calls in the
|
|
|
|
* mpt3sas driver such that this function can be removed.
|
|
|
|
*/
|
|
|
|
void blk_mq_quiesce_queue_nowait(struct request_queue *q)
|
|
|
|
{
|
2021-10-14 16:17:10 +08:00
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&q->queue_lock, flags);
|
|
|
|
if (!q->quiesce_depth++)
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
|
|
|
|
spin_unlock_irqrestore(&q->queue_lock, flags);
|
2017-06-22 01:55:47 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
|
|
|
|
|
2016-11-03 00:09:51 +08:00
|
|
|
/**
|
2021-11-09 15:11:41 +08:00
|
|
|
* blk_mq_wait_quiesce_done() - wait until in-progress quiesce is done
|
2022-11-01 23:00:48 +08:00
|
|
|
* @set: tag_set to wait on
|
2016-11-03 00:09:51 +08:00
|
|
|
*
|
2021-11-09 15:11:41 +08:00
|
|
|
* Note: it is the driver's responsibility to make sure that quiesce has
|
2022-11-01 23:00:48 +08:00
|
|
|
* been started on one or more of the request_queues of the tag_set. This
|
|
|
|
* function only waits for the quiesce on those request_queues that had
|
|
|
|
* the quiesce flag set using blk_mq_quiesce_queue_nowait.
|
2016-11-03 00:09:51 +08:00
|
|
|
*/
|
2022-11-01 23:00:48 +08:00
|
|
|
void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
|
2016-11-03 00:09:51 +08:00
|
|
|
{
|
2022-11-01 23:00:48 +08:00
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING)
|
|
|
|
synchronize_srcu(set->srcu);
|
2021-12-03 21:15:32 +08:00
|
|
|
else
|
2016-11-03 00:09:51 +08:00
|
|
|
synchronize_rcu();
|
|
|
|
}
|
2021-11-09 15:11:41 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_wait_quiesce_done);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
|
|
|
|
* @q: request queue.
|
|
|
|
*
|
|
|
|
* Note: this function does not prevent the struct request end_io()
|
|
|
|
* callback function from being invoked. Once this function returns, no
|
|
|
|
* dispatch can happen until the queue is unquiesced via
|
|
|
|
* blk_mq_unquiesce_queue().
|
|
|
|
*/
|
|
|
|
void blk_mq_quiesce_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
blk_mq_quiesce_queue_nowait(q);
|
2022-11-01 23:00:46 +08:00
|
|
|
/* nothing to wait for non-mq queues */
|
|
|
|
if (queue_is_mq(q))
|
2022-11-01 23:00:48 +08:00
|
|
|
blk_mq_wait_quiesce_done(q->tag_set);
|
2021-11-09 15:11:41 +08:00
|
|
|
}
|
2016-11-03 00:09:51 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
|
|
|
|
|
2017-06-06 23:22:03 +08:00
|
|
|
/*
 * blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
 * @q: request queue.
 *
 * This function restores the queue to the state it was in before it was
 * quiesced by blk_mq_quiesce_queue().
 */
void blk_mq_unquiesce_queue(struct request_queue *q)
{
	unsigned long flags;
	bool run_queue = false;

	spin_lock_irqsave(&q->queue_lock, flags);
	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
		;
	} else if (!--q->quiesce_depth) {
		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
		run_queue = true;
	}
	spin_unlock_irqrestore(&q->queue_lock, flags);

	/* dispatch requests which are inserted during quiescing */
	if (run_queue)
		blk_mq_run_hw_queues(q, true);
}
EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);

void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		if (!blk_queue_skip_tagset_quiesce(q))
			blk_mq_quiesce_queue_nowait(q);
	}
	blk_mq_wait_quiesce_done(set);
	mutex_unlock(&set->tag_list_lock);
}
EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);

void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		if (!blk_queue_skip_tagset_quiesce(q))
			blk_mq_unquiesce_queue(q);
	}
	mutex_unlock(&set->tag_list_lock);
}
EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);

void blk_mq_wake_waiters(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;

	queue_for_each_hw_ctx(q, hctx, i)
		if (blk_mq_hw_queue_mapped(hctx))
			blk_mq_tag_wakeup_all(hctx->tags, true);
}

void blk_rq_init(struct request_queue *q, struct request *rq)
{
	memset(rq, 0, sizeof(*rq));

	INIT_LIST_HEAD(&rq->queuelist);
	rq->q = q;
	rq->__sector = (sector_t) -1;
	INIT_HLIST_NODE(&rq->hash);
	RB_CLEAR_NODE(&rq->rb_node);
	rq->tag = BLK_MQ_NO_TAG;
	rq->internal_tag = BLK_MQ_NO_TAG;
	rq->start_time_ns = blk_time_get_ns();
	rq->part = NULL;
	blk_crypto_rq_set_defaults(rq);
}
EXPORT_SYMBOL(blk_rq_init);

/* Set start and alloc time when the allocated request is actually used */
static inline void blk_mq_rq_time_init(struct request *rq, u64 alloc_time_ns)
{
	if (blk_mq_need_time_stamp(rq))
		rq->start_time_ns = blk_time_get_ns();
	else
		rq->start_time_ns = 0;

#ifdef CONFIG_BLK_RQ_ALLOC_TIME
	if (blk_queue_rq_alloc_time(rq->q))
		rq->alloc_time_ns = alloc_time_ns ?: rq->start_time_ns;
	else
		rq->alloc_time_ns = 0;
#endif
}

static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
		struct blk_mq_tags *tags, unsigned int tag)
{
	struct blk_mq_ctx *ctx = data->ctx;
	struct blk_mq_hw_ctx *hctx = data->hctx;
	struct request_queue *q = data->q;
	struct request *rq = tags->static_rqs[tag];

	rq->q = q;
	rq->mq_ctx = ctx;
	rq->mq_hctx = hctx;
	rq->cmd_flags = data->cmd_flags;

	if (data->flags & BLK_MQ_REQ_PM)
		data->rq_flags |= RQF_PM;
	if (blk_queue_io_stat(q))
		data->rq_flags |= RQF_IO_STAT;
	rq->rq_flags = data->rq_flags;

	if (data->rq_flags & RQF_SCHED_TAGS) {
		rq->tag = BLK_MQ_NO_TAG;
		rq->internal_tag = tag;
	} else {
		rq->tag = tag;
		rq->internal_tag = BLK_MQ_NO_TAG;
	}
	rq->timeout = 0;

	rq->part = NULL;
	rq->io_start_time_ns = 0;
	rq->stats_sectors = 0;
	rq->nr_phys_segments = 0;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
	rq->nr_integrity_segments = 0;
#endif
	rq->end_io = NULL;
	rq->end_io_data = NULL;

	blk_crypto_rq_set_defaults(rq);
	INIT_LIST_HEAD(&rq->queuelist);
	/* tag was already set */
	WRITE_ONCE(rq->deadline, 0);
	req_ref_set(rq, 1);

	if (rq->rq_flags & RQF_USE_SCHED) {
		struct elevator_queue *e = data->q->elevator;

		INIT_HLIST_NODE(&rq->hash);
		RB_CLEAR_NODE(&rq->rb_node);

		if (e->type->ops.prepare_request)
			e->type->ops.prepare_request(rq);
	}

	return rq;
}

static inline struct request *
__blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
{
	unsigned int tag, tag_offset;
	struct blk_mq_tags *tags;
	struct request *rq;
	unsigned long tag_mask;
	int i, nr = 0;

	tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
	if (unlikely(!tag_mask))
		return NULL;

	tags = blk_mq_tags_from_data(data);
	for (i = 0; tag_mask; i++) {
		if (!(tag_mask & (1UL << i)))
			continue;
		tag = tag_offset + i;
		prefetch(tags->static_rqs[tag]);
		tag_mask &= ~(1UL << i);
		rq = blk_mq_rq_ctx_init(data, tags, tag);
		rq_list_add(data->cached_rq, rq);
		nr++;
	}
	if (!(data->rq_flags & RQF_SCHED_TAGS))
		blk_mq_add_active_requests(data->hctx, nr);
	/* caller already holds a reference, add for remainder */
	percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
	data->nr_tags -= nr;

	return rq_list_pop(data->cached_rq);
}

static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
{
	struct request_queue *q = data->q;
	u64 alloc_time_ns = 0;
	struct request *rq;
	unsigned int tag;

	/* alloc_time includes depth and tag waits */
	if (blk_queue_rq_alloc_time(q))
		alloc_time_ns = blk_time_get_ns();

	if (data->cmd_flags & REQ_NOWAIT)
		data->flags |= BLK_MQ_REQ_NOWAIT;

	if (q->elevator) {
		/*
		 * All requests use scheduler tags when an I/O scheduler is
		 * enabled for the queue.
		 */
		data->rq_flags |= RQF_SCHED_TAGS;

		/*
		 * Flush/passthrough requests are special and go directly to the
		 * dispatch list.
		 */
		if ((data->cmd_flags & REQ_OP_MASK) != REQ_OP_FLUSH &&
		    !blk_op_is_passthrough(data->cmd_flags)) {
			struct elevator_mq_ops *ops = &q->elevator->type->ops;

			WARN_ON_ONCE(data->flags & BLK_MQ_REQ_RESERVED);

			data->rq_flags |= RQF_USE_SCHED;
			if (ops->limit_depth)
				ops->limit_depth(data->cmd_flags, data);
		}
	}

retry:
	data->ctx = blk_mq_get_ctx(q);
	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
	if (!(data->rq_flags & RQF_SCHED_TAGS))
		blk_mq_tag_busy(data->hctx);

	if (data->flags & BLK_MQ_REQ_RESERVED)
		data->rq_flags |= RQF_RESV;

	/*
	 * Try batched alloc if we want more than 1 tag.
	 */
	if (data->nr_tags > 1) {
		rq = __blk_mq_alloc_requests_batch(data);
		if (rq) {
			blk_mq_rq_time_init(rq, alloc_time_ns);
			return rq;
		}
		data->nr_tags = 1;
	}

	/*
	 * Waiting allocations only fail because of an inactive hctx. In that
	 * case just retry the hctx assignment and tag allocation as CPU hotplug
	 * should have migrated us to an online CPU by now.
	 */
	tag = blk_mq_get_tag(data);
	if (tag == BLK_MQ_NO_TAG) {
		if (data->flags & BLK_MQ_REQ_NOWAIT)
			return NULL;
		/*
		 * Give up the CPU and sleep for a short time to ensure
		 * that threads using a realtime scheduling class are
		 * migrated off the CPU, and thus off the hctx that is
		 * going away.
		 */
		msleep(3);
		goto retry;
	}

	if (!(data->rq_flags & RQF_SCHED_TAGS))
		blk_mq_inc_active_requests(data->hctx);
	rq = blk_mq_rq_ctx_init(data, blk_mq_tags_from_data(data), tag);
	blk_mq_rq_time_init(rq, alloc_time_ns);
	return rq;
}

static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
					    struct blk_plug *plug,
					    blk_opf_t opf,
					    blk_mq_req_flags_t flags)
{
	struct blk_mq_alloc_data data = {
		.q		= q,
		.flags		= flags,
		.cmd_flags	= opf,
		.nr_tags	= plug->nr_ios,
		.cached_rq	= &plug->cached_rq,
	};
	struct request *rq;

	if (blk_queue_enter(q, flags))
		return NULL;

	plug->nr_ios = 1;

	rq = __blk_mq_alloc_requests(&data);
	if (unlikely(!rq))
		blk_queue_exit(q);
	return rq;
}

static struct request *blk_mq_alloc_cached_request(struct request_queue *q,
						   blk_opf_t opf,
						   blk_mq_req_flags_t flags)
{
	struct blk_plug *plug = current->plug;
	struct request *rq;

	if (!plug)
		return NULL;

	if (rq_list_empty(plug->cached_rq)) {
		if (plug->nr_ios == 1)
			return NULL;
		rq = blk_mq_rq_cache_fill(q, plug, opf, flags);
		if (!rq)
			return NULL;
	} else {
		rq = rq_list_peek(&plug->cached_rq);
		if (!rq || rq->q != q)
			return NULL;

		if (blk_mq_get_hctx_type(opf) != rq->mq_hctx->type)
			return NULL;
		if (op_is_flush(rq->cmd_flags) != op_is_flush(opf))
			return NULL;

		plug->cached_rq = rq_list_next(rq);
		blk_mq_rq_time_init(rq, 0);
	}

	rq->cmd_flags = opf;
	INIT_LIST_HEAD(&rq->queuelist);
	return rq;
}

struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
|
|
|
|
blk_mq_req_flags_t flags)
|
|
|
|
{
|
|
|
|
struct request *rq;
|
|
|
|
|
|
|
|
rq = blk_mq_alloc_cached_request(q, opf, flags);
|
|
|
|
if (!rq) {
|
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.flags = flags,
|
|
|
|
.cmd_flags = opf,
|
|
|
|
.nr_tags = 1,
|
|
|
|
};
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = blk_queue_enter(q, flags);
|
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
|
|
|
rq = __blk_mq_alloc_requests(&data);
|
|
|
|
if (!rq)
|
|
|
|
goto out_queue_exit;
|
|
|
|
}
|
2016-07-19 17:31:50 +08:00
|
|
|
rq->__data_len = 0;
|
|
|
|
rq->__sector = (sector_t) -1;
|
|
|
|
rq->bio = rq->biotail = NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
return rq;
|
2020-05-17 02:27:58 +08:00
|
|
|
out_queue_exit:
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return ERR_PTR(-EWOULDBLOCK);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-05-09 23:36:49 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_request);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-06-21 02:15:39 +08:00
|
|
|
struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
|
2022-07-15 02:06:32 +08:00
|
|
|
blk_opf_t opf, blk_mq_req_flags_t flags, unsigned int hctx_idx)
|
2016-06-13 22:45:21 +08:00
|
|
|
{
|
2020-05-29 21:53:09 +08:00
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.flags = flags,
|
2022-07-15 02:06:32 +08:00
|
|
|
.cmd_flags = opf,
|
2021-10-06 20:34:11 +08:00
|
|
|
.nr_tags = 1,
|
2020-05-29 21:53:09 +08:00
|
|
|
};
|
2020-05-29 21:53:13 +08:00
|
|
|
u64 alloc_time_ns = 0;
|
2022-10-26 18:35:13 +08:00
|
|
|
struct request *rq;
|
2017-02-28 02:28:27 +08:00
|
|
|
unsigned int cpu;
|
2020-05-29 21:53:13 +08:00
|
|
|
unsigned int tag;
|
2016-06-13 22:45:21 +08:00
|
|
|
int ret;
|
|
|
|
|
2020-05-29 21:53:13 +08:00
|
|
|
/* alloc_time includes depth and tag waits */
|
|
|
|
if (blk_queue_rq_alloc_time(q))
|
2024-01-16 05:45:07 +08:00
|
|
|
alloc_time_ns = blk_time_get_ns();
|
2020-05-29 21:53:13 +08:00
|
|
|
|
2016-06-13 22:45:21 +08:00
|
|
|
/*
|
|
|
|
* If the tag allocator sleeps we could get an allocation for a
|
|
|
|
* different hardware context. No need to complicate the low level
|
|
|
|
* allocator for this for the rare use case of a command tied to
|
|
|
|
* a specific queue.
|
|
|
|
*/
|
2023-01-18 17:37:13 +08:00
|
|
|
if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)) ||
|
|
|
|
WARN_ON_ONCE(!(flags & BLK_MQ_REQ_RESERVED)))
|
2016-06-13 22:45:21 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
|
|
|
if (hctx_idx >= q->nr_hw_queues)
|
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
|
2017-11-10 02:49:58 +08:00
|
|
|
ret = blk_queue_enter(q, flags);
|
2016-06-13 22:45:21 +08:00
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
2016-09-24 00:25:48 +08:00
|
|
|
/*
|
|
|
|
* Check if the hardware context is actually mapped to anything.
|
|
|
|
* If not tell the caller that it should skip this queue.
|
|
|
|
*/
|
2020-05-17 02:27:58 +08:00
|
|
|
ret = -EXDEV;
|
2022-03-08 15:32:19 +08:00
|
|
|
data.hctx = xa_load(&q->hctx_table, hctx_idx);
|
2020-05-29 21:53:09 +08:00
|
|
|
if (!blk_mq_hw_queue_mapped(data.hctx))
|
2020-05-17 02:27:58 +08:00
|
|
|
goto out_queue_exit;
|
2020-05-29 21:53:09 +08:00
|
|
|
cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
|
2022-06-16 05:00:04 +08:00
|
|
|
if (cpu >= nr_cpu_ids)
|
|
|
|
goto out_queue_exit;
|
2020-05-29 21:53:09 +08:00
|
|
|
data.ctx = __blk_mq_get_ctx(q, cpu);
|
2016-06-13 22:45:21 +08:00
|
|
|
|
2023-05-18 13:31:01 +08:00
|
|
|
if (q->elevator)
|
|
|
|
data.rq_flags |= RQF_SCHED_TAGS;
|
2021-11-02 22:34:09 +08:00
|
|
|
else
|
2023-05-18 13:31:01 +08:00
|
|
|
blk_mq_tag_busy(data.hctx);
|
2020-05-29 21:53:13 +08:00
|
|
|
|
2022-07-06 20:03:50 +08:00
|
|
|
if (flags & BLK_MQ_REQ_RESERVED)
|
|
|
|
data.rq_flags |= RQF_RESV;
|
|
|
|
|
2020-05-17 02:27:58 +08:00
|
|
|
ret = -EWOULDBLOCK;
|
2020-05-29 21:53:13 +08:00
|
|
|
tag = blk_mq_get_tag(&data);
|
|
|
|
if (tag == BLK_MQ_NO_TAG)
|
2020-05-17 02:27:58 +08:00
|
|
|
goto out_queue_exit;
|
2023-09-13 23:16:12 +08:00
|
|
|
if (!(data.rq_flags & RQF_SCHED_TAGS))
|
|
|
|
blk_mq_inc_active_requests(data.hctx);
|
2023-07-10 18:55:16 +08:00
|
|
|
rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
|
|
|
|
blk_mq_rq_time_init(rq, alloc_time_ns);
|
2022-10-26 18:35:13 +08:00
|
|
|
rq->__data_len = 0;
|
|
|
|
rq->__sector = (sector_t) -1;
|
|
|
|
rq->bio = rq->biotail = NULL;
|
|
|
|
return rq;
|
2020-05-29 21:53:13 +08:00
|
|
|
|
2020-05-17 02:27:58 +08:00
|
|
|
out_queue_exit:
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return ERR_PTR(ret);
|
2016-06-13 22:45:21 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
|
|
|
|
|
2023-08-13 23:23:25 +08:00
|
|
|
static void blk_mq_finish_request(struct request *rq)
|
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2024-05-01 19:09:04 +08:00
|
|
|
blk_zone_finish_request(rq);
|
|
|
|
|
2023-08-13 23:23:25 +08:00
|
|
|
if (rq->rq_flags & RQF_USE_SCHED) {
|
|
|
|
q->elevator->type->ops.finish_request(rq);
|
|
|
|
/*
|
|
|
|
* For postflush request that may need to be
|
|
|
|
* completed twice, we should clear this flag
|
|
|
|
* to avoid double finish_request() on the rq.
|
|
|
|
*/
|
|
|
|
rq->rq_flags &= ~RQF_USE_SCHED;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
static void __blk_mq_free_request(struct request *rq)
|
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
2018-10-30 05:06:13 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2018-05-29 21:52:28 +08:00
|
|
|
const int sched_tag = rq->internal_tag;
|
|
|
|
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
|
|
|
blk_crypto_free_request(rq);
|
2018-09-27 05:01:10 +08:00
|
|
|
blk_pm_mark_last_busy(rq);
|
2018-10-30 05:06:13 +08:00
|
|
|
rq->mq_hctx = NULL;
|
2023-05-14 06:12:27 +08:00
|
|
|
|
2023-09-13 23:16:12 +08:00
|
|
|
if (rq->tag != BLK_MQ_NO_TAG) {
|
|
|
|
blk_mq_dec_active_requests(hctx);
|
2020-02-26 20:10:15 +08:00
|
|
|
blk_mq_put_tag(hctx->tags, ctx, rq->tag);
|
2023-09-13 23:16:12 +08:00
|
|
|
}
|
2020-05-29 21:53:12 +08:00
|
|
|
if (sched_tag != BLK_MQ_NO_TAG)
|
2020-02-26 20:10:15 +08:00
|
|
|
blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
|
2018-05-29 21:52:28 +08:00
|
|
|
blk_mq_sched_restart(hctx);
|
|
|
|
blk_queue_exit(q);
|
|
|
|
}
|
|
|
|
|
2017-06-17 00:15:22 +08:00
|
|
|
void blk_mq_free_request(struct request *rq)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
2017-06-17 00:15:22 +08:00
|
|
|
|
2023-08-13 23:23:25 +08:00
|
|
|
blk_mq_finish_request(rq);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-09-30 16:08:24 +08:00
|
|
|
if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
|
2021-08-16 21:46:24 +08:00
|
|
|
laptop_io_completion(q->disk->bdi);
|
2017-09-30 16:08:24 +08:00
|
|
|
|
2018-07-03 23:32:35 +08:00
|
|
|
rq_qos_done(q, rq);
|
2014-05-14 05:10:52 +08:00
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
2021-10-15 04:39:59 +08:00
|
|
|
if (req_ref_put_and_test(rq))
|
2018-05-29 21:52:28 +08:00
|
|
|
__blk_mq_free_request(rq);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-11-18 01:40:48 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_free_request);
|
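The free path in blk_mq_free_request() only calls __blk_mq_free_request() when req_ref_put_and_test() reports that the last reference was dropped. A minimal userspace sketch of that put-and-test pattern, assuming C11 atomics in place of the kernel's reference helpers; struct toy_request and both toy_* functions are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request: a refcount plus a flag that
 * records whether the final free step has run. */
struct toy_request {
	atomic_int ref;
	bool freed;
};

/* Drop one reference; return true only for the caller that dropped the
 * last one. atomic_fetch_sub() returns the value before the decrement. */
static bool toy_ref_put_and_test(struct toy_request *rq)
{
	assert(rq != NULL);
	return atomic_fetch_sub(&rq->ref, 1) == 1;
}

/* Mirrors the shape of the free path: release resources only once the
 * reference count has reached zero. */
static void toy_free_request(struct toy_request *rq)
{
	if (toy_ref_put_and_test(rq))
		rq->freed = true;
}
```

With two references held, the first toy_free_request() call leaves the request alive and the second one frees it.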
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-10-06 20:34:11 +08:00
|
|
|
void blk_mq_free_plug_rqs(struct blk_plug *plug)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-10-13 21:58:52 +08:00
|
|
|
struct request *rq;
|
2018-11-29 01:50:07 +08:00
|
|
|
|
2021-11-03 19:49:07 +08:00
|
|
|
while ((rq = rq_list_pop(&plug->cached_rq)) != NULL)
|
2021-10-06 20:34:11 +08:00
|
|
|
blk_mq_free_request(rq);
|
|
|
|
}
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
|
|
|
|
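The "up to 16 bytes" figure in the commit message above can be illustrated with two toy structs. This is only a sketch under stated assumptions: the struct and field names are hypothetical, not the real struct request members, and uint64_t stands in for the jiffies field as it would be sized on a 64-bit kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "before" layout: four separate timestamps. */
struct rq_times_before {
	uint64_t start_time;             /* jiffies, used for iostats */
	uint64_t io_start_time_ns;       /* ktime ns, used for blk-stats */
	uint64_t sched_start_time_ns;    /* second start time (cfq/bfq) */
	uint64_t sched_io_start_time_ns; /* second I/O start time (cfq/bfq) */
};

/* Hypothetical "after" layout: one start time and one I/O start time,
 * both in nanoseconds. */
struct rq_times_after {
	uint64_t start_time_ns;
	uint64_t io_start_time_ns;
};

/* Dropping two 8-byte fields saves 16 bytes in this sketch. */
static_assert(sizeof(struct rq_times_before) -
	      sizeof(struct rq_times_after) == 16,
	      "two 8-byte timestamps removed");
```

The actual saving depends on the kernel configuration, since the scheduler timestamps were only present with certain options enabled, which is why the commit message says "up to".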
2021-11-17 14:14:02 +08:00
|
|
|
void blk_dump_rq_flags(struct request *rq, char *msg)
|
|
|
|
{
|
|
|
|
printk(KERN_INFO "%s: dev %s: flags=%llx\n", msg,
|
2021-11-26 20:18:00 +08:00
|
|
|
rq->q->disk ? rq->q->disk->disk_name : "?",
|
2022-07-15 02:06:32 +08:00
|
|
|
(__force unsigned long long) rq->cmd_flags);
|
2021-11-17 14:14:02 +08:00
|
|
|
|
|
|
|
printk(KERN_INFO " sector %llu, nr/cnr %u/%u\n",
|
|
|
|
(unsigned long long)blk_rq_pos(rq),
|
|
|
|
blk_rq_sectors(rq), blk_rq_cur_sectors(rq));
|
|
|
|
printk(KERN_INFO " bio %p, biotail %p, len %u\n",
|
|
|
|
rq->bio, rq->biotail, blk_rq_bytes(rq));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_dump_rq_flags);
|
|
|
|
|
2021-10-14 23:17:01 +08:00
|
|
|
static void blk_account_io_completion(struct request *req, unsigned int bytes)
|
|
|
|
{
|
|
|
|
if (req->part && blk_do_io_stat(req)) {
|
|
|
|
const int sgrp = op_stat_group(req_op(req));
|
|
|
|
|
|
|
|
part_stat_lock();
|
|
|
|
part_stat_add(req->part, sectors[sgrp], bytes >> 9);
|
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-11-17 14:14:03 +08:00
|
|
|
static void blk_print_req_error(struct request *req, blk_status_t status)
|
|
|
|
{
|
|
|
|
printk_ratelimited(KERN_ERR
|
|
|
|
"%s error, dev %s, sector %llu op 0x%x:(%s) flags 0x%x "
|
|
|
|
"phys_seg %u prio class %u\n",
|
|
|
|
blk_status_to_str(status),
|
2021-11-26 20:18:00 +08:00
|
|
|
req->q->disk ? req->q->disk->disk_name : "?",
|
2022-07-15 02:06:32 +08:00
|
|
|
blk_rq_pos(req), (__force u32)req_op(req),
|
|
|
|
blk_op_str(req_op(req)),
|
|
|
|
(__force u32)(req->cmd_flags & ~REQ_OP_MASK),
|
2021-11-17 14:14:03 +08:00
|
|
|
req->nr_phys_segments,
|
|
|
|
IOPRIO_PRIO_CLASS(req->ioprio));
|
|
|
|
}

/*
 * Fully end IO on a request. Does not support partial completions, or
 * errors.
 */
static void blk_complete_request(struct request *req)
{
	const bool is_flush = (req->rq_flags & RQF_FLUSH_SEQ) != 0;
	int total_bytes = blk_rq_bytes(req);
	struct bio *bio = req->bio;

	trace_block_rq_complete(req, BLK_STS_OK, total_bytes);

	if (!bio)
		return;

#ifdef CONFIG_BLK_DEV_INTEGRITY
	if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ)
		req->q->integrity.profile->complete_fn(req, total_bytes);
#endif

	/*
	 * Upper layers may call blk_crypto_evict_key() anytime after the last
	 * bio_endio(). Therefore, the keyslot must be released before that.
	 */
	blk_crypto_rq_put_keyslot(req);

	blk_account_io_completion(req, total_bytes);

	do {
		struct bio *next = bio->bi_next;

		/* Completion has already been traced */
		bio_clear_flag(bio, BIO_TRACE_COMPLETION);

		blk_zone_update_request_bio(req, bio);

		if (!is_flush)
			bio_endio(bio);
		bio = next;
	} while (bio);

	/*
	 * Reset counters so that the request stacking driver
	 * can find how many bytes remain in the request
	 * later.
	 */
	if (!req->end_io) {
		req->bio = NULL;
		req->__data_len = 0;
	}
}

/**
 * blk_update_request - Complete multiple bytes without completing the request
 * @req:      the request being processed
 * @error:    block status code
 * @nr_bytes: number of bytes to complete for @req
 *
 * Description:
 *     Ends I/O on a number of bytes attached to @req, but doesn't complete
 *     the request structure even if @req doesn't have leftover.
 *     If @req has leftover, sets it up for the next range of segments.
 *
 *     Passing the result of blk_rq_bytes() as @nr_bytes guarantees
 *     %false return from this function.
 *
 * Note:
 *     The RQF_SPECIAL_PAYLOAD flag is ignored on purpose in this function
 *     except in the consistency check at the end of this function.
 *
 * Return:
 *     %false - this request doesn't have any more data
 *     %true  - this request has more data
 **/
bool blk_update_request(struct request *req, blk_status_t error,
		unsigned int nr_bytes)
{
	bool is_flush = req->rq_flags & RQF_FLUSH_SEQ;
	bool quiet = req->rq_flags & RQF_QUIET;
	int total_bytes;

	trace_block_rq_complete(req, error, nr_bytes);

	if (!req->bio)
		return false;

#ifdef CONFIG_BLK_DEV_INTEGRITY
	if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
	    error == BLK_STS_OK)
		req->q->integrity.profile->complete_fn(req, nr_bytes);
#endif

	/*
	 * Upper layers may call blk_crypto_evict_key() anytime after the last
	 * bio_endio(). Therefore, the keyslot must be released before that.
	 */
	if (blk_crypto_rq_has_keyslot(req) && nr_bytes >= blk_rq_bytes(req))
		__blk_crypto_rq_put_keyslot(req);

	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
	    !test_bit(GD_DEAD, &req->q->disk->state)) {
		blk_print_req_error(req, error);
		trace_block_rq_error(req, error, nr_bytes);
	}

	blk_account_io_completion(req, nr_bytes);

	total_bytes = 0;
	while (req->bio) {
		struct bio *bio = req->bio;
		unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);

		if (unlikely(error))
			bio->bi_status = error;

		if (bio_bytes == bio->bi_iter.bi_size) {
			req->bio = bio->bi_next;
		} else if (bio_is_zone_append(bio) && error == BLK_STS_OK) {
			/*
			 * Partial zone append completions cannot be supported
			 * as the BIO fragments may end up not being written
			 * sequentially.
			 */
			bio->bi_status = BLK_STS_IOERR;
		}

		/* Completion has already been traced */
		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
		if (unlikely(quiet))
			bio_set_flag(bio, BIO_QUIET);

		bio_advance(bio, bio_bytes);

		/* Don't actually finish bio if it's part of flush sequence */
		if (!bio->bi_iter.bi_size) {
			blk_zone_update_request_bio(req, bio);
			if (!is_flush)
				bio_endio(bio);
		}

		total_bytes += bio_bytes;
		nr_bytes -= bio_bytes;

		if (!nr_bytes)
			break;
	}

	/*
	 * completely done
	 */
	if (!req->bio) {
		/*
		 * Reset counters so that the request stacking driver
		 * can find how many bytes remain in the request
		 * later.
		 */
		req->__data_len = 0;
		return false;
	}

	req->__data_len -= total_bytes;

	/* update sector only for requests with clear definition of sector */
	if (!blk_rq_is_passthrough(req))
		req->__sector += total_bytes >> 9;

	/* mixed attributes always follow the first bio */
	if (req->rq_flags & RQF_MIXED_MERGE) {
		req->cmd_flags &= ~REQ_FAILFAST_MASK;
		req->cmd_flags |= req->bio->bi_opf & REQ_FAILFAST_MASK;
	}

	if (!(req->rq_flags & RQF_SPECIAL_PAYLOAD)) {
		/*
		 * If total number of sectors is less than the first segment
		 * size, something has gone terribly wrong.
		 */
		if (blk_rq_bytes(req) < blk_rq_cur_bytes(req)) {
			blk_dump_rq_flags(req, "request botched");
			req->__data_len = blk_rq_cur_bytes(req);
		}

		/* recalculate the number of segments */
		req->nr_phys_segments = blk_recalc_rq_segments(req);
	}

	return true;
}
EXPORT_SYMBOL_GPL(blk_update_request);
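
/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the partial-completion walk that blk_update_request() performs
 * over a request's bio chain. The model_bio and model_req types and the
 * model_update_request() function are hypothetical stand-ins for struct bio,
 * struct request and blk_update_request(); error handling, tracing, zone
 * append and accounting are deliberately left out. As in the original, the
 * return value is true while data remains and false once the request is done.
 */
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio. */
struct model_bio {
	unsigned int size;		/* bytes still outstanding in this bio */
	struct model_bio *next;		/* next bio in the request's chain */
	bool ended;			/* set where the kernel would call bio_endio() */
};

/* Hypothetical stand-in for struct request. */
struct model_req {
	struct model_bio *bio;		/* first bio with outstanding bytes */
	unsigned int data_len;		/* bytes still outstanding in the request */
};

/* Returns true while the request still has data, false once it is done. */
static bool model_update_request(struct model_req *req, unsigned int nr_bytes)
{
	unsigned int total_bytes = 0;

	if (!req->bio)
		return false;

	while (req->bio) {
		struct model_bio *bio = req->bio;
		unsigned int bio_bytes = bio->size < nr_bytes ? bio->size : nr_bytes;

		/* A fully covered bio is unlinked from the request. */
		if (bio_bytes == bio->size)
			req->bio = bio->next;

		bio->size -= bio_bytes;		/* models bio_advance() */
		if (!bio->size)
			bio->ended = true;

		total_bytes += bio_bytes;
		nr_bytes -= bio_bytes;
		if (!nr_bytes)
			break;
	}

	/* Completely done: reset the remaining length. */
	if (!req->bio) {
		req->data_len = 0;
		return false;
	}

	req->data_len -= total_bytes;
	return true;
}
```
/*
 * With two 4096-byte bios, completing 6144 bytes ends the first bio, leaves
 * 2048 bytes in the second and returns true; completing the remaining 2048
 * bytes returns false.
 */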

static inline void blk_account_io_done(struct request *req, u64 now)
{
	trace_block_io_done(req);

	/*
	 * Account IO completion. flush_rq isn't accounted as a
	 * normal IO on queueing nor completion. Accounting the
	 * containing request is enough.
	 */
	if (blk_do_io_stat(req) && req->part &&
	    !(req->rq_flags & RQF_FLUSH_SEQ)) {
		const int sgrp = op_stat_group(req_op(req));

		part_stat_lock();
		update_io_ticks(req->part, jiffies, true);
		part_stat_inc(req->part, ios[sgrp]);
		part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
block: support to account io_ticks precisely
Currently, io_ticks is accounted based on sampling, specifically
update_io_ticks() will always account io_ticks by 1 jiffies from
bdev_start_io_acct()/blk_account_io_start(), and the result can be
inaccurate, for example(HZ is 250):
Test script:
fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
Test result: util is about 90%, while the disk is really idle.
This behaviour is introduced by commit 5b18b5a73760 ("block: delete
part_round_stats and switch to less precise counting"), however, there
was a key point that is missed that this patch also improve performance
a lot:
Before the commit:
part_round_stats:
if (part->stamp != now)
stats |= 1;
part_in_flight()
-> there can be lots of task here in 1 jiffies.
part_round_stats_single()
__part_stat_add()
part->stamp = now;
After the commit:
update_io_ticks:
stamp = part->bd_stamp;
if (time_after(now, stamp))
if (try_cmpxchg())
__part_stat_add()
-> only one task can reach here in 1 jiffies.
Hence in order to account io_ticks precisely, we only need to know if
there are IO inflight at most once in one jiffies. Noted that for
rq-based device, iterating tags should not be used here because
'tags->lock' is grabbed in blk_mq_find_and_get_req(), hence
part_stat_lock_inc/dec() and part_in_flight() is used to trace inflight.
The additional overhead is quite little:
- per cpu add/dec for each IO for rq-based device;
- per cpu sum for each jiffies;
And it's verified by null-blk that there are no performance degration
under heavy IO pressure.
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 20:37:16 +08:00
		part_stat_local_dec(req->part,
				    in_flight[op_is_write(req_op(req))]);
		part_stat_unlock();
	}
}
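
/*
 * Illustration only, not part of the kernel source: a rough userspace sketch
 * of the io_ticks accounting idea described in the "block: support to account
 * io_ticks precisely" commit message. It assumes that update_io_ticks(), which
 * is not shown in this file, lets at most one caller per jiffy advance the
 * stamp and adds the elapsed jiffies only when the call ends an I/O or other
 * I/O is in flight. All model_* names are hypothetical, and apart from the
 * stamp the model is single-threaded.
 */
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-partition accounting state. */
struct model_part {
	_Atomic unsigned long stamp;	/* last jiffy at which io_ticks was updated */
	unsigned long io_ticks;		/* accumulated busy time, in jiffies */
	long in_flight;			/* requests started but not yet completed */
};

static void model_update_io_ticks(struct model_part *p, unsigned long now,
				  bool end)
{
	unsigned long stamp = atomic_load(&p->stamp);

	/* At most one caller per jiffy gets past the compare-and-swap. */
	if ((long)(now - stamp) > 0 &&
	    atomic_compare_exchange_strong(&p->stamp, &stamp, now)) {
		/* Count the gap only if the device was busy during it. */
		if (end || p->in_flight > 0)
			p->io_ticks += now - stamp;
	}
}

/* Mirrors the call order in blk_account_io_start(): ticks, then increment. */
static void model_io_start(struct model_part *p, unsigned long now)
{
	model_update_io_ticks(p, now, false);
	p->in_flight++;
}

/* Mirrors the call order in blk_account_io_done(): ticks, then decrement. */
static void model_io_done(struct model_part *p, unsigned long now)
{
	model_update_io_ticks(p, now, true);
	p->in_flight--;
}
```
/*
 * In this model, an I/O running from jiffy 10 to 14 followed by one running
 * from jiffy 100 to 101 accumulates 5 ticks; the idle gap between them is not
 * counted.
 */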

static inline void blk_account_io_start(struct request *req)
{
	trace_block_io_start(req);

	if (blk_do_io_stat(req)) {
		/*
		 * All non-passthrough requests are created from a bio with one
		 * exception: when a flush command that is part of a flush sequence
		 * generated by the state machine in blk-flush.c is cloned onto the
		 * lower device by dm-multipath we can get here without a bio.
		 */
		if (req->bio)
			req->part = req->bio->bi_bdev;
		else
			req->part = req->q->disk->part0;

		part_stat_lock();
		update_io_ticks(req->part, jiffies, false);
		part_stat_local_inc(req->part,
				    in_flight[op_is_write(req_op(req))]);
		part_stat_unlock();
	}
}

static inline void __blk_mq_end_request_acct(struct request *rq, u64 now)
{
	if (rq->rq_flags & RQF_STATS)
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
		blk_stat_add(rq, now);

	blk_mq_sched_completed_request(rq, now);
	blk_account_io_done(rq, now);
}

inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
{
	if (blk_mq_need_time_stamp(rq))
		__blk_mq_end_request_acct(rq, blk_time_get_ns());

	blk_mq_finish_request(rq);

	if (rq->end_io) {
		rq_qos_done(rq->q, rq);
		if (rq->end_io(rq, error) == RQ_END_IO_FREE)
			blk_mq_free_request(rq);
	} else {
		blk_mq_free_request(rq);
	}
}
EXPORT_SYMBOL(__blk_mq_end_request);

void blk_mq_end_request(struct request *rq, blk_status_t error)
{
	if (blk_update_request(rq, error, blk_rq_bytes(rq)))
		BUG();
	__blk_mq_end_request(rq, error);
}
EXPORT_SYMBOL(blk_mq_end_request);

#define TAG_COMP_BATCH		32

static inline void blk_mq_flush_tag_batch(struct blk_mq_hw_ctx *hctx,
					  int *tag_array, int nr_tags)
{
	struct request_queue *q = hctx->queue;

	blk_mq_sub_active_requests(hctx, nr_tags);

	blk_mq_put_tags(hctx->tags, tag_array, nr_tags);
	percpu_ref_put_many(&q->q_usage_counter, nr_tags);
}

void blk_mq_end_request_batch(struct io_comp_batch *iob)
{
	int tags[TAG_COMP_BATCH], nr_tags = 0;
	struct blk_mq_hw_ctx *cur_hctx = NULL;
	struct request *rq;
	u64 now = 0;

	if (iob->need_ts)
		now = blk_time_get_ns();

	while ((rq = rq_list_pop(&iob->req_list)) != NULL) {
		prefetch(rq->bio);
		prefetch(rq->rq_next);

		blk_complete_request(rq);
		if (iob->need_ts)
			__blk_mq_end_request_acct(rq, now);

		blk_mq_finish_request(rq);

		rq_qos_done(rq->q, rq);

		/*
		 * If end_io handler returns NONE, then it still has
		 * ownership of the request.
		 */
		if (rq->end_io && rq->end_io(rq, 0) == RQ_END_IO_NONE)
			continue;

		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
		if (!req_ref_put_and_test(rq))
			continue;

		blk_crypto_free_request(rq);
		blk_pm_mark_last_busy(rq);

		if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
			if (cur_hctx)
				blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
			nr_tags = 0;
			cur_hctx = rq->mq_hctx;
		}
		tags[nr_tags++] = rq->tag;
	}

	if (nr_tags)
		blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
}
EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);
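
/*
 * Illustration only, not part of the kernel source: a small userspace sketch
 * of the tag batching pattern used by blk_mq_end_request_batch(). Tags are
 * collected while consecutive requests share a hardware queue, and the batch
 * is flushed when the array is full or the queue changes. The model_* names
 * and MODEL_BATCH (a stand-in for TAG_COMP_BATCH, set to 4 to keep the
 * example short) are hypothetical; the integer hctx field replaces the
 * struct blk_mq_hw_ctx pointer, with -1 playing the role of NULL.
 */
```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BATCH 4	/* stands in for TAG_COMP_BATCH (32 in the source) */

/* Hypothetical stand-in for the fields of struct request used here. */
struct model_rq {
	int hctx;	/* hardware queue the request belongs to */
	int tag;	/* tag to release for this request */
};

static int model_flush_calls;	/* how many times a batch was flushed */
static int model_tags_freed;	/* total tags released across all flushes */

/* Models blk_mq_flush_tag_batch(): release nr_tags tags in one call. */
static void model_flush(int hctx, const int *tags, int nr_tags)
{
	(void)hctx;
	(void)tags;
	model_flush_calls++;
	model_tags_freed += nr_tags;
}

static void model_end_batch(const struct model_rq *rqs, size_t n)
{
	int tags[MODEL_BATCH], nr_tags = 0;
	int cur_hctx = -1;

	for (size_t i = 0; i < n; i++) {
		/* Flush when the array is full or the queue changes. */
		if (nr_tags == MODEL_BATCH || cur_hctx != rqs[i].hctx) {
			if (cur_hctx != -1)
				model_flush(cur_hctx, tags, nr_tags);
			nr_tags = 0;
			cur_hctx = rqs[i].hctx;
		}
		tags[nr_tags++] = rqs[i].tag;
	}

	/* Release whatever is left in the final, possibly partial, batch. */
	if (nr_tags)
		model_flush(cur_hctx, tags, nr_tags);
}
```
/*
 * Five requests on queue 0 followed by two on queue 1 produce three flushes
 * in this model (4 tags, 1 tag, 2 tags), releasing all seven tags.
 */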
|
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
static void blk_complete_reqs(struct llist_head *list)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-01-24 04:10:27 +08:00
|
|
|
struct llist_node *entry = llist_reverse_order(llist_del_all(list));
|
|
|
|
struct request *rq, *next;
|
2020-06-11 14:44:41 +08:00
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
llist_for_each_entry_safe(rq, next, entry, ipi_list)
|
2020-06-11 14:44:41 +08:00
|
|
|
rq->q->mq_ops->complete(rq);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
static __latent_entropy void blk_done_softirq(struct softirq_action *h)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-01-24 04:10:27 +08:00
|
|
|
blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
|
2020-06-11 14:44:42 +08:00
|
|
|
}
|
|
|
|
|
2020-06-11 14:44:41 +08:00
|
|
|
static int blk_softirq_cpu_dead(unsigned int cpu)
|
|
|
|
{
|
2021-01-24 04:10:27 +08:00
|
|
|
blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
|
2020-06-11 14:44:41 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-06-11 14:44:50 +08:00
|
|
|
static void __blk_mq_complete_request_remote(void *data)
|
2020-06-11 14:44:41 +08:00
|
|
|
{
|
2021-01-24 04:10:27 +08:00
|
|
|
__raise_softirq_irqoff(BLOCK_SOFTIRQ);
|
2020-06-11 14:44:41 +08:00
|
|
|
}
|
|
|
|
|
2020-06-11 14:44:49 +08:00
|
|
|
static inline bool blk_mq_complete_need_ipi(struct request *rq)
|
|
|
|
{
|
|
|
|
int cpu = raw_smp_processor_id();
|
|
|
|
|
|
|
|
if (!IS_ENABLED(CONFIG_SMP) ||
|
|
|
|
!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
|
|
|
|
return false;
|
2020-12-05 03:13:54 +08:00
|
|
|
/*
|
|
|
|
* With force threaded interrupts enabled, raising softirq from an SMP
|
|
|
|
* function call will always result in waking the ksoftirqd thread.
|
|
|
|
* This is probably worse than completing the request on a different
|
|
|
|
* cache domain.
|
|
|
|
*/
|
2021-06-03 02:03:38 +08:00
|
|
|
if (force_irqthreads())
|
2020-12-05 03:13:54 +08:00
|
|
|
return false;
|
2020-06-11 14:44:49 +08:00
|
|
|
|
2024-02-23 23:57:49 +08:00
|
|
|
/* same CPU or cache domain and capacity? Complete locally */
|
2020-06-11 14:44:49 +08:00
|
|
|
if (cpu == rq->mq_ctx->cpu ||
|
|
|
|
(!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
|
2024-02-23 23:57:49 +08:00
|
|
|
cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
|
|
|
|
cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
|
2020-06-11 14:44:49 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
/* don't try to IPI to an offline CPU */
|
|
|
|
return cpu_online(rq->mq_ctx->cpu);
|
|
|
|
}
|
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
static void blk_mq_complete_send_ipi(struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned int cpu;
|
|
|
|
|
|
|
|
cpu = rq->mq_ctx->cpu;
|
2023-07-17 12:00:55 +08:00
|
|
|
if (llist_add(&rq->ipi_list, &per_cpu(blk_cpu_done, cpu)))
|
|
|
|
smp_call_function_single_async(cpu, &per_cpu(blk_cpu_csd, cpu));
|
2021-01-24 04:10:27 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_raise_softirq(struct request *rq)
|
|
|
|
{
|
|
|
|
struct llist_head *list;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
list = this_cpu_ptr(&blk_cpu_done);
|
|
|
|
if (llist_add(&rq->ipi_list, list))
|
|
|
|
raise_softirq(BLOCK_SOFTIRQ);
|
|
|
|
preempt_enable();
|
|
|
|
}
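The completion helpers above rely on one pattern: llist_add() reports whether the list was empty before the add, so only the first producer raises the softirq or sends the IPI, and the consumer later takes the whole list at once and reverses it to restore submission order. The following is a rough userspace sketch of that pattern using C11 atomics. The sk_* names are hypothetical and this is not the kernel's llist implementation; it only mirrors the behaviour the code above depends on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node and list head, loosely modelled on a lock-less list. */
struct sk_node { struct sk_node *next; int id; };
struct sk_head { _Atomic(struct sk_node *) first; };

/* Push a node; returns true if the list was empty before the push. */
static bool sk_llist_add(struct sk_node *n, struct sk_head *h)
{
	struct sk_node *old = atomic_load(&h->first);

	do {
		n->next = old; /* old is refreshed by a failed exchange */
	} while (!atomic_compare_exchange_weak(&h->first, &old, n));
	return old == NULL;
}

/* Detach the whole list in one step (newest entry first). */
static struct sk_node *sk_llist_del_all(struct sk_head *h)
{
	return atomic_exchange(&h->first, (struct sk_node *)NULL);
}

/* Reverse a detached list so entries come out oldest first. */
static struct sk_node *sk_llist_reverse(struct sk_node *n)
{
	struct sk_node *prev = NULL;

	while (n) {
		struct sk_node *next = n->next;

		n->next = prev;
		prev = n;
		n = next;
	}
	return prev;
}

/* Consumer side: drain everything, writing ids in submission order. */
static int sk_drain(struct sk_head *h, int *out, int cap)
{
	struct sk_node *n = sk_llist_reverse(sk_llist_del_all(h));
	int count = 0;

	while (n && count < cap) {
		struct sk_node *next = n->next; /* read before the node is reused */

		out[count++] = n->id;
		n = next;
	}
	return count;
}
```

In this sketch only the first of several back-to-back adds returns true, which is the point at which a producer would kick the consumer; later adds simply ride along on the already pending notification.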
|
|
|
|
|
2020-06-11 14:44:50 +08:00
|
|
|
bool blk_mq_complete_request_remote(struct request *rq)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2018-11-27 00:54:30 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
|
2018-09-28 16:42:20 +08:00
|
|
|
|
2018-11-19 07:15:35 +08:00
|
|
|
/*
|
2022-09-21 11:32:03 +08:00
|
|
|
* For a request whose hctx has only one ctx mapping,
|
|
|
|
* or a polled request, always complete locally,
|
|
|
|
* it's pointless to redirect the completion.
|
2018-11-19 07:15:35 +08:00
|
|
|
*/
|
2023-06-14 08:25:29 +08:00
|
|
|
if ((rq->mq_hctx->nr_ctx == 1 &&
|
|
|
|
rq->mq_ctx->cpu == raw_smp_processor_id()) ||
|
|
|
|
rq->cmd_flags & REQ_POLLED)
|
2020-06-11 14:44:50 +08:00
|
|
|
return false;
|
2014-04-25 17:32:53 +08:00
|
|
|
|
2020-06-11 14:44:49 +08:00
|
|
|
if (blk_mq_complete_need_ipi(rq)) {
|
2021-01-24 04:10:27 +08:00
|
|
|
blk_mq_complete_send_ipi(rq);
|
|
|
|
return true;
|
2014-01-09 01:33:37 +08:00
|
|
|
}
|
2020-06-11 14:44:50 +08:00
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
if (rq->q->nr_hw_queues == 1) {
|
|
|
|
blk_mq_raise_softirq(rq);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
2020-06-11 14:44:50 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_mq_complete_request - end I/O on a request
|
|
|
|
* @rq: the request being processed
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Complete a request by scheduling the ->complete operation.
|
|
|
|
**/
|
|
|
|
void blk_mq_complete_request(struct request *rq)
|
|
|
|
{
|
|
|
|
if (!blk_mq_complete_request_remote(rq))
|
|
|
|
rq->q->mq_ops->complete(rq);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2020-06-11 14:44:47 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_complete_request);
|
2014-02-10 19:24:38 +08:00
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_start_request - Start processing a request
|
|
|
|
* @rq: Pointer to request to be started
|
|
|
|
*
|
|
|
|
* Function used by device drivers to notify the block layer that a request
|
|
|
|
* is going to be processed now, so blk layer can do proper initializations
|
|
|
|
* such as starting the timeout timer.
|
|
|
|
*/
|
2014-09-14 07:40:09 +08:00
|
|
|
void blk_mq_start_request(struct request *rq)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2020-12-04 00:21:39 +08:00
|
|
|
trace_block_rq_issue(rq);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2023-11-24 03:03:31 +08:00
|
|
|
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags) &&
|
|
|
|
!blk_rq_is_passthrough(rq)) {
|
2024-01-16 05:45:07 +08:00
|
|
|
rq->io_start_time_ns = blk_time_get_ns();
|
2019-05-21 15:59:03 +08:00
|
|
|
rq->stats_sectors = blk_rq_sectors(rq);
|
2016-11-08 12:32:37 +08:00
|
|
|
rq->rq_flags |= RQF_STATS;
|
2018-07-03 23:32:35 +08:00
|
|
|
rq_qos_issue(q, rq);
|
2016-11-08 12:32:37 +08:00
|
|
|
}
|
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
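The four numbered steps in the commit message above can be sketched in single-threaded C as follows. This is an illustration under stated assumptions, not the patch's code: every sk_* name is hypothetical, the seqcount that keeps the state and deadline coherent is omitted, and the RCU wait between marking and terminating is represented only by the order of the calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical; they are not the kernel's structures. */
enum sk_state { SK_IDLE = 0, SK_IN_FLIGHT = 1, SK_COMPLETE = 2 };

#define SK_STATE_BITS 2
#define SK_STATE_MASK ((UINT64_C(1) << SK_STATE_BITS) - 1)

struct sk_rq {
	uint64_t gstate;         /* generation in the high bits, state in the low bits */
	uint64_t aborted_gstate; /* written only by the timeout path */
	uint64_t deadline;
};

static uint64_t sk_pack(uint64_t gen, enum sk_state st)
{
	return (gen << SK_STATE_BITS) | (uint64_t)st;
}

static void sk_init(struct sk_rq *rq)
{
	rq->gstate = sk_pack(0, SK_IDLE);
	rq->aborted_gstate = sk_pack(0, SK_IDLE); /* never equals an in-flight value */
	rq->deadline = 0;
}

/* Step 1: going in-flight bumps the generation and sets the deadline. */
static void sk_start(struct sk_rq *rq, uint64_t now, uint64_t timeout)
{
	uint64_t gen = (rq->gstate >> SK_STATE_BITS) + 1;

	rq->gstate = sk_pack(gen, SK_IN_FLIGHT);
	rq->deadline = now + timeout;
}

/* Step 2: on a timeout verdict, record which instance is being aborted. */
static bool sk_timeout_mark(struct sk_rq *rq, uint64_t now)
{
	if ((rq->gstate & SK_STATE_MASK) != SK_IN_FLIGHT || now < rq->deadline)
		return false;
	rq->aborted_gstate = rq->gstate;
	return true;
}

/* Step 3: completion yields if this exact instance was marked for abort. */
static bool sk_complete(struct sk_rq *rq)
{
	if (rq->gstate == rq->aborted_gstate)
		return false;
	rq->gstate = sk_pack(rq->gstate >> SK_STATE_BITS, SK_COMPLETE);
	return true;
}

/* Step 4: after the (omitted) RCU wait, terminate only a matching instance. */
static bool sk_timeout_terminate(const struct sk_rq *rq)
{
	return rq->gstate == rq->aborted_gstate;
}

/* Recycling returns the request to idle but keeps its generation. */
static void sk_free(struct sk_rq *rq)
{
	rq->gstate = sk_pack(rq->gstate >> SK_STATE_BITS, SK_IDLE);
}
```

Because the generation is part of the compared value, an abort mark left over from an earlier instance of the request never matches a later, recycled instance. That mismatch is what lets the timeout path in this sketch avoid terminating the wrong instance.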
2018-01-10 00:29:48 +08:00
|
|
|
WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
|
2014-09-17 00:37:37 +08:00
|
|
|
|
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
blk_add_timer(rq);
|
2018-05-29 21:52:28 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
|
2023-09-13 23:16:15 +08:00
|
|
|
rq->mq_hctx->tags->rqs[rq->tag] = rq;
|
2014-02-12 00:27:14 +08:00
|
|
|
|
2019-09-16 23:44:29 +08:00
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE)
|
|
|
|
q->integrity.profile->prepare_fn(rq);
|
|
|
|
#endif
|
2021-10-12 19:12:24 +08:00
|
|
|
if (rq->bio && rq->bio->bi_opf & REQ_POLLED)
|
2023-06-13 03:03:42 +08:00
|
|
|
WRITE_ONCE(rq->bio->bi_cookie, rq->mq_hctx->queue_num);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-09-14 07:40:09 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_start_request);
|
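The generation-and-state scheme described in the "replace timeout synchronization" commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the `model_*` names and the bit layout are hypothetical, and the RCU wait and the seqcount protection from the commit message are deliberately left out so that only the generation matching remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the low 2 bits hold the state, the rest the generation. */
enum { RQ_IDLE = 0, RQ_IN_FLIGHT = 1, RQ_COMPLETE = 2, RQ_STATE_MASK = 3 };
#define RQ_GEN_INC 4ULL	/* one generation step, just above the state bits */

struct model_rq {
	uint64_t gstate;		/* generation + state, written by the owner only */
	uint64_t aborted_gstate;	/* instance the timeout path wants to abort */
};

static void model_rq_init(struct model_rq *rq)
{
	rq->gstate = RQ_IDLE;
	/* Low bits of UINT64_MAX are 3, which is not a valid state, so no match. */
	rq->aborted_gstate = UINT64_MAX;
}

/* Owner: the request becomes in-flight, so bump the generation (step 1). */
static void model_rq_start(struct model_rq *rq)
{
	uint64_t gen = (rq->gstate & ~(uint64_t)RQ_STATE_MASK) + RQ_GEN_INC;

	rq->gstate = gen | RQ_IN_FLIGHT;
}

/* Timeout path: record which instance was judged to be expired (step 2). */
static void model_rq_mark_aborted(struct model_rq *rq, uint64_t seen_gstate)
{
	rq->aborted_gstate = seen_gstate;
}

/*
 * Completion path: skip completion if this exact instance was claimed by
 * the timeout path (step 3). Returns true if the request was completed here.
 */
static bool model_rq_complete(struct model_rq *rq)
{
	if (rq->gstate == rq->aborted_gstate)
		return false;
	rq->gstate = (rq->gstate & ~(uint64_t)RQ_STATE_MASK) | RQ_COMPLETE;
	return true;
}

/* Timeout path: terminate only if the instance still matches (step 4). */
static bool model_rq_should_terminate(const struct model_rq *rq)
{
	return rq->gstate == rq->aborted_gstate;
}
```

In this model, a timeout verdict computed against an older recycle instance no longer matches once the request has been restarted, which is the property the commit message relies on to avoid terminating the wrong instance.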
2013-10-24 16:20:05 +08:00
|
|
|
|
2022-05-12 22:00:10 +08:00
|
|
|
/*
|
|
|
|
* Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
|
|
|
|
* queues. This is important for md arrays to benefit from merging
|
|
|
|
* requests.
|
|
|
|
*/
|
|
|
|
static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
|
|
|
|
{
|
|
|
|
if (plug->multiple_queues)
|
|
|
|
return BLK_MAX_REQUEST_COUNT * 2;
|
|
|
|
return BLK_MAX_REQUEST_COUNT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
|
|
|
|
{
|
|
|
|
struct request *last = rq_list_peek(&plug->mq_list);
|
|
|
|
|
|
|
|
if (!plug->rq_count) {
|
|
|
|
trace_block_plug(rq->q);
|
|
|
|
} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
|
|
|
|
(!blk_queue_nomerges(rq->q) &&
|
|
|
|
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
|
|
|
|
blk_mq_flush_plug_list(plug, false);
|
2022-11-01 08:54:13 +08:00
|
|
|
last = NULL;
|
2022-05-12 22:00:10 +08:00
|
|
|
trace_block_plug(rq->q);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!plug->multiple_queues && last && last->q != rq->q)
|
|
|
|
plug->multiple_queues = true;
|
2023-06-24 21:01:05 +08:00
|
|
|
/*
|
|
|
|
* Any request allocated from sched tags can't be issued to
|
|
|
|
* ->queue_rqs() directly
|
|
|
|
*/
|
|
|
|
if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
|
2022-05-12 22:00:10 +08:00
|
|
|
plug->has_elevator = true;
|
|
|
|
rq->rq_next = NULL;
|
|
|
|
rq_list_add(&plug->mq_list, rq);
|
|
|
|
plug->rq_count++;
|
|
|
|
}
|
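The flush policy of blk_add_rq_to_plug() above can be modelled in userspace as follows. This is a minimal sketch, not the kernel implementation: the `model_*` names are hypothetical, the two constants are assumed values chosen for illustration, and the nomerges check, tracing and elevator handling are omitted. The model also assumes that `multiple_queues` survives a flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration; the real constants live in kernel headers. */
#define MODEL_MAX_REQUEST_COUNT	32
#define MODEL_PLUG_FLUSH_SIZE	(128 * 1024)

struct model_rq {
	int queue_id;		/* stands in for rq->q */
	unsigned int bytes;	/* stands in for blk_rq_bytes(rq) */
};

struct model_plug {
	unsigned short rq_count;
	bool multiple_queues;
	bool have_last;		/* false when the plug list is empty */
	struct model_rq last;	/* most recently added request */
	unsigned int flushes;	/* number of flushes, for observation only */
};

/* Twice the limit once requests for more than one queue were plugged. */
static unsigned short model_plug_max(const struct model_plug *plug)
{
	return plug->multiple_queues ? MODEL_MAX_REQUEST_COUNT * 2
				     : MODEL_MAX_REQUEST_COUNT;
}

static void model_plug_flush(struct model_plug *plug)
{
	plug->rq_count = 0;
	plug->have_last = false;
	plug->flushes++;
}

static void model_plug_add(struct model_plug *plug, struct model_rq rq)
{
	/* Flush first if the plug is full or the last request is already large. */
	if (plug->rq_count &&
	    (plug->rq_count >= model_plug_max(plug) ||
	     plug->last.bytes >= MODEL_PLUG_FLUSH_SIZE))
		model_plug_flush(plug);

	/* Remember that requests for different queues share this plug. */
	if (plug->have_last && plug->last.queue_id != rq.queue_id)
		plug->multiple_queues = true;

	plug->last = rq;
	plug->have_last = true;
	plug->rq_count++;
}
```

Adding a request past the limit flushes the plug before the new request is queued, so the new request always starts a fresh batch.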
|
|
|
|
2021-11-17 14:13:56 +08:00
|
|
|
/**
|
|
|
|
* blk_execute_rq_nowait - insert a request into the queue for execution
|
|
|
|
* @rq: request to insert
|
|
|
|
* @at_head: insert request at head or tail of queue
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Insert a fully prepared request at the head or tail of the queue, per @at_head,
|
|
|
|
* for execution. Don't wait for completion.
|
|
|
|
*
|
|
|
|
* Note:
|
|
|
|
* The caller is notified of completion through rq->end_io, if it is set.
|
|
|
|
*/
|
2022-05-24 20:15:30 +08:00
|
|
|
void blk_execute_rq_nowait(struct request *rq, bool at_head)
|
2021-11-17 14:13:56 +08:00
|
|
|
{
|
2023-04-13 14:40:51 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
|
|
|
|
2022-05-24 20:15:28 +08:00
|
|
|
WARN_ON(irqs_disabled());
|
|
|
|
WARN_ON(!blk_rq_is_passthrough(rq));
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:28 +08:00
|
|
|
blk_account_io_start(rq);
|
2022-09-29 22:41:41 +08:00
|
|
|
|
2023-04-13 14:40:51 +08:00
|
|
|
if (current->plug && !at_head) {
|
2022-05-24 20:15:28 +08:00
|
|
|
blk_add_rq_to_plug(current->plug, rq);
|
2023-04-13 14:40:51 +08:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-04-13 14:40:54 +08:00
|
|
|
blk_mq_insert_request(rq, at_head ? BLK_MQ_INSERT_AT_HEAD : 0);
|
block: Improve performance for BLK_MQ_F_BLOCKING drivers
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.
The following caller chains have been reviewed:
blk_mq_run_hw_queue(hctx, false)
  blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */
  blk_execute_rq() /* may sleep */
  blk_mq_run_hw_queues(q, async=false)
    blk_freeze_queue_start() /* may sleep */
    blk_mq_requeue_work() /* may sleep */
    scsi_kick_queue()
      scsi_requeue_run_queue() /* may sleep */
      scsi_run_host_queues()
        scsi_ioctl_reset() /* may sleep */
  blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
    blk_mq_dispatch_plug_list(plug, from_sched=false)
      blk_mq_flush_plug_list(plug, from_schedule=false)
        __blk_flush_plug(plug, from_schedule=false)
        blk_add_rq_to_plug()
          blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
  blk_mq_plug_issue_direct()
    blk_mq_flush_plug_list() /* see above */
  blk_mq_dispatch_plug_list(plug, from_sched=false)
    blk_mq_flush_plug_list() /* see above */
  blk_mq_try_issue_directly()
    blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
  blk_mq_try_issue_list_directly(hctx, list)
    blk_mq_insert_requests() /* see above */
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-22 01:27:30 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING);
|
2021-11-17 14:13:56 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
|
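The rule from the BLK_MQ_F_BLOCKING commit message above (run synchronously when sleeping is allowed, asynchronously otherwise) can be written as a small decision function. This is a hedged sketch: `MODEL_F_BLOCKING` and `model_run_async()` are hypothetical names, and the `caller_may_sleep` argument stands in for knowledge the kernel obtains from its calling context rather than from a parameter.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_F_BLOCKING (1u << 0)	/* hypothetical stand-in for BLK_MQ_F_BLOCKING */

/* Returns true when the queue run should be deferred to a worker. */
static bool model_run_async(unsigned int hctx_flags, bool caller_may_sleep)
{
	/* Drivers that never block can always be run inline. */
	if (!(hctx_flags & MODEL_F_BLOCKING))
		return false;

	/* A blocking driver may only be run inline from a sleepable context. */
	return !caller_may_sleep;
}
```

A call site that may be reached from atomic context corresponds to `caller_may_sleep == false`, which is why blk_execute_rq_nowait() above passes `hctx->flags & BLK_MQ_F_BLOCKING` as the async argument.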
|
|
|
|
2022-05-24 20:15:29 +08:00
|
|
|
struct blk_rq_wait {
|
|
|
|
struct completion done;
|
|
|
|
blk_status_t ret;
|
|
|
|
};
|
|
|
|
|
2022-09-22 05:19:54 +08:00
|
|
|
static enum rq_end_io_ret blk_end_sync_rq(struct request *rq, blk_status_t ret)
|
2022-05-24 20:15:29 +08:00
|
|
|
{
|
|
|
|
struct blk_rq_wait *wait = rq->end_io_data;
|
|
|
|
|
|
|
|
wait->ret = ret;
|
|
|
|
complete(&wait->done);
|
2022-09-22 05:19:54 +08:00
|
|
|
return RQ_END_IO_NONE;
|
2022-05-24 20:15:29 +08:00
|
|
|
}
|
|
|
|
|
2022-08-24 00:14:42 +08:00
|
|
|
bool blk_rq_is_poll(struct request *rq)
|
2021-11-17 14:13:56 +08:00
|
|
|
{
|
|
|
|
if (!rq->mq_hctx)
|
|
|
|
return false;
|
|
|
|
if (rq->mq_hctx->type != HCTX_TYPE_POLL)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
2022-08-24 00:14:42 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_is_poll);
|
2021-11-17 14:13:56 +08:00
|
|
|
|
|
|
|
static void blk_rq_poll_completion(struct request *rq, struct completion *wait)
|
|
|
|
{
|
|
|
|
do {
|
2023-06-13 03:03:42 +08:00
|
|
|
blk_hctx_poll(rq->q, rq->mq_hctx, NULL, 0);
|
2021-11-17 14:13:56 +08:00
|
|
|
cond_resched();
|
|
|
|
} while (!completion_done(wait));
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_execute_rq - insert a request into queue for execution
|
|
|
|
* @rq: request to insert
|
|
|
|
* @at_head: insert request at head or tail of queue
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Insert a fully prepared request at the head or tail of the queue, per @at_head,
|
|
|
|
* for execution and wait for completion.
|
|
|
|
* Return: The blk_status_t result provided to blk_mq_end_request().
|
|
|
|
*/
|
2021-11-26 20:18:01 +08:00
|
|
|
blk_status_t blk_execute_rq(struct request *rq, bool at_head)
|
2021-11-17 14:13:56 +08:00
|
|
|
{
|
2023-04-13 14:40:51 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2022-05-24 20:15:29 +08:00
|
|
|
struct blk_rq_wait wait = {
|
|
|
|
.done = COMPLETION_INITIALIZER_ONSTACK(wait.done),
|
|
|
|
};
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:28 +08:00
|
|
|
WARN_ON(irqs_disabled());
|
|
|
|
WARN_ON(!blk_rq_is_passthrough(rq));
|
2021-11-17 14:13:56 +08:00
|
|
|
|
|
|
|
rq->end_io_data = &wait;
|
2022-05-24 20:15:28 +08:00
|
|
|
rq->end_io = blk_end_sync_rq;
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:28 +08:00
|
|
|
blk_account_io_start(rq);
|
2023-04-13 14:40:54 +08:00
|
|
|
blk_mq_insert_request(rq, at_head ? BLK_MQ_INSERT_AT_HEAD : 0);
|
2023-04-13 14:40:51 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, false);
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2024-02-23 23:59:09 +08:00
|
|
|
if (blk_rq_is_poll(rq))
|
2022-05-24 20:15:29 +08:00
|
|
|
blk_rq_poll_completion(rq, &wait.done);
|
2024-02-23 23:59:09 +08:00
|
|
|
else
|
|
|
|
blk_wait_io(&wait.done);
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:29 +08:00
|
|
|
return wait.ret;
|
2021-11-17 14:13:56 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_execute_rq);
|
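The pattern used by blk_execute_rq() above, an on-stack wait structure filled in by an end_io callback and a polling loop that runs until the callback has fired, can be sketched in userspace. Everything here is an assumption made for illustration: the `model_*` names, the integer status, and the "device" that completes after a fixed number of polls. Only the polled path is modelled; the sleeping path is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_rq;
typedef void (*model_end_io_fn)(struct model_rq *rq, int status);

struct model_rq {
	model_end_io_fn end_io;	/* completion callback */
	void *end_io_data;	/* opaque pointer handed back to the callback */
	int polls_until_done;	/* hypothetical device: completes after N polls */
	int device_status;	/* status the hypothetical device reports */
};

struct model_wait {
	bool done;
	int ret;
};

/* Completion callback: record the status and mark the wait as finished. */
static void model_end_sync_rq(struct model_rq *rq, int status)
{
	struct model_wait *wait = rq->end_io_data;

	wait->ret = status;
	wait->done = true;
}

/* One poll step: the device completes once its counter reaches zero. */
static void model_poll(struct model_rq *rq)
{
	if (rq->polls_until_done > 0)
		rq->polls_until_done--;
	if (rq->polls_until_done == 0)
		rq->end_io(rq, rq->device_status);
}

/* Install the callback, then poll until the request has completed. */
static int model_execute_rq(struct model_rq *rq, int *polls_used)
{
	struct model_wait wait = { .done = false, .ret = -1 };
	int polls = 0;

	rq->end_io_data = &wait;
	rq->end_io = model_end_sync_rq;

	do {
		model_poll(rq);
		polls++;
	} while (!wait.done);

	if (polls_used)
		*polls_used = polls;
	return wait.ret;
}
```

Keeping the wait structure on the caller's stack works because the caller does not return until the callback has run, so the pointer stored in `end_io_data` stays valid for the whole lifetime of the request.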
|
|
|
|
2014-04-16 15:44:57 +08:00
|
|
|
static void __blk_mq_requeue_request(struct request *rq)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2017-11-02 23:24:38 +08:00
|
|
|
blk_mq_put_driver_tag(rq);
|
|
|
|
|
2020-12-04 00:21:39 +08:00
|
|
|
trace_block_rq_requeue(rq);
|
2018-07-03 23:32:35 +08:00
|
|
|
rq_qos_requeue(q, rq);
|
2014-02-12 00:27:14 +08:00
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
if (blk_mq_request_started(rq)) {
|
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
2018-06-14 19:58:45 +08:00
|
|
|
rq->rq_flags &= ~RQF_TIMED_OUT;
|
2014-09-14 07:40:09 +08:00
|
|
|
}
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2016-10-29 08:21:41 +08:00
|
|
|
void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
|
2014-04-16 15:44:57 +08:00
|
|
|
{
|
2023-04-13 14:40:53 +08:00
|
|
|
struct request_queue *q = rq->q;
|
2023-05-19 12:40:50 +08:00
|
|
|
unsigned long flags;
|
2023-04-13 14:40:53 +08:00
|
|
|
|
2014-04-16 15:44:57 +08:00
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
|
2018-02-23 23:36:56 +08:00
|
|
|
/* this request will be re-inserted to io scheduler queue */
|
|
|
|
blk_mq_sched_requeue_request(rq);
|
|
|
|
|
2023-05-19 12:40:50 +08:00
|
|
|
spin_lock_irqsave(&q->requeue_lock, flags);
|
|
|
|
list_add_tail(&rq->queuelist, &q->requeue_list);
|
|
|
|
spin_unlock_irqrestore(&q->requeue_lock, flags);
|
2023-04-13 14:40:53 +08:00
|
|
|
|
|
|
|
if (kick_requeue_list)
|
|
|
|
blk_mq_kick_requeue_list(q);
|
2014-04-16 15:44:57 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_requeue_request);
|
|
|
|
|
2014-05-28 22:08:02 +08:00
|
|
|
static void blk_mq_requeue_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct request_queue *q =
|
2016-09-15 01:28:30 +08:00
|
|
|
container_of(work, struct request_queue, requeue_work.work);
|
2014-05-28 22:08:02 +08:00
|
|
|
LIST_HEAD(rq_list);
|
2023-05-19 12:40:50 +08:00
|
|
|
LIST_HEAD(flush_list);
|
|
|
|
struct request *rq;
|
2014-05-28 22:08:02 +08:00
|
|
|
|
2017-07-27 22:03:57 +08:00
|
|
|
spin_lock_irq(&q->requeue_lock);
|
2014-05-28 22:08:02 +08:00
|
|
|
list_splice_init(&q->requeue_list, &rq_list);
|
2023-05-19 12:40:50 +08:00
|
|
|
list_splice_init(&q->flush_list, &flush_list);
|
2017-07-27 22:03:57 +08:00
|
|
|
spin_unlock_irq(&q->requeue_lock);
|
2014-05-28 22:08:02 +08:00
|
|
|
|
2023-05-19 12:40:50 +08:00
|
|
|
while (!list_empty(&rq_list)) {
|
|
|
|
rq = list_entry(rq_list.next, struct request, queuelist);
|
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
When requeuing, a request with RQF_DONTPREP set already contains
driver-specific data, so insert it into the hctx dispatch list to avoid
any merge. Taking SCSI as an example, here is the trace event log (no
I/O scheduler, because RQF_STARTED would prevent merging):
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-12 09:56:25 +08:00
|
|
|
/*
|
2023-04-13 14:40:48 +08:00
|
|
|
* If RQF_DONTPREP is set, the request has been started by the
|
|
|
|
* driver already and might have driver-specific data allocated
|
|
|
|
* already. Insert it into the hctx dispatch list to avoid
|
|
|
|
* block layer merges for the request.
|
2019-02-12 09:56:25 +08:00
|
|
|
*/
|
2023-04-13 14:40:48 +08:00
|
|
|
if (rq->rq_flags & RQF_DONTPREP) {
|
|
|
|
list_del_init(&rq->queuelist);
|
2023-04-13 14:40:55 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, 0);
|
2023-05-19 12:40:50 +08:00
|
|
|
} else {
|
2023-04-13 14:40:48 +08:00
|
|
|
list_del_init(&rq->queuelist);
|
2023-04-13 14:40:54 +08:00
|
|
|
blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
|
2023-04-13 14:40:48 +08:00
|
|
|
}
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
|
|
|
|
2023-05-19 12:40:50 +08:00
|
|
|
while (!list_empty(&flush_list)) {
|
|
|
|
rq = list_entry(flush_list.next, struct request, queuelist);
|
2014-05-28 22:08:02 +08:00
|
|
|
list_del_init(&rq->queuelist);
|
2023-04-13 14:40:54 +08:00
|
|
|
blk_mq_insert_request(rq, 0);
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
|
|
|
|
2016-10-29 08:20:32 +08:00
|
|
|
blk_mq_run_hw_queues(q, false);
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
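The routing decision made by blk_mq_requeue_work() above can be sketched with fixed-size arrays in place of kernel lists. This is an illustrative model only: the `model_*` and `list_push_*` names are hypothetical, locking and the separate flush list are left out, and the dispatch list and scheduler queue are reduced to plain arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX 8

struct model_rq {
	int id;
	bool dontprep;	/* stands in for RQF_DONTPREP */
};

struct model_list {
	struct model_rq items[MODEL_MAX];
	size_t len;
};

static void list_push_tail(struct model_list *l, struct model_rq rq)
{
	assert(l->len < MODEL_MAX);
	l->items[l->len++] = rq;
}

static void list_push_head(struct model_list *l, struct model_rq rq)
{
	assert(l->len < MODEL_MAX);
	/* Shift existing entries up by one to make room at index 0. */
	for (size_t i = l->len; i > 0; i--)
		l->items[i] = l->items[i - 1];
	l->items[0] = rq;
	l->len++;
}

/*
 * Drain the requeue list. Requests that already carry driver data go to the
 * dispatch list, where they cannot be merged; all others are inserted at the
 * head of the scheduler queue.
 */
static void model_requeue_work(struct model_list *requeue,
			       struct model_list *dispatch,
			       struct model_list *sched)
{
	for (size_t i = 0; i < requeue->len; i++) {
		struct model_rq rq = requeue->items[i];

		if (rq.dontprep)
			list_push_tail(dispatch, rq);
		else
			list_push_head(sched, rq);
	}
	requeue->len = 0;
}
```

Because each non-DONTPREP request is inserted at the head, requests drained later end up ahead of those drained earlier in this model.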
|
|
|
|
|
|
|
void blk_mq_kick_requeue_list(struct request_queue *q)
|
|
|
|
{
|
2018-01-20 00:58:55 +08:00
|
|
|
kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_kick_requeue_list);
|
|
|
|
|
2016-09-15 01:28:30 +08:00
|
|
|
void blk_mq_delay_kick_requeue_list(struct request_queue *q,
|
|
|
|
unsigned long msecs)
|
|
|
|
{
|
2017-08-10 02:28:06 +08:00
|
|
|
kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work,
|
|
|
|
msecs_to_jiffies(msecs));
|
2016-09-15 01:28:30 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
|
|
|
|
|
2023-12-01 16:56:05 +08:00
|
|
|
static bool blk_is_flush_data_rq(struct request *rq)
|
|
|
|
{
|
|
|
|
return (rq->rq_flags & RQF_FLUSH_SEQ) && !is_flush_rq(rq);
|
|
|
|
}

static bool blk_mq_rq_inflight(struct request *rq, void *priv)
{
	/*
	 * If we find a request that isn't idle we know the queue is busy
	 * as it's checked in the iter.
	 * Return false to stop the iteration.
	 *
	 * If the queue is quiesced and a flush data request has completed,
	 * don't count it as inflight: the flush sequence is suspended, and
	 * the original flush data request is invisible to the driver, just
	 * like other requests left pending by the quiesce.
	 */
	if (blk_mq_request_started(rq) && !(blk_queue_quiesced(rq->q) &&
				blk_is_flush_data_rq(rq) &&
				blk_mq_request_completed(rq))) {
		bool *busy = priv;

		*busy = true;
		return false;
	}

	return true;
}

bool blk_mq_queue_inflight(struct request_queue *q)
{
	bool busy = false;

	blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
	return busy;
}
EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);
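
The inflight check relies on an iterator whose callback returns false to stop the walk early. A minimal user-space sketch of that pattern follows; every `toy_*` name is hypothetical, and a plain array stands in for the tag set that the real iterator walks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified request state. */
struct toy_rq {
	bool started;
	bool completed;
	bool flush_data;
};

/* Context handed to the callback through the opaque priv pointer. */
struct toy_ctx {
	bool quiesced;
	bool busy;
};

typedef bool (*toy_iter_fn)(struct toy_rq *rq, void *priv);

/* Visit requests until the callback returns false; report how many were visited. */
static size_t toy_busy_iter(struct toy_rq *rqs, size_t n, toy_iter_fn fn, void *priv)
{
	size_t visited = 0;

	assert(fn != NULL);
	for (size_t i = 0; i < n; i++) {
		visited++;
		if (!fn(&rqs[i], priv))
			break;
	}
	return visited;
}

/* A started request counts as busy unless it is a completed flush data
 * request on a quiesced queue. */
static bool toy_rq_inflight(struct toy_rq *rq, void *priv)
{
	struct toy_ctx *ctx = priv;
	bool suspended_flush = ctx->quiesced && rq->flush_data && rq->completed;

	if (rq->started && !suspended_flush) {
		ctx->busy = true;
		return false;	/* one busy request is enough, stop iterating */
	}
	return true;
}

static bool toy_queue_inflight(struct toy_rq *rqs, size_t n, bool quiesced)
{
	struct toy_ctx ctx = { .quiesced = quiesced, .busy = false };

	toy_busy_iter(rqs, n, toy_rq_inflight, &ctx);
	return ctx.busy;
}
```

Passing the result through `priv` keeps the iterator generic: it only needs to understand "continue" versus "stop", while each callback decides what to record.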

static void blk_mq_rq_timed_out(struct request *req)
{
	req->rq_flags |= RQF_TIMED_OUT;
	if (req->q->mq_ops->timeout) {
		enum blk_eh_timer_return ret;

		ret = req->q->mq_ops->timeout(req);
		if (ret == BLK_EH_DONE)
			return;
		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
	}

	blk_add_timer(req);
}
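
The contract with the driver's timeout hook can be modelled in user space: the hook either reports that it dealt with the request, or asks for the timer to be re-armed, and a missing hook also leads to a re-arm. This is a sketch only; the `toy_*` types and functions are hypothetical stand-ins, and a counter replaces the actual timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two return values used above. */
enum toy_eh_ret { TOY_EH_DONE, TOY_EH_RESET_TIMER };

struct toy_req {
	bool timed_out;		/* models the "timed out" request flag */
	unsigned int rearmed;	/* counts how often the timer was re-armed */
	enum toy_eh_ret (*timeout)(struct toy_req *req);	/* optional driver hook */
};

/* Example hooks a driver might supply. */
static enum toy_eh_ret toy_driver_done(struct toy_req *req)
{
	(void)req;
	return TOY_EH_DONE;
}

static enum toy_eh_ret toy_driver_reset(struct toy_req *req)
{
	(void)req;
	return TOY_EH_RESET_TIMER;
}

static void toy_rq_timed_out(struct toy_req *req)
{
	assert(req != NULL);
	req->timed_out = true;
	if (req->timeout) {
		enum toy_eh_ret ret = req->timeout(req);

		if (ret == TOY_EH_DONE)
			return;		/* driver took care of the request */
	}
	req->rearmed++;			/* otherwise give the request more time */
}
```

Setting the flag before calling the hook means a later expiry scan can skip a request that has already been reported once.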
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00

struct blk_expired_data {
	bool has_timedout_rq;
	unsigned long next;
	unsigned long timeout_start;
};

static bool blk_mq_req_expired(struct request *rq, struct blk_expired_data *expired)
{
	unsigned long deadline;

	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
		return false;
	if (rq->rq_flags & RQF_TIMED_OUT)
		return false;

	deadline = READ_ONCE(rq->deadline);
	if (time_after_eq(expired->timeout_start, deadline))
		return true;

	if (expired->next == 0)
		expired->next = deadline;
	else if (time_after(expired->next, deadline))
		expired->next = deadline;
	return false;
}
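
The expiry test compares tick counters that may wrap around, and it records the earliest deadline that has not yet passed. A user-space sketch of both ideas is shown here, assuming two's-complement conversion from `unsigned long` to `long`; the `toy_*` names are hypothetical, and the zero-means-unset convention mirrors the `next == 0` check above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Wrap-safe "a is at or after b": the signed difference stays meaningful
 * as long as the two values are less than half the counter range apart. */
static bool toy_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Wrap-safe "a is strictly after b". */
static bool toy_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct toy_scan {
	unsigned long start;	/* time at which the scan began */
	unsigned long next;	/* earliest pending deadline, 0 means "none yet" */
};

/* Returns true if the deadline has passed; otherwise remembers the
 * earliest pending deadline so the timer can be re-armed for it. */
static bool toy_scan_one(struct toy_scan *s, unsigned long deadline)
{
	assert(s != NULL);
	if (toy_after_eq(s->start, deadline))
		return true;

	if (s->next == 0 || toy_after(s->next, deadline))
		s->next = deadline;
	return false;
}
```

Because the comparison uses the difference rather than the raw values, a deadline of `ULONG_MAX - 2` is treated as already past when the scan starts at tick 5, shortly after the counter wrapped.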

void blk_mq_put_rq_ref(struct request *rq)
{
	if (is_flush_rq(rq)) {
		if (rq->end_io(rq, 0) == RQ_END_IO_FREE)
			blk_mq_free_request(rq);
	} else if (req_ref_put_and_test(rq)) {
		__blk_mq_free_request(rq);
	}
}
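
The non-flush branch follows the usual "put and test" reference counting idiom: whoever drops the last reference frees the object. A minimal C11 sketch of that idiom is given below; `toy_obj`, `toy_ref_put_and_test()` and `toy_put()` are hypothetical names, and a flag stands in for the actual free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_obj {
	atomic_int ref;		/* number of outstanding references */
	bool freed;		/* stands in for actually releasing the object */
};

/* Drop one reference; returns true only for the caller that dropped the last one. */
static bool toy_ref_put_and_test(struct toy_obj *o)
{
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	return atomic_fetch_sub(&o->ref, 1) == 1;
}

static void toy_put(struct toy_obj *o)
{
	assert(o != NULL);
	if (toy_ref_put_and_test(o))
		o->freed = true;
}
```

Since the decrement and the test are a single atomic step, two concurrent callers cannot both observe the count reaching zero, so the release happens exactly once.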

static bool blk_mq_check_expired(struct request *rq, void *priv)
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
{
	struct blk_expired_data *expired = priv;

	/*
	 * blk_mq_queue_tag_busy_iter() has locked the request, so it cannot
	 * be reallocated while the timeout handler is processing it, which
	 * makes the expire check reliable. If the request is not expired,
	 * then it was completed and reallocated as a new request after
	 * returning from blk_mq_check_expired().
	 */
	if (blk_mq_req_expired(rq, expired)) {
		expired->has_timedout_rq = true;
		return false;
	}
	return true;
}

static bool blk_mq_handle_expired(struct request *rq, void *priv)
{
	struct blk_expired_data *expired = priv;

	if (blk_mq_req_expired(rq, expired))
		blk_mq_rq_timed_out(rq);
	return true;
}

static void blk_mq_timeout_work(struct work_struct *work)
{
	struct request_queue *q =
		container_of(work, struct request_queue, timeout_work);
	struct blk_expired_data expired = {
		.timeout_start = jiffies,
	};
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed (it lost the race) or the completion path will
yield to it, so it can safely time out the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
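The four steps above can be sketched in plain user-space C. This is a minimal single-threaded model, not the kernel code: every `toy_` name, the 2-bit state packing, and the sentinel used for "nothing aborted" are assumptions made for illustration, and the seqcount and the RCU wait are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing: low 2 bits hold the state, the rest the generation. */
enum toy_state { TOY_IDLE = 0, TOY_IN_FLIGHT = 1, TOY_COMPLETE = 2 };
#define TOY_STATE_BITS 2
#define TOY_STATE_MASK ((UINT64_C(1) << TOY_STATE_BITS) - 1)

struct toy_request {
	uint64_t gstate;         /* generation + state, written by the owner only */
	uint64_t aborted_gstate; /* written by the timeout path only */
	uint64_t deadline;       /* absolute time at which this instance expires */
};

static enum toy_state toy_state_of(const struct toy_request *rq)
{
	return (enum toy_state)(rq->gstate & TOY_STATE_MASK);
}

static void toy_init(struct toy_request *rq)
{
	rq->gstate = TOY_IDLE;
	rq->aborted_gstate = UINT64_MAX; /* state bits == 3: never a valid gstate */
	rq->deadline = 0;
}

/* Step 1: becoming in-flight bumps the generation and sets the deadline. */
static void toy_start(struct toy_request *rq, uint64_t now, uint64_t timeout)
{
	uint64_t gen = (rq->gstate >> TOY_STATE_BITS) + 1;

	assert(toy_state_of(rq) == TOY_IDLE);
	rq->gstate = (gen << TOY_STATE_BITS) | TOY_IN_FLIGHT;
	rq->deadline = now + timeout;
}

/* Step 2: the timeout path records which instance it wants to abort. */
static bool toy_mark_expired(struct toy_request *rq, uint64_t now)
{
	if (toy_state_of(rq) != TOY_IN_FLIGHT || now < rq->deadline)
		return false;
	rq->aborted_gstate = rq->gstate;
	return true;
}

/* Step 3: completion yields if this exact instance is marked for abortion. */
static bool toy_complete(struct toy_request *rq)
{
	if (rq->gstate == rq->aborted_gstate)
		return false;
	rq->gstate = (rq->gstate & ~TOY_STATE_MASK) | TOY_COMPLETE;
	return true;
}

/* Freeing keeps the generation so the next start produces a new one. */
static void toy_free(struct toy_request *rq)
{
	rq->gstate = (rq->gstate & ~TOY_STATE_MASK) | TOY_IDLE;
}

/* Step 4: after the (omitted) RCU wait, only a still-matching instance dies. */
static bool toy_terminate_expired(const struct toy_request *rq)
{
	return rq->gstate == rq->aborted_gstate;
}
```

In this model a stale abort record can never hit a later instance: once the request has been completed, freed and started again, its generation differs from the recorded one, so `toy_terminate_expired()` returns false.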
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
blk-mq: Allow timeouts to run while queue is freezing
In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by the first thread. This puts
the request_queue in a hung state, forever waiting for the freeze
completion.
This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.
[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion. This can be
achieved with percpu_ref_tryget, which does not require the counter
to be alive. This way the timeout work can do its job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.
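The difference between the two kinds of "try get" can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct toy_ref` and the `toy_ref_*` helpers are hypothetical stand-ins, a plain counter replaces the per-CPU machinery, and nothing here is thread-safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a queue usage counter. */
struct toy_ref {
	unsigned long count; /* includes the initial reference until killed */
	bool dying;          /* set once a freeze has started */
};

static void toy_ref_init(struct toy_ref *ref)
{
	ref->count = 1; /* the initial reference */
	ref->dying = false;
}

/* Succeeds as long as the count has not reached zero, dying or not. */
static bool toy_ref_tryget(struct toy_ref *ref)
{
	if (ref->count == 0)
		return false;
	ref->count++;
	return true;
}

/* Succeeds only while the counter is still alive (no freeze started). */
static bool toy_ref_tryget_live(struct toy_ref *ref)
{
	if (ref->dying)
		return false;
	return toy_ref_tryget(ref);
}

/* Drops one reference; returns true when the last one is gone. */
static bool toy_ref_put(struct toy_ref *ref)
{
	assert(ref->count > 0);
	return --ref->count == 0;
}

/* Starts a freeze: mark the counter dying and drop the initial reference. */
static bool toy_ref_kill(struct toy_ref *ref)
{
	assert(!ref->dying);
	ref->dying = true;
	return toy_ref_put(ref);
}
```

With a stuck request holding one reference, `toy_ref_kill()` leaves the count at one. A timeout path built on `toy_ref_tryget_live()` is refused at that point, while one built on `toy_ref_tryget()` gets in, terminates the request, and lets the count reach zero so the freeze can complete.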
Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout. In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.
I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process. I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled on a per-driver basis. I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-01 22:23:39 +08:00
|
|
|
/* A deadlock might occur if a request is stuck requiring a
|
|
|
|
* timeout at the same time a queue freeze is waiting
|
|
|
|
* completion, since the timeout code would not be able to
|
|
|
|
* acquire the queue reference here.
|
|
|
|
*
|
|
|
|
* That's why we don't use blk_queue_enter here; instead, we use
|
|
|
|
* percpu_ref_tryget directly, because we need to be able to
|
|
|
|
* obtain a reference even in the short window between the queue
|
|
|
|
* starting to freeze, by dropping the first reference in
|
2017-03-27 20:06:57 +08:00
|
|
|
* blk_freeze_queue_start, and the moment the last request is
|
2016-08-01 22:23:39 +08:00
|
|
|
* consumed, marked by the instant q_usage_counter reaches
|
|
|
|
* zero.
|
|
|
|
*/
|
|
|
|
if (!percpu_ref_tryget(&q->q_usage_counter))
|
2015-10-30 20:57:30 +08:00
|
|
|
return;
|
|
|
|
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found a double ->queue_rq() issue. So far it can be
triggered in VM use cases, where long vmexit latency, preemption
latency of the vCPU pthread, or a long page fault in the vCPU pthread
lets a block IO request time out after blk_mq_start_request() has been
called during ->queue_rq() but before the request is queued to hardware.
The timeout handler may then handle it by requeueing, which causes a
double ->queue_rq() and a kernel panic.
So far, it has been the driver's responsibility to cover the race between
timeout and completion, so in theory this should be solved in the driver,
given that the driver has enough knowledge.
But this is really a common problem: lots of drivers could have a similar
issue, it could be hard to fix all affected drivers, and it isn't even easy
for a driver to handle the race. So David suggests this patch, which drains
in-progress ->queue_rq() to solve the issue.
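The timeout work around this commit scans the requests twice against a single time snapshot: once to see whether anything expired, and again, after the drain, to handle what expired. A minimal user-space sketch of that shape follows. The `toy_` names and the array of deadlines are hypothetical, the drain step itself is not modelled, and the wraparound-safe comparison relies on the usual two's-complement conversion to `long`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state carried across both scans. */
struct toy_expired {
	bool has_timedout_rq;
	unsigned long next;          /* earliest pending deadline, 0 = none */
	unsigned long timeout_start; /* snapshot taken when the scan begins */
};

/* Wraparound-safe "a is at or after b" for tick counters. */
static bool toy_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Judge one deadline against the snapshot, tracking the next timer expiry. */
static bool toy_req_expired(unsigned long deadline, struct toy_expired *e)
{
	if (toy_time_after_eq(e->timeout_start, deadline))
		return true;
	if (e->next == 0 || !toy_time_after_eq(deadline, e->next))
		e->next = deadline;
	return false;
}

/* Pass 1: only record whether any request has timed out. */
static void toy_check_expired(const unsigned long *deadlines, size_t n,
			      struct toy_expired *e)
{
	for (size_t i = 0; i < n; i++)
		if (toy_req_expired(deadlines[i], e))
			e->has_timedout_rq = true;
}

/* Pass 2 (after the drain): count the requests that would be timed out. */
static size_t toy_handle_expired(const unsigned long *deadlines, size_t n,
				 struct toy_expired *e)
{
	size_t handled = 0;

	for (size_t i = 0; i < n; i++)
		if (toy_req_expired(deadlines[i], e))
			handled++;
	return handled;
}
```

Because both passes compare against `timeout_start` rather than the current time, a request that only passes its deadline while the drain is in progress is not picked up by the second pass; it is left for the next timer run.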
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
/* check if there is any timed-out request */
|
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
|
|
|
|
if (expired.has_timedout_rq) {
|
|
|
|
/*
|
|
|
|
* Before walking tags, we must ensure any submit started
|
|
|
|
* before the current time has finished. Since the submit
|
|
|
|
* uses srcu or rcu, wait for a synchronization point to
|
|
|
|
* ensure all running submits have finished.
|
|
|
|
*/
|
2022-11-01 23:00:48 +08:00
|
|
|
blk_mq_wait_quiesce_done(q->tag_set);
|
2022-10-26 13:19:57 +08:00
|
|
|
|
|
|
|
expired.next = 0;
|
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_handle_expired, &expired);
|
|
|
|
}
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2022-10-26 13:19:57 +08:00
|
|
|
if (expired.next != 0) {
|
|
|
|
mod_timer(&q->timeout, expired.next);
|
2014-05-14 05:10:52 +08:00
|
|
|
} else {
|
2018-01-11 00:33:33 +08:00
|
|
|
/*
|
|
|
|
* Request timeouts are handled as a forward rolling timer. If
|
|
|
|
* we end up here it means that no requests are pending and
|
|
|
|
* also that no request has been pending for a while. Mark
|
|
|
|
* each hctx as idle.
|
|
|
|
*/
|
2015-04-21 10:00:19 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
|
|
|
/* the hctx may be unmapped, so check it here */
|
|
|
|
if (blk_mq_hw_queue_mapped(hctx))
|
|
|
|
blk_mq_tag_idle(hctx);
|
|
|
|
}
|
2014-05-14 05:10:52 +08:00
|
|
|
}
|
2015-10-30 20:57:30 +08:00
|
|
|
blk_queue_exit(q);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2016-09-17 22:38:44 +08:00
|
|
|
struct flush_busy_ctx_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct list_head *list;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
|
|
|
|
{
|
|
|
|
struct flush_busy_ctx_data *flush_data = data;
|
|
|
|
struct blk_mq_hw_ctx *hctx = flush_data->hctx;
|
|
|
|
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type = hctx->type;
|
2016-09-17 22:38:44 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
list_splice_tail_init(&ctx->rq_lists[type], flush_data->list);
|
2018-02-28 08:56:42 +08:00
|
|
|
sbitmap_clear_bit(sb, bitnr);
|
2016-09-17 22:38:44 +08:00
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2014-05-19 23:23:55 +08:00
|
|
|
/*
|
|
|
|
* Process software queues that have been marked busy, splicing them
|
|
|
|
* to the for-dispatch list
|
|
|
|
*/
|
2016-12-15 05:34:47 +08:00
|
|
|
void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
|
2014-05-19 23:23:55 +08:00
|
|
|
{
|
2016-09-17 22:38:44 +08:00
|
|
|
struct flush_busy_ctx_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
.list = list,
|
|
|
|
};
|
2014-05-19 23:23:55 +08:00
|
|
|
|
2016-09-17 22:38:44 +08:00
|
|
|
sbitmap_for_each_set(&hctx->ctx_map, flush_busy_ctx, &data);
|
2014-05-19 23:23:55 +08:00
|
|
|
}
|
2016-12-15 05:34:47 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
|
2014-05-19 23:23:55 +08:00
|
|
|
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each LUN
(.cmd_per_lun), which is often small; for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in the sw queue, and we always flush all
requests belonging to the same hw queue and dispatch them all to the driver.
Unfortunately it is easy to make the queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, where no bio merge can happen on them, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in the scheduler's way. We can
then avoid dequeueing too many requests from the sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
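The idea of taking budget first and then pulling one request at a time can be sketched in user-space C. This is a simplified model, not the scheduler code: the ring-buffer software queue, the integer "requests", and every `toy_` name are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SWQ_CAP 16

/* Hypothetical software queue: a small ring buffer of request ids. */
struct toy_swq {
	int reqs[TOY_SWQ_CAP];
	size_t head;  /* index of the next request to dequeue */
	size_t count; /* number of queued requests */
};

/* Hypothetical device budget, e.g. a small per-LUN queue depth. */
struct toy_budget {
	unsigned int avail;
};

static void toy_enqueue(struct toy_swq *q, int rq)
{
	assert(q->count < TOY_SWQ_CAP);
	q->reqs[(q->head + q->count) % TOY_SWQ_CAP] = rq;
	q->count++;
}

static bool toy_dequeue(struct toy_swq *q, int *rq)
{
	if (q->count == 0)
		return false;
	*rq = q->reqs[q->head];
	q->head = (q->head + 1) % TOY_SWQ_CAP;
	q->count--;
	return true;
}

static bool toy_get_budget(struct toy_budget *b)
{
	if (b->avail == 0)
		return false;
	b->avail--;
	return true;
}

static void toy_put_budget(struct toy_budget *b)
{
	b->avail++;
}

/*
 * Dispatch loop: reserve budget, then dequeue exactly one request.
 * If the queue turns out to be empty, the unused budget is returned.
 * Requests that do not fit the budget stay in the software queue.
 */
static size_t toy_dispatch(struct toy_swq *q, struct toy_budget *b)
{
	size_t sent = 0;
	int rq;

	while (toy_get_budget(b)) {
		if (!toy_dequeue(q, &rq)) {
			toy_put_budget(b);
			break;
		}
		sent++; /* a real driver would be handed rq here */
	}
	return sent;
}
```

With five queued requests and a budget of three, the loop sends three and leaves two in the software queue, where they remain candidates for merging instead of sitting on a dispatch list.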
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 17:22:30 +08:00
|
|
|
struct dispatch_rq_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct request *rq;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct dispatch_rq_data *dispatch_data = data;
|
|
|
|
struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
|
|
|
|
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type = hctx->type;
|
2017-10-14 17:22:30 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
if (!list_empty(&ctx->rq_lists[type])) {
|
|
|
|
dispatch_data->rq = list_entry_rq(ctx->rq_lists[type].next);
|
2017-10-14 17:22:30 +08:00
|
|
|
list_del_init(&dispatch_data->rq->queuelist);
|
2018-12-17 23:44:05 +08:00
|
|
|
if (list_empty(&ctx->rq_lists[type]))
|
2017-10-14 17:22:30 +08:00
|
|
|
sbitmap_clear_bit(sb, bitnr);
|
|
|
|
}
|
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
|
|
|
|
return !dispatch_data->rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *start)
|
|
|
|
{
|
2018-10-30 03:13:29 +08:00
|
|
|
unsigned off = start ? start->index_hw[hctx->type] : 0;
|
2017-10-14 17:22:30 +08:00
|
|
|
struct dispatch_rq_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
.rq = NULL,
|
|
|
|
};
|
|
|
|
|
|
|
|
__sbitmap_for_each_set(&hctx->ctx_map, off,
|
|
|
|
dispatch_rq_from_ctx, &data);
|
|
|
|
|
|
|
|
return data.rq;
|
|
|
|
}
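
The dequeue path above walks the set bits of hctx->ctx_map starting at an offset and stops as soon as the callback reports that it found a request. A minimal user-space sketch of that "scan from an offset, wrap around, stop on first hit" pattern follows. Every toy_* name is hypothetical, a plain 32-bit word stands in for the kernel sbitmap, and unlike dispatch_rq_from_ctx the toy callback always stops at the first set bit instead of continuing when a software queue turns out to be empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Callback type: return false to stop the scan, true to continue. */
typedef bool (*toy_bit_fn)(unsigned int bitnr, void *data);

/* Visit every set bit in map[0..nbits), starting at 'start' and wrapping. */
static void toy_for_each_set(uint32_t map, unsigned int nbits,
			     unsigned int start, toy_bit_fn fn, void *data)
{
	assert(nbits > 0 && nbits <= 32);
	for (unsigned int i = 0; i < nbits; i++) {
		unsigned int bitnr = (start + i) % nbits;

		if ((map & (UINT32_C(1) << bitnr)) && !fn(bitnr, data))
			return;
	}
}

struct toy_pick {
	int picked;	/* -1 until a set bit has been seen */
};

/* Record the first set bit and ask the iterator to stop. */
static bool toy_pick_first(unsigned int bitnr, void *data)
{
	struct toy_pick *p = data;

	p->picked = (int)bitnr;
	return false;
}

/* Return the first pending queue index at or after 'start', or -1. */
static int toy_dequeue_from(uint32_t map, unsigned int nbits,
			    unsigned int start)
{
	struct toy_pick p = { .picked = -1 };

	toy_for_each_set(map, nbits, start % nbits, toy_pick_first, &p);
	return p.picked;
}
```

Starting the scan at the queue after the one served last is what gives the round-robin behaviour: with bits 0 and 2 set and a start offset of 3, the scan wraps and picks index 0.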
|
|
|
|
|
2023-09-13 23:16:12 +08:00
|
|
|
bool __blk_mq_alloc_driver_tag(struct request *rq)
|
2020-06-30 22:03:55 +08:00
|
|
|
{
|
2021-10-05 18:23:38 +08:00
|
|
|
struct sbitmap_queue *bt = &rq->mq_hctx->tags->bitmap_tags;
|
2020-06-30 22:03:55 +08:00
|
|
|
unsigned int tag_offset = rq->mq_hctx->tags->nr_reserved_tags;
|
|
|
|
int tag;
|
|
|
|
|
2020-07-06 22:41:11 +08:00
|
|
|
blk_mq_tag_busy(rq->mq_hctx);
|
|
|
|
|
2020-06-30 22:03:55 +08:00
|
|
|
if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
|
2021-10-05 18:23:38 +08:00
|
|
|
bt = &rq->mq_hctx->tags->breserved_tags;
|
2020-06-30 22:03:55 +08:00
|
|
|
tag_offset = 0;
|
2020-09-11 18:41:14 +08:00
|
|
|
} else {
|
|
|
|
if (!hctx_may_queue(rq->mq_hctx, bt))
|
|
|
|
return false;
|
2020-06-30 22:03:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
tag = __sbitmap_queue_get(bt);
|
|
|
|
if (tag == BLK_MQ_NO_TAG)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
rq->tag = tag + tag_offset;
|
2023-09-13 23:16:12 +08:00
|
|
|
blk_mq_inc_active_requests(rq->mq_hctx);
|
2020-07-06 22:41:11 +08:00
|
|
|
return true;
|
2020-06-30 22:03:55 +08:00
|
|
|
}
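
The allocation above picks one of two bitmaps and, for non-reserved requests, shifts the returned bit by nr_reserved_tags so that reserved and normal tags share one number space. A small sketch of that numbering scheme, assuming two 32-bit words as the bitmaps; the toy_* names are hypothetical and the hctx_may_queue() fairness check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_tagset {
	unsigned int nr_reserved;	/* tags [0, nr_reserved) */
	unsigned int nr_normal;		/* tags [nr_reserved, nr_reserved + nr_normal) */
	uint32_t reserved_map;		/* one bit per reserved tag in use */
	uint32_t normal_map;		/* one bit per normal tag in use */
};

/* Claim the lowest clear bit below nbits; return it, or -1 if none is free. */
static int toy_bitmap_get(uint32_t *map, unsigned int nbits)
{
	assert(nbits <= 32);
	for (unsigned int i = 0; i < nbits; i++) {
		if (!(*map & (UINT32_C(1) << i))) {
			*map |= UINT32_C(1) << i;
			return (int)i;
		}
	}
	return -1;
}

/* Return a tag in the shared number space, or -1 when the pool is empty. */
static int toy_alloc_tag(struct toy_tagset *ts, bool reserved)
{
	int bit;

	if (reserved)
		bit = toy_bitmap_get(&ts->reserved_map, ts->nr_reserved);
	else
		bit = toy_bitmap_get(&ts->normal_map, ts->nr_normal);
	if (bit < 0)
		return -1;

	/* Reserved tags use offset 0; normal tags sit above them. */
	return reserved ? bit : bit + (int)ts->nr_reserved;
}
```

With two reserved and four normal tags, the first normal allocation yields tag 2 while the reserved pool hands out 0 and 1 independently.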
|
|
|
|
|
2017-11-09 23:32:43 +08:00
|
|
|
static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
|
|
|
|
int flags, void *key)
|
2017-02-23 02:58:29 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
|
|
|
|
hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
|
|
|
|
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_lock(&hctx->dispatch_wait_lock);
|
2019-03-26 02:34:10 +08:00
|
|
|
if (!list_empty(&wait->entry)) {
|
|
|
|
struct sbitmap_queue *sbq;
|
|
|
|
|
|
|
|
list_del_init(&wait->entry);
|
2021-10-05 18:23:38 +08:00
|
|
|
sbq = &hctx->tags->bitmap_tags;
|
2019-03-26 02:34:10 +08:00
|
|
|
atomic_dec(&sbq->ws_active);
|
|
|
|
}
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
|
2017-02-23 02:58:29 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2017-11-10 07:10:13 +08:00
|
|
|
/*
|
|
|
|
* Mark us waiting for a tag. For shared tags, this involves hooking us into
|
2018-01-10 02:09:15 +08:00
|
|
|
* the tag wakeups. For non-shared tags, we can simply mark us needing a
|
|
|
|
* restart. For both cases, take care to check the condition again after
|
2017-11-10 07:10:13 +08:00
|
|
|
* marking us as waiting.
|
|
|
|
*/
|
2018-06-25 19:31:46 +08:00
|
|
|
static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
|
2017-11-10 07:10:13 +08:00
|
|
|
struct request *rq)
|
2017-02-23 02:58:29 +08:00
|
|
|
{
|
2023-01-18 17:37:15 +08:00
|
|
|
struct sbitmap_queue *sbq;
|
2018-06-25 19:31:47 +08:00
|
|
|
struct wait_queue_head *wq;
|
2017-11-10 07:10:13 +08:00
|
|
|
wait_queue_entry_t *wait;
|
|
|
|
bool ret;
|
2017-02-23 02:58:29 +08:00
|
|
|
|
2023-01-18 17:37:16 +08:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) &&
|
|
|
|
!(blk_mq_is_shared_tags(hctx->flags))) {
|
2019-03-15 11:05:10 +08:00
|
|
|
blk_mq_sched_mark_restart_hctx(hctx);
|
2017-11-10 07:10:13 +08:00
|
|
|
|
2018-01-11 05:41:21 +08:00
|
|
|
/*
|
|
|
|
* It's possible that a tag was freed in the window between the
|
|
|
|
* allocation failure and adding the hardware queue to the wait
|
|
|
|
* queue.
|
|
|
|
*
|
|
|
|
* Don't clear RESTART here, someone else could have set it.
|
|
|
|
* At most this will cost an extra queue run.
|
|
|
|
*/
|
2018-06-25 19:31:45 +08:00
|
|
|
return blk_mq_get_driver_tag(rq);
|
2017-11-09 23:32:43 +08:00
|
|
|
}
|
|
|
|
|
2018-06-25 19:31:46 +08:00
|
|
|
wait = &hctx->dispatch_wait;
|
2018-01-11 05:41:21 +08:00
|
|
|
if (!list_empty_careful(&wait->entry))
|
|
|
|
return false;
|
|
|
|
|
2023-01-18 17:37:15 +08:00
|
|
|
if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag))
|
|
|
|
sbq = &hctx->tags->breserved_tags;
|
|
|
|
else
|
|
|
|
sbq = &hctx->tags->bitmap_tags;
|
2019-03-26 02:34:10 +08:00
|
|
|
wq = &bt_wait_ptr(sbq, hctx)->wait;
|
2018-06-25 19:31:47 +08:00
|
|
|
|
|
|
|
spin_lock_irq(&wq->lock);
|
|
|
|
spin_lock(&hctx->dispatch_wait_lock);
|
2018-01-11 05:41:21 +08:00
|
|
|
if (!list_empty(&wait->entry)) {
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-11 05:41:21 +08:00
|
|
|
return false;
|
2017-11-09 23:32:43 +08:00
|
|
|
}
|
|
|
|
|
2019-03-26 02:34:10 +08:00
|
|
|
atomic_inc(&sbq->ws_active);
|
2018-06-25 19:31:47 +08:00
|
|
|
wait->flags &= ~WQ_FLAG_EXCLUSIVE;
|
|
|
|
__add_wait_queue(wq, wait);
|
2018-01-11 05:41:21 +08:00
|
|
|
|
2024-01-12 20:26:26 +08:00
|
|
|
/*
|
|
|
|
* Add one explicit barrier since blk_mq_get_driver_tag() may
|
|
|
|
* not imply barrier in case of failure.
|
|
|
|
*
|
|
|
|
* Order adding us to wait queue and allocating driver tag.
|
|
|
|
*
|
|
|
|
* The pair is the one implied in sbitmap_queue_wake_up() which
|
|
|
|
* orders clearing sbitmap tag bits and waitqueue_active() in
|
|
|
|
* __sbitmap_queue_wake_up(), since waitqueue_active() is lockless
|
|
|
|
*
|
|
|
|
* Otherwise, re-order of adding wait queue and getting driver tag
|
|
|
|
* may cause __sbitmap_queue_wake_up() to wake up nothing because
|
|
|
|
* the waitqueue_active() may not observe us in wait queue.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
|
2017-02-23 02:58:29 +08:00
|
|
|
/*
|
2017-11-09 23:32:43 +08:00
|
|
|
* It's possible that a tag was freed in the window between the
|
|
|
|
* allocation failure and adding the hardware queue to the wait
|
|
|
|
* queue.
|
2017-02-23 02:58:29 +08:00
|
|
|
*/
|
2018-06-25 19:31:45 +08:00
|
|
|
ret = blk_mq_get_driver_tag(rq);
|
2018-01-11 05:41:21 +08:00
|
|
|
if (!ret) {
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-11 05:41:21 +08:00
|
|
|
return false;
|
2017-11-09 23:32:43 +08:00
|
|
|
}
|
2018-01-11 05:41:21 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We got a tag, remove ourselves from the wait queue to ensure
|
|
|
|
* someone else gets the wakeup.
|
|
|
|
*/
|
|
|
|
list_del_init(&wait->entry);
|
2019-03-26 02:34:10 +08:00
|
|
|
atomic_dec(&sbq->ws_active);
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-11 05:41:21 +08:00
|
|
|
|
|
|
|
return true;
|
2017-02-23 02:58:29 +08:00
|
|
|
}
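
The function above follows a "register as waiter, issue a full barrier, then retry the allocation" pattern so that a tag freed in the window between the failed allocation and the wait-queue insertion is not missed. The following is only a rough user-space analogy using C11 atomics. The toy_* names are hypothetical, a single flag stands in for the wait queue entry, and the locking done by the real code is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_tags {
	atomic_int free_tags;		/* number of tags currently free */
	atomic_bool waiter_registered;	/* stands in for the wait queue entry */
};

/* Take one tag if any is free; lock-free decrement-if-positive. */
static bool toy_try_get_tag(struct toy_tags *t)
{
	int cur = atomic_load(&t->free_tags);

	while (cur > 0) {
		/* On failure 'cur' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&t->free_tags, &cur, cur - 1))
			return true;
	}
	return false;
}

/*
 * Register as a waiter first, then retry. Returns true if a tag was
 * obtained (and the waiter registration was dropped again), false if
 * the caller must wait for a wakeup.
 */
static bool toy_mark_tag_wait(struct toy_tags *t)
{
	atomic_store_explicit(&t->waiter_registered, true,
			      memory_order_relaxed);

	/* Order "become visible as waiter" before "look for a free tag". */
	atomic_thread_fence(memory_order_seq_cst);

	if (!toy_try_get_tag(t))
		return false;	/* stay registered; a later free wakes us */

	/* Got a tag after all: deregister so someone else gets the wakeup. */
	atomic_store(&t->waiter_registered, false);
	return true;
}
```

The side that frees a tag is assumed to do the mirror image: make the tag available, issue a barrier, then check for registered waiters. With both barriers in place at least one side observes the other.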
|
|
|
|
|
2018-07-03 23:03:16 +08:00
|
|
|
#define BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT 8
|
|
|
|
#define BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR 4
|
|
|
|
/*
|
|
|
|
* Update dispatch busy with the Exponential Weighted Moving Average(EWMA):
|
|
|
|
* - EWMA is one simple way to compute running average value
|
|
|
|
* - weight(7/8 and 1/8) is applied so that it can decrease exponentially
|
|
|
|
* - take 4 as factor for avoiding to get too small(0) result, and this
|
|
|
|
* factor doesn't matter because EWMA decreases exponentially
|
|
|
|
*/
|
|
|
|
static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
|
|
|
|
{
|
|
|
|
unsigned int ewma;
|
|
|
|
|
|
|
|
ewma = hctx->dispatch_busy;
|
|
|
|
|
|
|
|
if (!ewma && !busy)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
|
|
|
|
if (busy)
|
|
|
|
ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
|
|
|
|
ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;
|
|
|
|
|
|
|
|
hctx->dispatch_busy = ewma;
|
|
|
|
}
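
To see how the busy average behaves, the same integer arithmetic can be run outside the kernel. This sketch copies the weight (8) and factor (4) from the defines above; the toy_* names are hypothetical and the value is passed in and returned rather than stored in hctx->dispatch_busy.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_EWMA_WEIGHT 8
#define TOY_EWMA_FACTOR 4

/* One update: new = (old * 7 + (busy ? 16 : 0)) / 8, in integer math. */
static unsigned int toy_ewma_step(unsigned int ewma, bool busy)
{
	if (!ewma && !busy)
		return 0;

	ewma *= TOY_EWMA_WEIGHT - 1;
	if (busy)
		ewma += 1u << TOY_EWMA_FACTOR;
	ewma /= TOY_EWMA_WEIGHT;

	return ewma;
}

/* Apply the same busy/idle outcome 'steps' times in a row. */
static unsigned int toy_ewma_run(unsigned int ewma, bool busy,
				 unsigned int steps)
{
	while (steps--)
		ewma = toy_ewma_step(ewma, busy);
	return ewma;
}
```

Worked through by hand, a run of busy dispatches from zero gives 2, 3, 4, 5, 6, 7, 8, 9 and then stays at 9, because (9 * 7 + 16) / 8 truncates back to 9. A run of non-busy dispatches from 9 gives 7, 6, 5, 4, 3, 2, 1, 0, so the value returns to zero after eight idle updates.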
|
|
|
|
|
2018-01-31 11:04:57 +08:00
|
|
|
#define BLK_MQ_RESOURCE_DELAY 3 /* ms units */
|
|
|
|
|
2020-03-24 23:24:44 +08:00
|
|
|
static void blk_mq_handle_dev_resource(struct request *rq,
|
|
|
|
struct list_head *list)
|
|
|
|
{
|
|
|
|
list_add(&rq->queuelist, list);
|
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
}
|
|
|
|
|
2020-06-30 18:24:58 +08:00
|
|
|
enum prep_dispatch {
|
|
|
|
PREP_DISPATCH_OK,
|
|
|
|
PREP_DISPATCH_NO_TAG,
|
|
|
|
PREP_DISPATCH_NO_BUDGET,
|
|
|
|
};
|
|
|
|
|
|
|
|
static enum prep_dispatch blk_mq_prep_dispatch_rq(struct request *rq,
|
|
|
|
bool need_budget)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2021-01-22 10:33:12 +08:00
|
|
|
int budget_token = -1;
|
2020-06-30 18:24:58 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
if (need_budget) {
|
|
|
|
budget_token = blk_mq_get_dispatch_budget(rq->q);
|
|
|
|
if (budget_token < 0) {
|
|
|
|
blk_mq_put_driver_tag(rq);
|
|
|
|
return PREP_DISPATCH_NO_BUDGET;
|
|
|
|
}
|
|
|
|
blk_mq_set_rq_budget_token(rq, budget_token);
|
2020-06-30 18:24:58 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!blk_mq_get_driver_tag(rq)) {
|
|
|
|
/*
|
|
|
|
* The initial allocation attempt failed, so we need to
|
|
|
|
* rerun the hardware queue when a tag is freed. The
|
|
|
|
* waitqueue takes care of that. If the queue is run
|
|
|
|
* before we add this entry back on the dispatch list,
|
|
|
|
* we'll re-run it below.
|
|
|
|
*/
|
|
|
|
if (!blk_mq_mark_tag_wait(hctx, rq)) {
|
2020-06-30 18:25:00 +08:00
|
|
|
/*
|
|
|
|
* All budgets not got from this function will be put
|
|
|
|
* together during handling partial dispatch
|
|
|
|
*/
|
|
|
|
if (need_budget)
|
2021-01-22 10:33:12 +08:00
|
|
|
blk_mq_put_dispatch_budget(rq->q, budget_token);
|
2020-06-30 18:24:58 +08:00
|
|
|
return PREP_DISPATCH_NO_TAG;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return PREP_DISPATCH_OK;
|
|
|
|
}
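
The preparation step above always takes the dispatch budget before the driver tag and gives the budget back if the tag cannot be obtained. A simplified sketch of that ordering with two plain counters follows. The toy_* names are hypothetical, and the model leaves out both the tag-wait retry done through blk_mq_mark_tag_wait() and the release of an already-held driver tag on budget failure.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_queue {
	int budget_free;	/* remaining dispatch budget */
	int tags_free;		/* remaining driver tags */
};

enum toy_prep {
	TOY_PREP_OK,
	TOY_PREP_NO_TAG,
	TOY_PREP_NO_BUDGET,
};

/* Acquire budget first, then a tag; undo the budget if the tag fails. */
static enum toy_prep toy_prep_dispatch(struct toy_queue *q, bool need_budget)
{
	bool got_budget = false;

	assert(q->budget_free >= 0 && q->tags_free >= 0);

	if (need_budget) {
		if (q->budget_free == 0)
			return TOY_PREP_NO_BUDGET;
		q->budget_free--;
		got_budget = true;
	}

	if (q->tags_free == 0) {
		/* Only return budget that this call itself acquired. */
		if (got_budget)
			q->budget_free++;
		return TOY_PREP_NO_TAG;
	}
	q->tags_free--;

	return TOY_PREP_OK;
}
```

Because every path requests the two resources in the same order and releases the first when the second is unavailable, no path can end up holding a tag while waiting on budget that another path holds while waiting on a tag.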
|
|
|
|
|
2020-06-30 18:25:00 +08:00
|
|
|
/* release all allocated budgets before calling blk_mq_dispatch_rq_list */
|
|
|
|
static void blk_mq_release_budgets(struct request_queue *q,
|
2021-01-22 10:33:12 +08:00
|
|
|
struct list_head *list)
|
2020-06-30 18:25:00 +08:00
|
|
|
{
|
2021-01-22 10:33:12 +08:00
|
|
|
struct request *rq;
|
2020-06-30 18:25:00 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
list_for_each_entry(rq, list, queuelist) {
|
|
|
|
int budget_token = blk_mq_get_rq_budget_token(rq);
|
2020-06-30 18:25:00 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
if (budget_token >= 0)
|
|
|
|
blk_mq_put_dispatch_budget(q, budget_token);
|
|
|
|
}
|
2020-06-30 18:25:00 +08:00
|
|
|
}
|
|
|
|
|
2023-01-18 17:37:19 +08:00
|
|
|
/*
|
|
|
|
* blk_mq_commit_rqs will notify driver using bd->last that there is no
|
|
|
|
* more requests. (See comment in struct blk_mq_ops for commit_rqs for
|
|
|
|
* details)
|
|
|
|
* Attention, we should explicitly call this in unusual cases:
|
|
|
|
* 1) did not queue everything initially scheduled to queue
|
|
|
|
* 2) the last attempt to queue a request failed
|
|
|
|
*/
|
|
|
|
static void blk_mq_commit_rqs(struct blk_mq_hw_ctx *hctx, int queued,
|
|
|
|
bool from_schedule)
|
|
|
|
{
|
|
|
|
if (hctx->queue->mq_ops->commit_rqs && queued) {
|
|
|
|
trace_block_unplug(hctx->queue, queued, !from_schedule);
|
|
|
|
hctx->queue->mq_ops->commit_rqs(hctx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queueable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29 01:54:01 +08:00
|
|
|
/*
|
|
|
|
* Returns true if we did some work AND can potentially do more.
|
|
|
|
*/
|
2020-06-30 18:24:57 +08:00
|
|
|
bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
|
2020-06-30 18:25:00 +08:00
|
|
|
unsigned int nr_budgets)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2020-06-30 18:24:58 +08:00
|
|
|
enum prep_dispatch prep;
|
2020-06-30 18:24:57 +08:00
|
|
|
struct request_queue *q = hctx->queue;
|
2023-01-18 17:37:24 +08:00
|
|
|
struct request *rq;
|
2023-01-18 17:37:23 +08:00
|
|
|
int queued;
|
2018-01-31 11:04:57 +08:00
|
|
|
blk_status_t ret = BLK_STS_OK;
|
2021-10-27 00:51:27 +08:00
|
|
|
bool needs_resource = false;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 22:56:26 +08:00
|
|
|
if (list_empty(list))
|
|
|
|
return false;
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* Now process all the entries, sending them to the driver.
|
|
|
|
*/
|
2023-01-18 17:37:23 +08:00
|
|
|
queued = 0;
|
2017-04-07 22:56:26 +08:00
|
|
|
do {
|
2014-10-30 01:14:52 +08:00
|
|
|
struct blk_mq_queue_data bd;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-12-07 23:41:17 +08:00
|
|
|
rq = list_first_entry(list, struct request, queuelist);
|
blk-mq: order getting budget and driver tag
This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:
1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.
2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.
3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B
4) neither IO path can make progress, and an IO stall results.
This issue can be observed when running dbench on USB storage.
This patch fixes this issue by always getting budget before getting
driver tag.
Cc: stable@vger.kernel.org
Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
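The ordering rule described in this commit message can be sketched in userspace as follows. This is a minimal model, not the kernel implementation: struct pool, pool_get(), pool_put(), dispatch_prepare() and dispatch_complete() are hypothetical names introduced only for illustration, and each pool is a plain counter rather than a real budget or tag allocator.

```c
#include <stdbool.h>

/* Hypothetical model: each resource pool is just a count of free units. */
struct pool {
	int free;
};

static bool pool_get(struct pool *p)
{
	if (p->free == 0)
		return false;
	p->free--;
	return true;
}

static void pool_put(struct pool *p)
{
	p->free++;
}

/*
 * Always take budget first, then the driver tag. If the tag is not
 * available, the budget is given back, so a failed attempt never holds
 * one resource while waiting for the other.
 */
static bool dispatch_prepare(struct pool *budget, struct pool *tag)
{
	if (!pool_get(budget))
		return false;
	if (!pool_get(tag)) {
		pool_put(budget);
		return false;
	}
	return true;
}

/* Release both resources once the modelled request has completed. */
static void dispatch_complete(struct pool *budget, struct pool *tag)
{
	pool_put(tag);
	pool_put(budget);
}
```

Because every path in the model acquires the two resources in the same order and backs out on failure, the hold-and-wait cycle from steps 2) and 3) cannot form.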
2018-04-05 00:35:21 +08:00
|
|
|
|
2020-06-30 18:24:57 +08:00
|
|
|
WARN_ON_ONCE(hctx != rq->mq_hctx);
|
2020-06-30 18:25:00 +08:00
|
|
|
prep = blk_mq_prep_dispatch_rq(rq, !nr_budgets);
|
2020-06-30 18:24:58 +08:00
|
|
|
if (prep != PREP_DISPATCH_OK)
|
2018-04-05 00:35:21 +08:00
|
|
|
break;
|
2017-10-14 17:22:29 +08:00
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
list_del_init(&rq->queuelist);
|
|
|
|
|
2014-10-30 01:14:52 +08:00
|
|
|
bd.rq = rq;
|
2023-01-18 17:37:24 +08:00
|
|
|
bd.last = list_empty(list);
|
2014-10-30 01:14:52 +08:00
|
|
|
|
2020-06-30 18:25:00 +08:00
|
|
|
/*
|
|
|
|
* once the request is queued to lld, no need to cover the
|
|
|
|
* budget any more
|
|
|
|
*/
|
|
|
|
if (nr_budgets)
|
|
|
|
nr_budgets--;
|
2014-10-30 01:14:52 +08:00
|
|
|
ret = q->mq_ops->queue_rq(hctx, &bd);
|
2020-07-01 21:58:57 +08:00
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
|
|
|
queued++;
|
2013-10-24 16:20:05 +08:00
|
|
|
break;
|
2020-07-01 21:58:57 +08:00
|
|
|
case BLK_STS_RESOURCE:
|
2021-10-27 00:51:27 +08:00
|
|
|
needs_resource = true;
|
|
|
|
fallthrough;
|
2020-07-01 21:58:57 +08:00
|
|
|
case BLK_STS_DEV_RESOURCE:
|
|
|
|
blk_mq_handle_dev_resource(rq, list);
|
|
|
|
goto out;
|
|
|
|
default:
|
2020-09-30 16:02:53 +08:00
|
|
|
blk_mq_end_request(rq, ret);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 22:56:26 +08:00
|
|
|
} while (!list_empty(list));
|
2020-07-01 21:58:57 +08:00
|
|
|
out:
|
2020-09-05 19:25:56 +08:00
|
|
|
/* If we didn't flush the entire list, we could have told the driver
|
|
|
|
* there was more coming, but that turned out to be a lie.
|
|
|
|
*/
|
2023-01-18 17:37:22 +08:00
|
|
|
if (!list_empty(list) || ret != BLK_STS_OK)
|
|
|
|
blk_mq_commit_rqs(hctx, queued, false);
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* Any items that need requeuing? Stuff them into hctx->dispatch,
|
|
|
|
* that is where we will continue on next queue run.
|
|
|
|
*/
|
2016-12-07 23:41:17 +08:00
|
|
|
if (!list_empty(list)) {
|
2018-01-31 11:04:57 +08:00
|
|
|
bool needs_restart;
|
2020-06-30 18:24:58 +08:00
|
|
|
/* For non-shared tags, the RESTART check will suffice */
|
|
|
|
bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
|
2023-01-18 17:37:16 +08:00
|
|
|
((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
|
|
|
|
blk_mq_is_shared_tags(hctx->flags));
|
2018-01-31 11:04:57 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
if (nr_budgets)
|
|
|
|
blk_mq_release_budgets(q, list);
|
2018-01-31 11:04:57 +08:00
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
spin_lock(&hctx->lock);
|
2020-02-25 09:04:32 +08:00
|
|
|
list_splice_tail_init(list, &hctx->dispatch);
|
2013-10-24 16:20:05 +08:00
|
|
|
spin_unlock(&hctx->lock);
|
2016-12-07 23:41:17 +08:00
|
|
|
|
2020-08-17 18:01:15 +08:00
|
|
|
/*
|
|
|
|
* Order adding requests to hctx->dispatch and checking
|
|
|
|
* the SCHED_RESTART flag. This smp_mb() pairs with the one
|
|
|
|
* in blk_mq_sched_restart(). It keeps the restart code path
|
|
|
|
* from missing requests newly added to hctx->dispatch while
|
|
|
|
* SCHED_RESTART is still observed as set here.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
|
2015-05-05 04:32:48 +08:00
|
|
|
/*
|
2017-04-08 02:16:51 +08:00
|
|
|
* If SCHED_RESTART was set by the caller of this function and
|
|
|
|
* it is no longer set that means that it was cleared by another
|
|
|
|
* thread and hence that a queue rerun is needed.
|
2015-05-05 04:32:48 +08:00
|
|
|
*
|
2017-11-09 23:32:43 +08:00
|
|
|
* If 'no_tag' is set, that means that we failed getting
|
|
|
|
* a driver tag with an I/O scheduler attached. If our dispatch
|
|
|
|
* waitqueue is no longer active, ensure that we run the queue
|
|
|
|
* AFTER adding our entries back to the list.
|
2017-01-17 21:03:22 +08:00
|
|
|
*
|
2017-04-08 02:16:51 +08:00
|
|
|
* If no I/O scheduler has been configured it is possible that
|
|
|
|
* the hardware queue got stopped and restarted before requests
|
|
|
|
* were pushed back onto the dispatch list. Rerun the queue to
|
|
|
|
* avoid starvation. Notes:
|
|
|
|
* - blk_mq_run_hw_queue() checks whether or not a queue has
|
|
|
|
* been stopped before rerunning a queue.
|
|
|
|
* - Some but not all block drivers stop a queue before
|
2017-06-03 15:38:05 +08:00
|
|
|
* returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
|
2017-04-08 02:16:51 +08:00
|
|
|
* and dm-rq.
|
2018-01-31 11:04:57 +08:00
|
|
|
*
|
|
|
|
* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
|
|
|
|
* bit is set, run queue after a delay to avoid IO stalls
|
2020-04-21 00:24:51 +08:00
|
|
|
* that could otherwise occur if the queue is idle. We'll do
|
2021-10-27 00:51:27 +08:00
|
|
|
* similar if we couldn't get budget or couldn't lock a zone
|
|
|
|
* and SCHED_RESTART is set.
|
2017-01-17 21:03:22 +08:00
|
|
|
*/
|
2018-01-31 11:04:57 +08:00
|
|
|
needs_restart = blk_mq_sched_needs_restart(hctx);
|
2021-10-27 00:51:27 +08:00
|
|
|
if (prep == PREP_DISPATCH_NO_BUDGET)
|
|
|
|
needs_resource = true;
|
2018-01-31 11:04:57 +08:00
|
|
|
if (!needs_restart ||
|
2017-11-09 23:32:43 +08:00
|
|
|
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
|
2017-01-17 21:03:22 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
2022-09-05 18:19:50 +08:00
|
|
|
else if (needs_resource)
|
2018-01-31 11:04:57 +08:00
|
|
|
blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
|
blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queueable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
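The behaviour this commit message asks for (stop at the first busy return and tell the caller not to pull more from the scheduler) can be sketched as a small userspace model. Everything here is an assumption made for illustration: enum model_status, model_queue_rq() and model_dispatch() are hypothetical names, requests are plain integers, and a negative id stands in for a non-queueable command.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for an OK return and a busy return. */
enum model_status { MODEL_OK, MODEL_BUSY };

/* Hypothetical driver hook: negative ids model non-queueable commands. */
static enum model_status model_queue_rq(int rq)
{
	return rq < 0 ? MODEL_BUSY : MODEL_OK;
}

/*
 * Queue requests in order and stop at the first busy return.
 * Returns true only if the whole list was queued, which is the signal
 * that the caller may fetch more work from the scheduler.
 */
static bool model_dispatch(const int *rqs, size_t n, size_t *queued)
{
	*queued = 0;
	for (size_t i = 0; i < n; i++) {
		if (model_queue_rq(rqs[i]) == MODEL_BUSY)
			return false;	/* rqs[i..n-1] stay for a later run */
		(*queued)++;
	}
	return true;
}
```

A caller that loops on "dispatch, then fetch more" would exit its loop as soon as model_dispatch() returns false, so the busy request is retried first instead of being overtaken by newer ones.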
2018-06-29 01:54:01 +08:00
|
|
|
|
2018-07-03 23:03:16 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, true);
|
2018-06-29 01:54:01 +08:00
|
|
|
return false;
|
2023-01-18 17:37:23 +08:00
|
|
|
}
|
2016-12-07 23:41:17 +08:00
|
|
|
|
2023-01-18 17:37:23 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
|
|
|
return true;
|
2016-12-07 23:41:17 +08:00
|
|
|
}
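The rerun decision at the end of blk_mq_dispatch_rq_list() above can be read as a pure function of four booleans. The sketch below restates that if/else chain in isolation; enum rerun and requeue_action() are hypothetical names, and wait_entry_empty stands in for list_empty_careful(&hctx->dispatch_wait.entry).

```c
#include <stdbool.h>

/* Hypothetical result type: what to do after requeueing leftovers. */
enum rerun { RERUN_NONE, RERUN_NOW, RERUN_DELAYED };

/*
 * Mirrors the condition chain in the function above:
 *  - rerun immediately if SCHED_RESTART is no longer set, or if a tag
 *    allocation failed and nothing is parked on the dispatch waitqueue;
 *  - otherwise rerun after a delay if the driver or budget path
 *    reported a resource shortage;
 *  - otherwise rely on the pending restart.
 */
static enum rerun requeue_action(bool needs_restart, bool no_tag,
				 bool wait_entry_empty, bool needs_resource)
{
	if (!needs_restart || (no_tag && wait_entry_empty))
		return RERUN_NOW;
	if (needs_resource)
		return RERUN_DELAYED;
	return RERUN_NONE;
}
```

In the kernel function RERUN_NOW corresponds to the blk_mq_run_hw_queue(hctx, true) call and RERUN_DELAYED to blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY).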
|
|
|
|
|
2018-04-08 17:48:10 +08:00
|
|
|
static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
|
|
|
|
|
|
|
|
if (cpu >= nr_cpu_ids)
|
|
|
|
cpu = cpumask_first(hctx->cpumask);
|
|
|
|
return cpu;
|
|
|
|
}
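The fallback in blk_mq_first_mapped_cpu() above (prefer an online CPU from the hctx mask, otherwise take the first CPU of the mask) can be tried in userspace with a 64-bit bitmap. This is only a model under stated assumptions: MODEL_NR_CPU_IDS, model_mask_first() and model_first_mapped_cpu() are hypothetical names, and a uint64_t replaces the kernel's struct cpumask.

```c
#include <stdint.h>

/* Hypothetical limit: the model supports CPU ids 0..63. */
#define MODEL_NR_CPU_IDS 64

/* Lowest set bit of mask, or MODEL_NR_CPU_IDS if the mask is empty. */
static int model_mask_first(uint64_t mask)
{
	for (int cpu = 0; cpu < MODEL_NR_CPU_IDS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return MODEL_NR_CPU_IDS;
}

/*
 * Prefer the first CPU that is both mapped to the hctx and online.
 * If none of the mapped CPUs is online, fall back to the first mapped
 * CPU. An empty hctx mask yields MODEL_NR_CPU_IDS.
 */
static int model_first_mapped_cpu(uint64_t hctx_mask, uint64_t online_mask)
{
	int cpu = model_mask_first(hctx_mask & online_mask);

	if (cpu >= MODEL_NR_CPU_IDS)
		cpu = model_mask_first(hctx_mask);
	return cpu;
}
```

The empty-mask result is the same out-of-range value that blk_mq_hctx_empty_cpumask() tests for with hctx->next_cpu >= nr_cpu_ids.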
|
|
|
|
|
2024-03-22 10:12:44 +08:00
|
|
|
/*
|
|
|
|
* ->next_cpu is always calculated from hctx->cpumask, so simply use
|
|
|
|
* it for speeding up the check
|
|
|
|
*/
|
|
|
|
static bool blk_mq_hctx_empty_cpumask(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
return hctx->next_cpu >= nr_cpu_ids;
|
|
|
|
}
|
|
|
|
|
2014-05-08 00:26:44 +08:00
|
|
|
/*
|
|
|
|
* It'd be great if the workqueue API had a way to pass
|
|
|
|
* in a mask and had some smarts for more clever placement.
|
|
|
|
* For now we just round-robin here, switching for every
|
|
|
|
* BLK_MQ_CPU_WORK_BATCH queued items.
|
|
|
|
*/
|
|
|
|
static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is chosen from the possible online CPUs, there is a
race in which hctx->next_cpu may be set to a value >= nr_cpu_ids, which
eventually breaks the workqueue.
The race can be triggered in the following two situations:
1) When a CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD CPU's context, but by that
time the DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may be offline, causing hctx->next_cpu to be set
to a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B and finds that the
queue should be run on another CPU A; CPU A may then go offline at the
same time, leaving all CPUs in hctx->cpumask offline.
This patch deals with the issue by re-selecting the next CPU and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18 00:41:51 +08:00
|
|
|
bool tried = false;
|
2018-04-08 17:48:09 +08:00
|
|
|
int next_cpu = hctx->next_cpu;
|
2018-01-18 00:41:51 +08:00
|
|
|
|
2024-03-22 10:12:44 +08:00
|
|
|
/* Switch to unbound if no allowable CPUs in this hctx */
|
|
|
|
if (hctx->queue->nr_hw_queues == 1 || blk_mq_hctx_empty_cpumask(hctx))
|
2014-11-24 16:27:23 +08:00
|
|
|
return WORK_CPU_UNBOUND;
|
2014-05-08 00:26:44 +08:00
|
|
|
|
|
|
|
if (--hctx->next_cpu_batch <= 0) {
|
2018-01-18 00:41:51 +08:00
|
|
|
select_cpu:
|
2018-04-08 17:48:09 +08:00
|
|
|
next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
|
2018-01-12 10:53:06 +08:00
|
|
|
cpu_online_mask);
|
2014-05-08 00:26:44 +08:00
|
|
|
if (next_cpu >= nr_cpu_ids)
|
2018-04-08 17:48:10 +08:00
|
|
|
next_cpu = blk_mq_first_mapped_cpu(hctx);
|
2014-05-08 00:26:44 +08:00
|
|
|
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
|
|
|
|
}
|
|
|
|
|
2018-01-18 00:41:51 +08:00
|
|
|
/*
|
|
|
|
* Do unbound schedule if we can't find an online CPU for this hctx,
|
|
|
|
* and it should only happen in the path of handling CPU DEAD.
|
|
|
|
*/
|
2018-04-08 17:48:09 +08:00
|
|
|
if (!cpu_online(next_cpu)) {
|
2018-01-18 00:41:51 +08:00
|
|
|
if (!tried) {
|
|
|
|
tried = true;
|
|
|
|
goto select_cpu;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure to re-select the CPU next time, once CPUs
|
|
|
|
* in hctx->cpumask become online again.
|
|
|
|
*/
|
2018-04-08 17:48:09 +08:00
|
|
|
hctx->next_cpu = next_cpu;
|
2018-01-18 00:41:51 +08:00
|
|
|
hctx->next_cpu_batch = 1;
|
|
|
|
return WORK_CPU_UNBOUND;
|
|
|
|
}
|
2018-04-08 17:48:09 +08:00
|
|
|
|
|
|
|
hctx->next_cpu = next_cpu;
|
|
|
|
return next_cpu;
|
2014-05-08 00:26:44 +08:00
|
|
|
}
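The round-robin selection in blk_mq_hctx_next_cpu() can be modelled in userspace. What follows is a minimal sketch, not the kernel code: cpumasks are plain 32-bit bitmaps, and every sketch_* name, SKETCH_NR_CPUS, SKETCH_BATCH (shrunk to 2 so the rotation is easy to trace) and SKETCH_UNBOUND is a hypothetical stand-in. The nr_hw_queues == 1 shortcut is left out, and sketch_first_mapped() assumes that blk_mq_first_mapped_cpu() prefers an online mapped CPU and otherwise falls back to the first mapped CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPUS  8     /* hypothetical stand-in for nr_cpu_ids */
#define SKETCH_BATCH    2     /* stand-in for BLK_MQ_CPU_WORK_BATCH, shrunk for tracing */
#define SKETCH_UNBOUND  (-1)  /* stand-in for WORK_CPU_UNBOUND */

/* Hypothetical, reduced model of the fields the selection logic touches. */
struct sketch_hctx {
	uint32_t cpumask;       /* bit n set: CPU n is mapped to this queue */
	int next_cpu;
	int next_cpu_batch;
};

/* First CPU after 'prev' that is set in both masks, or SKETCH_NR_CPUS. */
static int sketch_next_and(int prev, uint32_t a, uint32_t b)
{
	for (int cpu = prev + 1; cpu < SKETCH_NR_CPUS; cpu++)
		if (((a & b) >> cpu) & 1u)
			return cpu;
	return SKETCH_NR_CPUS;
}

/* Assumed behaviour: first online mapped CPU, else first mapped CPU. */
static int sketch_first_mapped(const struct sketch_hctx *h, uint32_t online)
{
	int cpu = sketch_next_and(-1, h->cpumask, online);

	if (cpu >= SKETCH_NR_CPUS)
		cpu = sketch_next_and(-1, h->cpumask, h->cpumask);
	return cpu;
}

static int sketch_next_cpu(struct sketch_hctx *h, uint32_t online)
{
	bool tried = false;
	int next_cpu = h->next_cpu;

	/* An out-of-range next_cpu marks a queue with no mapped CPU. */
	if (h->next_cpu >= SKETCH_NR_CPUS)
		return SKETCH_UNBOUND;
	assert(h->cpumask != 0);

	/* Move to the next mapped online CPU once the batch is used up. */
	if (--h->next_cpu_batch <= 0) {
select_cpu:
		next_cpu = sketch_next_and(next_cpu, h->cpumask, online);
		if (next_cpu >= SKETCH_NR_CPUS)
			next_cpu = sketch_first_mapped(h, online);
		h->next_cpu_batch = SKETCH_BATCH;
	}

	if (!((online >> next_cpu) & 1u)) {
		/* Retry once, then give up and run unbound. */
		if (!tried) {
			tried = true;
			goto select_cpu;
		}
		/* A batch of 1 forces re-selection on the next call. */
		h->next_cpu = next_cpu;
		h->next_cpu_batch = 1;
		return SKETCH_UNBOUND;
	}

	h->next_cpu = next_cpu;
	return next_cpu;
}
```

With cpumask 0x6 (CPUs 1 and 2), starting on CPU 1 with a full batch, successive calls with every CPU online return 1, 2, 2, 1. A call made while both mapped CPUs are offline returns SKETCH_UNBOUND and resets the batch to 1, so the following call re-selects straight away.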
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
2023-04-13 14:06:50 +08:00
|
|
|
* blk_mq_delay_run_hw_queue - Run a hardware queue asynchronously.
|
2020-01-07 02:08:18 +08:00
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
2020-12-04 23:20:55 +08:00
|
|
|
* @msecs: Milliseconds of delay to wait before running the queue.
|
2020-01-07 02:08:18 +08:00
|
|
|
*
|
2023-04-13 14:06:50 +08:00
|
|
|
* Run a hardware queue asynchronously with a delay of @msecs.
|
2020-01-07 02:08:18 +08:00
|
|
|
*/
|
2023-04-13 14:06:50 +08:00
|
|
|
void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2017-06-21 02:15:49 +08:00
|
|
|
if (unlikely(blk_mq_hctx_stopped(hctx)))
|
2013-10-24 16:20:05 +08:00
|
|
|
return;
|
2018-01-20 00:58:55 +08:00
|
|
|
kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
|
|
|
|
msecs_to_jiffies(msecs));
|
2017-04-08 02:16:52 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_run_hw_queue - Start to run a hardware queue.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
|
|
|
* @async: If we want to run the queue asynchronously.
|
|
|
|
*
|
|
|
|
* Check if the request queue is not in a quiesced state and if there are
|
|
|
|
* pending requests to be sent. If this is true, run the queue to send requests
|
|
|
|
* to hardware.
|
|
|
|
*/
|
2019-10-30 00:59:30 +08:00
|
|
|
void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
|
2017-04-08 02:16:52 +08:00
|
|
|
{
|
2018-01-06 16:27:38 +08:00
|
|
|
bool need_run;
|
|
|
|
|
2023-04-13 14:06:51 +08:00
|
|
|
/*
|
|
|
|
* We can't run the queue inline with interrupts disabled.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(!async && in_interrupt());
|
|
|
|
|
block: Improve performance for BLK_MQ_F_BLOCKING drivers
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.
The following caller chains have been reviewed:
blk_mq_run_hw_queue(hctx, false)
blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */
blk_execute_rq() /* may sleep */
blk_mq_run_hw_queues(q, async=false)
blk_freeze_queue_start() /* may sleep */
blk_mq_requeue_work() /* may sleep */
scsi_kick_queue()
scsi_requeue_run_queue() /* may sleep */
scsi_run_host_queues()
scsi_ioctl_reset() /* may sleep */
blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list(plug, from_schedule=false)
__blk_flush_plug(plug, from_schedule=false)
blk_add_rq_to_plug()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_plug_issue_direct()
blk_mq_flush_plug_list() /* see above */
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list() /* see above */
blk_mq_try_issue_directly()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_try_issue_list_directly(hctx, list)
blk_mq_insert_requests() /* see above */
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-22 01:27:30 +08:00
|
|
|
might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
|
|
|
|
|
2018-01-06 16:27:38 +08:00
|
|
|
/*
|
|
|
|
* When queue is quiesced, we may be switching io scheduler, or
|
|
|
|
* updating nr_hw_queues, or other things, and we can't run queue
|
|
|
|
* any more, even __blk_mq_hctx_has_pending() can't be called safely.
|
|
|
|
*
|
|
|
|
* And queue will be rerun in blk_mq_unquiesce_queue() if it is
|
|
|
|
* quiesced.
|
|
|
|
*/
|
2021-12-06 19:12:13 +08:00
|
|
|
__blk_mq_run_dispatch_ops(hctx->queue, false,
|
2021-12-03 21:15:31 +08:00
|
|
|
need_run = !blk_queue_quiesced(hctx->queue) &&
|
|
|
|
blk_mq_hctx_has_pending(hctx));
|
2018-01-06 16:27:38 +08:00
|
|
|
|
2023-04-13 14:06:50 +08:00
|
|
|
if (!need_run)
|
|
|
|
return;
|
|
|
|
|
2023-07-22 01:27:30 +08:00
|
|
|
if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
|
2023-04-13 14:06:50 +08:00
|
|
|
blk_mq_delay_run_hw_queue(hctx, 0);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-04-13 14:06:51 +08:00
|
|
|
blk_mq_run_dispatch_ops(hctx->queue,
|
|
|
|
blk_mq_sched_dispatch_requests(hctx));
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2017-04-14 16:00:00 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_run_hw_queue);
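The control flow of blk_mq_run_hw_queue() reduces to a three-way decision. The sketch below restates it as a pure function for illustration only: the enum, the function name and the boolean parameters are hypothetical, and the locking done by the dispatch-ops wrappers, along with the WARN_ON_ONCE() and might_sleep_if() checks, is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three outcomes of blk_mq_run_hw_queue(). */
enum sketch_action {
	SKETCH_SKIP,    /* queue quiesced or nothing pending: return early */
	SKETCH_DEFER,   /* hand off via blk_mq_delay_run_hw_queue(hctx, 0) */
	SKETCH_INLINE   /* dispatch requests in the caller's context */
};

static enum sketch_action sketch_run_decision(bool quiesced, bool has_pending,
					      bool async, bool cpu_in_hctx_mask)
{
	/* Mirrors: need_run = !blk_queue_quiesced() && blk_mq_hctx_has_pending() */
	bool need_run = !quiesced && has_pending;

	if (!need_run)
		return SKETCH_SKIP;

	/* Asynchronous callers, and callers on a CPU not mapped to the hctx, defer. */
	if (async || !cpu_in_hctx_mask)
		return SKETCH_DEFER;

	return SKETCH_INLINE;
}
```

Only a synchronous caller that is already running on one of the CPUs mapped to the hardware queue dispatches inline. Every other runnable case goes through the delayed work item with a delay of 0.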
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-01-12 00:47:17 +08:00
|
|
|
/*
|
|
|
|
* Return preferred queue to dispatch from (if any) for non-mq aware IO
|
|
|
|
* scheduler.
|
|
|
|
*/
|
|
|
|
static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
|
|
|
|
{
|
2022-05-22 20:23:50 +08:00
|
|
|
struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
|
2021-01-12 00:47:17 +08:00
|
|
|
/*
|
|
|
|
* If the IO scheduler does not respect hardware queues when
|
|
|
|
* dispatching, we just don't bother with multiple HW queues and
|
|
|
|
* dispatch from hctx for the current CPU since running multiple queues
|
|
|
|
* just causes lock contention inside the scheduler and pointless cache
|
|
|
|
* bouncing.
|
|
|
|
*/
|
2022-06-16 06:55:49 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = ctx->hctxs[HCTX_TYPE_DEFAULT];
|
2022-05-22 20:23:50 +08:00
|
|
|
|
2021-01-12 00:47:17 +08:00
|
|
|
if (!blk_mq_hctx_stopped(hctx))
|
|
|
|
return hctx;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
2020-10-24 00:32:54 +08:00
|
|
|
* blk_mq_run_hw_queues - Run all hardware queues in a request queue.
|
2020-01-07 02:08:18 +08:00
|
|
|
* @q: Pointer to the request queue to run.
|
|
|
|
* @async: If we want to run the queue asynchronously.
|
|
|
|
*/
|
2015-03-12 11:56:38 +08:00
|
|
|
void blk_mq_run_hw_queues(struct request_queue *q, bool async)
|
{
	struct blk_mq_hw_ctx *hctx, *sq_hctx;
	unsigned long i;

	sq_hctx = NULL;
	if (blk_queue_sq_sched(q))
		sq_hctx = blk_mq_get_sq_hctx(q);
	queue_for_each_hw_ctx(q, hctx, i) {
		if (blk_mq_hctx_stopped(hctx))
			continue;
		/*
		 * Dispatch from this hctx either if there's no hctx preferred
		 * by IO scheduler or if it has requests that bypass the
		 * scheduler.
		 */
		if (!sq_hctx || sq_hctx == hctx ||
		    !list_empty_careful(&hctx->dispatch))
			blk_mq_run_hw_queue(hctx, async);
	}
}
EXPORT_SYMBOL(blk_mq_run_hw_queues);

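The selection rule in blk_mq_run_hw_queues() can be tried out in userspace. The following is a minimal sketch, not kernel code: `struct model_hctx`, `model_should_run()` and `model_run_hw_queues()` are hypothetical names, and the dispatch list is reduced to a length counter. It only illustrates which queues are picked when the scheduler prefers a single hctx.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct blk_mq_hw_ctx. */
struct model_hctx {
	bool stopped;        /* models the BLK_MQ_S_STOPPED state bit */
	size_t dispatch_len; /* models the number of entries on hctx->dispatch */
	unsigned int runs;   /* how many times this queue was run */
};

/* Mirrors the condition used inside the queue_for_each_hw_ctx() loop. */
static bool model_should_run(const struct model_hctx *sq_hctx,
			     const struct model_hctx *hctx)
{
	assert(hctx != NULL);
	return !sq_hctx || sq_hctx == hctx || hctx->dispatch_len > 0;
}

/* Runs every eligible queue and returns how many were run. */
static unsigned int model_run_hw_queues(struct model_hctx *hctxs, size_t n,
					struct model_hctx *sq_hctx)
{
	unsigned int ran = 0;

	for (size_t i = 0; i < n; i++) {
		if (hctxs[i].stopped)
			continue;
		if (model_should_run(sq_hctx, &hctxs[i])) {
			hctxs[i].runs++;
			ran++;
		}
	}
	return ran;
}
```

With no preferred hctx (`sq_hctx == NULL`) every non-stopped queue is run. With a preferred hctx, other queues are run only if they hold requests that bypass the scheduler.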
/**
 * blk_mq_delay_run_hw_queues - Run all hardware queues asynchronously.
 * @q: Pointer to the request queue to run.
 * @msecs: Milliseconds of delay to wait before running the queues.
 */
void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
{
	struct blk_mq_hw_ctx *hctx, *sq_hctx;
	unsigned long i;

	sq_hctx = NULL;
	if (blk_queue_sq_sched(q))
		sq_hctx = blk_mq_get_sq_hctx(q);
	queue_for_each_hw_ctx(q, hctx, i) {
		if (blk_mq_hctx_stopped(hctx))
			continue;
		/*
		 * If there is already a run_work pending, leave the
		 * pending delay untouched. Otherwise, a hctx can stall
		 * if another hctx is re-delaying the other's work
		 * before the work executes.
		 */
		if (delayed_work_pending(&hctx->run_work))
			continue;
		/*
		 * Dispatch from this hctx either if there's no hctx preferred
		 * by IO scheduler or if it has requests that bypass the
		 * scheduler.
		 */
		if (!sq_hctx || sq_hctx == hctx ||
		    !list_empty_careful(&hctx->dispatch))
			blk_mq_delay_run_hw_queue(hctx, msecs);
	}
}
EXPORT_SYMBOL(blk_mq_delay_run_hw_queues);

/*
 * This function is often used for pausing .queue_rq() by driver when
 * there isn't enough resource or some conditions aren't satisfied, and
 * BLK_STS_RESOURCE is usually returned.
 *
 * We do not guarantee that dispatch can be drained or blocked
 * after blk_mq_stop_hw_queue() returns. Please use
 * blk_mq_quiesce_queue() for that requirement.
 */
void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
{
	cancel_delayed_work(&hctx->run_work);

	set_bit(BLK_MQ_S_STOPPED, &hctx->state);
}
EXPORT_SYMBOL(blk_mq_stop_hw_queue);

/*
 * This function is often used for pausing .queue_rq() by driver when
 * there isn't enough resource or some conditions aren't satisfied, and
 * BLK_STS_RESOURCE is usually returned.
 *
 * We do not guarantee that dispatch can be drained or blocked
 * after blk_mq_stop_hw_queues() returns. Please use
 * blk_mq_quiesce_queue() for that requirement.
 */
void blk_mq_stop_hw_queues(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;

	queue_for_each_hw_ctx(q, hctx, i)
		blk_mq_stop_hw_queue(hctx);
}
EXPORT_SYMBOL(blk_mq_stop_hw_queues);

void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx)
{
	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);

block: Improve performance for BLK_MQ_F_BLOCKING drivers
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.
The following caller chains have been reviewed:
blk_mq_run_hw_queue(hctx, false)
blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */
blk_execute_rq() /* may sleep */
blk_mq_run_hw_queues(q, async=false)
blk_freeze_queue_start() /* may sleep */
blk_mq_requeue_work() /* may sleep */
scsi_kick_queue()
scsi_requeue_run_queue() /* may sleep */
scsi_run_host_queues()
scsi_ioctl_reset() /* may sleep */
blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list(plug, from_schedule=false)
__blk_flush_plug(plug, from_schedule=false)
blk_add_rq_to_plug()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_plug_issue_direct()
blk_mq_flush_plug_list() /* see above */
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list() /* see above */
blk_mq_try_issue_directly()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_try_issue_list_directly(hctx, list)
blk_mq_insert_requests() /* see above */
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
	blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING);
}
EXPORT_SYMBOL(blk_mq_start_hw_queue);

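The `hctx->flags & BLK_MQ_F_BLOCKING` argument in blk_mq_start_hw_queue() picks the async mode for a caller that may be in atomic context. A minimal sketch of that decision follows. The names `MODEL_F_BLOCKING`, `model_async_from_atomic()` and `model_async_from_sleepable()` are hypothetical, and the bit position is an assumption made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value; the real BLK_MQ_F_BLOCKING bit may differ. */
#define MODEL_F_BLOCKING (1u << 5)

static_assert(MODEL_F_BLOCKING != 0, "flag must be a nonzero bit");

/*
 * Chooses the async argument for a caller that may run in atomic context:
 * a queue whose driver may sleep must be run asynchronously from there.
 */
static bool model_async_from_atomic(unsigned int flags)
{
	/* Conversion to bool yields true for any nonzero bit, not just bit 0. */
	return flags & MODEL_F_BLOCKING;
}

/* A caller that is allowed to sleep can always run the queue synchronously. */
static bool model_async_from_sleepable(unsigned int flags)
{
	(void)flags;
	return false;
}
```

Because the parameter type is bool, the masked value is normalized to true or false on conversion, so the flag does not need to occupy bit 0.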
void blk_mq_start_hw_queues(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;

	queue_for_each_hw_ctx(q, hctx, i)
		blk_mq_start_hw_queue(hctx);
}
EXPORT_SYMBOL(blk_mq_start_hw_queues);

void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
	if (!blk_mq_hctx_stopped(hctx))
		return;

	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
	blk_mq_run_hw_queue(hctx, async);
}
EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);

void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-12-09 04:19:30 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
block: Improve performance for BLK_MQ_F_BLOCKING drivers
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.
The following caller chains have been reviewed:
blk_mq_run_hw_queue(hctx, false)
blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */
blk_execute_rq() /* may sleep */
blk_mq_run_hw_queues(q, async=false)
blk_freeze_queue_start() /* may sleep */
blk_mq_requeue_work() /* may sleep */
scsi_kick_queue()
scsi_requeue_run_queue() /* may sleep */
scsi_run_host_queues()
scsi_ioctl_reset() /* may sleep */
blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list(plug, from_schedule=false)
__blk_flush_plug(plug, from_schedule=false)
blk_add_rq_to_plug()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_plug_issue_direct()
blk_mq_flush_plug_list() /* see above */
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list() /* see above */
blk_mq_try_issue_directly()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_try_issue_list_directly(hctx, list)
blk_mq_insert_requests() /* see above */
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-22 01:27:30 +08:00
|
|
|
blk_mq_start_stopped_hw_queue(hctx, async ||
|
|
|
|
(hctx->flags & BLK_MQ_F_BLOCKING));
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues);
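The function above passes `async || (hctx->flags & BLK_MQ_F_BLOCKING)` so that a blocking driver's queue is never run inline from a caller that may be in atomic context. The same decision can be sketched in isolation as below. `SKETCH_F_BLOCKING` and the `sketch_*` functions are hypothetical stand-ins, and the flag value is an assumption rather than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for BLK_MQ_F_BLOCKING; the real value may differ. */
#define SKETCH_F_BLOCKING (1u << 0)

/*
 * Decide whether a queue run must be deferred to a worker. Dispatch for a
 * blocking driver may sleep, so it is forced asynchronous whenever the
 * caller cannot guarantee that sleeping is allowed.
 */
static bool sketch_run_async(unsigned int hctx_flags, bool async)
{
	return async || (hctx_flags & SKETCH_F_BLOCKING) != 0;
}

/* Self-check covering the four input combinations. */
static void sketch_run_async_demo(void)
{
	assert(!sketch_run_async(0, false));
	assert(sketch_run_async(0, true));
	assert(sketch_run_async(SKETCH_F_BLOCKING, false));
	assert(sketch_run_async(SKETCH_F_BLOCKING, true));
}
```

Only the non-blocking, non-async combination runs the queue inline; every other combination defers it.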
|
|
|
|
|
2014-04-17 00:48:08 +08:00
|
|
|
static void blk_mq_run_work_fn(struct work_struct *work)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2023-04-13 14:06:48 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx =
|
|
|
|
container_of(work, struct blk_mq_hw_ctx, run_work.work);
|
2017-06-21 02:15:47 +08:00
|
|
|
|
2023-04-13 14:06:51 +08:00
|
|
|
blk_mq_run_dispatch_ops(hctx->queue,
|
|
|
|
blk_mq_sched_dispatch_requests(hctx));
|
2013-10-24 16:20:05 +08:00
|
|
|
}
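The work function above recovers its `struct blk_mq_hw_ctx` from the embedded work item with `container_of(work, ..., run_work.work)`. A userspace sketch of that pointer arithmetic follows. The `sketch_*` types are simplified, hypothetical stand-ins for the kernel structures, and the macro is a plain `offsetof` version rather than the kernel's type-checked one.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for work_struct, delayed_work and blk_mq_hw_ctx. */
struct sketch_work { int pending; };
struct sketch_delayed_work { struct sketch_work work; long timer; };
struct sketch_hctx { int id; struct sketch_delayed_work run_work; };

/* Map the nested work item back to the hardware context that embeds it. */
static struct sketch_hctx *sketch_hctx_from_work(struct sketch_work *work)
{
	return sketch_container_of(work, struct sketch_hctx, run_work.work);
}
```

The nested designator `run_work.work` is what lets a callback that only receives the innermost work pointer find the outer structure without storing a back pointer.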
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_request_bypass_insert - Insert a request at dispatch list.
|
|
|
|
* @rq: Pointer to request to be inserted.
|
2023-04-13 14:40:55 +08:00
|
|
|
* @flags: BLK_MQ_INSERT_*
|
2020-01-07 02:08:18 +08:00
|
|
|
*
|
2017-09-12 06:43:57 +08:00
|
|
|
* Should only be used carefully, when the caller knows we want to
|
|
|
|
* bypass a potential IO scheduler on the target device.
|
|
|
|
*/
|
2023-05-19 12:40:46 +08:00
|
|
|
static void blk_mq_request_bypass_insert(struct request *rq, blk_insert_t flags)
|
2017-09-12 06:43:57 +08:00
|
|
|
{
|
2018-10-30 05:06:13 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2017-09-12 06:43:57 +08:00
|
|
|
|
|
|
|
spin_lock(&hctx->lock);
|
2023-04-13 14:40:55 +08:00
|
|
|
if (flags & BLK_MQ_INSERT_AT_HEAD)
|
2020-02-25 09:04:32 +08:00
|
|
|
list_add(&rq->queuelist, &hctx->dispatch);
|
|
|
|
else
|
|
|
|
list_add_tail(&rq->queuelist, &hctx->dispatch);
|
2017-09-12 06:43:57 +08:00
|
|
|
spin_unlock(&hctx->lock);
|
|
|
|
}
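`blk_mq_request_bypass_insert()` above chooses between `list_add()` and `list_add_tail()` depending on `BLK_MQ_INSERT_AT_HEAD`. The sketch below mimics that choice on a minimal circular doubly linked list. All `sketch_*` names are hypothetical, and the `hctx->lock` spinlock the kernel takes is deliberately left out, so this is single-threaded only.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list node, modelled on struct list_head. */
struct sketch_list { struct sketch_list *prev, *next; };

static void sketch_list_init(struct sketch_list *head)
{
	head->prev = head->next = head;
}

/* Link node between two neighbours that are currently adjacent. */
static void sketch_insert_between(struct sketch_list *node,
				  struct sketch_list *prev,
				  struct sketch_list *next)
{
	assert(prev->next == next && next->prev == prev);
	next->prev = node;
	node->next = next;
	node->prev = prev;
	prev->next = node;
}

/* Insert rq at the front or the back of the dispatch list. */
static void sketch_bypass_insert(struct sketch_list *dispatch,
				 struct sketch_list *rq, bool at_head)
{
	if (at_head)
		sketch_insert_between(rq, dispatch, dispatch->next);
	else
		sketch_insert_between(rq, dispatch->prev, dispatch);
}
```

Inserting at the head makes the request the next one dispatched, while inserting at the tail preserves arrival order.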
|
|
|
|
|
2023-04-13 14:40:42 +08:00
|
|
|
static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *ctx, struct list_head *list,
|
|
|
|
bool run_queue_async)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2018-07-02 17:35:58 +08:00
|
|
|
struct request *rq;
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type = hctx->type;
|
2018-07-02 17:35:58 +08:00
|
|
|
|
2023-04-13 14:40:41 +08:00
|
|
|
/*
|
|
|
|
* Try to issue requests directly if the hw queue isn't busy to save an
|
|
|
|
* extra enqueue & dequeue to the sw queue.
|
|
|
|
*/
|
|
|
|
if (!hctx->dispatch_busy && !run_queue_async) {
|
|
|
|
blk_mq_run_dispatch_ops(hctx->queue,
|
|
|
|
blk_mq_try_issue_list_directly(hctx, list));
|
|
|
|
if (list_empty(list))
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* preemption doesn't flush plug list, so it's possible ctx->cpu is
|
|
|
|
* offline now
|
|
|
|
*/
|
2018-07-02 17:35:58 +08:00
|
|
|
list_for_each_entry(rq, list, queuelist) {
|
2016-08-25 05:34:35 +08:00
|
|
|
BUG_ON(rq->mq_ctx != ctx);
|
2020-12-04 00:21:39 +08:00
|
|
|
trace_block_rq_insert(rq);
|
2023-07-22 01:27:30 +08:00
|
|
|
if (rq->cmd_flags & REQ_NOWAIT)
|
|
|
|
run_queue_async = true;
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2018-07-02 17:35:58 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
list_splice_tail_init(list, &ctx->rq_lists[type]);
|
2015-10-20 23:13:57 +08:00
|
|
|
blk_mq_hctx_mark_pending(hctx, ctx);
|
2013-10-24 16:20:05 +08:00
|
|
|
spin_unlock(&ctx->lock);
|
2023-04-13 14:40:41 +08:00
|
|
|
out:
|
|
|
|
blk_mq_run_hw_queue(hctx, run_queue_async);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2023-04-13 14:40:54 +08:00
|
|
|
static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
|
2023-04-13 14:40:43 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
|
|
|
|
2023-04-13 14:40:47 +08:00
|
|
|
if (blk_rq_is_passthrough(rq)) {
|
|
|
|
/*
|
|
|
|
* Passthrough requests have to be added to hctx->dispatch
|
|
|
|
* directly. The device may be in a situation where it can't
|
|
|
|
* handle FS requests, and always returns BLK_STS_RESOURCE for
|
|
|
|
* them, which gets them added to hctx->dispatch.
|
|
|
|
*
|
|
|
|
* If a passthrough request is required to unblock the queues,
|
|
|
|
* and it is added to the scheduler queue, there is no chance to
|
|
|
|
* dispatch it given we prioritize requests in hctx->dispatch.
|
|
|
|
*/
|
2023-04-13 14:40:55 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, flags);
|
2023-05-19 12:40:47 +08:00
|
|
|
} else if (req_op(rq) == REQ_OP_FLUSH) {
|
2023-04-13 14:40:43 +08:00
|
|
|
/*
|
|
|
|
* Firstly normal IO request is inserted to scheduler queue or
|
|
|
|
* sw queue, meantime we add flush request to dispatch queue(
|
|
|
|
* hctx->dispatch) directly and there is at most one in-flight
|
|
|
|
* flush request for each hw queue, so it doesn't matter to add
|
|
|
|
* flush request to tail or front of the dispatch queue.
|
|
|
|
*
|
|
|
|
* Secondly in case of NCQ, flush request belongs to non-NCQ
|
|
|
|
* command, and queueing it will fail when there is any
|
|
|
|
* in-flight normal IO request(NCQ command). When adding flush
|
|
|
|
* rq to the front of hctx->dispatch, it is easier to introduce
|
|
|
|
* extra time to flush rq's latency because of S_SCHED_RESTART
|
|
|
|
* compared with adding to the tail of dispatch queue, then
|
|
|
|
* chance of flush merge is increased, and fewer flush requests
|
|
|
|
* will be issued to controller. It is observed that ~10% time
|
|
|
|
* is saved in blktests block/004 on disk attached to AHCI/NCQ
|
|
|
|
* drive when adding flush rq to the front of hctx->dispatch.
|
|
|
|
*
|
|
|
|
* Simply queue flush rq to the front of hctx->dispatch so that
|
|
|
|
* intensive flush workloads can benefit in case of NCQ HW.
|
|
|
|
*/
|
2023-04-13 14:40:55 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, BLK_MQ_INSERT_AT_HEAD);
|
2023-04-13 14:40:47 +08:00
|
|
|
} else if (q->elevator) {
|
2023-04-13 14:40:43 +08:00
|
|
|
LIST_HEAD(list);
|
|
|
|
|
2023-04-13 14:40:47 +08:00
|
|
|
WARN_ON_ONCE(rq->tag != BLK_MQ_NO_TAG);
|
|
|
|
|
2023-04-13 14:40:43 +08:00
|
|
|
list_add(&rq->queuelist, &list);
|
2023-04-13 14:40:56 +08:00
|
|
|
q->elevator->type->ops.insert_requests(hctx, &list, flags);
|
2023-04-13 14:40:43 +08:00
|
|
|
} else {
|
2023-04-13 14:40:45 +08:00
|
|
|
trace_block_rq_insert(rq);
|
|
|
|
|
2023-04-13 14:40:43 +08:00
|
|
|
spin_lock(&ctx->lock);
|
2023-04-13 14:40:54 +08:00
|
|
|
if (flags & BLK_MQ_INSERT_AT_HEAD)
|
2023-04-13 14:40:45 +08:00
|
|
|
list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
|
|
|
|
else
|
|
|
|
list_add_tail(&rq->queuelist,
|
|
|
|
&ctx->rq_lists[hctx->type]);
|
2023-04-13 14:40:44 +08:00
|
|
|
blk_mq_hctx_mark_pending(hctx, ctx);
|
2023-04-13 14:40:43 +08:00
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
}
|
2013-10-24 16:20:05 +08:00
|
|
|
}
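The four-way routing performed by blk_mq_insert_request() above can be modelled in user space. This is only a sketch: every `demo_` name is hypothetical and nothing here is kernel API. It assumes the branch order shown in the function (passthrough first, then flush, then elevator, then the per-CPU software queue).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the places a request can be queued. */
enum demo_target {
	DEMO_DISPATCH,		/* hctx->dispatch, position left to the caller's flags */
	DEMO_DISPATCH_HEAD,	/* hctx->dispatch, always at the front */
	DEMO_SCHEDULER,		/* handed to the elevator's insert_requests() */
	DEMO_SW_QUEUE_HEAD,	/* front of the per-CPU software queue */
	DEMO_SW_QUEUE_TAIL,	/* back of the per-CPU software queue */
};

/* Hypothetical request summary: only the properties the routing looks at. */
struct demo_rq {
	bool passthrough;
	bool flush;
	bool queue_has_elevator;
};

/* Mirrors the if / else-if chain: the first matching branch wins. */
static enum demo_target demo_route(const struct demo_rq *rq, bool at_head)
{
	if (rq->passthrough)
		return DEMO_DISPATCH;
	if (rq->flush)
		return DEMO_DISPATCH_HEAD;
	if (rq->queue_has_elevator)
		return DEMO_SCHEDULER;
	return at_head ? DEMO_SW_QUEUE_HEAD : DEMO_SW_QUEUE_TAIL;
}
```

The ordering matters: a passthrough request never reaches the scheduler even when an elevator is attached, which is the property the comment in the function argues for.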
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
|
|
|
|
unsigned int nr_segs)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2020-09-16 11:53:14 +08:00
|
|
|
int err;
|
|
|
|
|
2019-06-06 18:29:00 +08:00
|
|
|
if (bio->bi_opf & REQ_RAHEAD)
|
|
|
|
rq->cmd_flags |= REQ_FAILFAST_MASK;
|
|
|
|
|
|
|
|
rq->__sector = bio->bi_iter.bi_sector;
|
2024-02-03 04:39:25 +08:00
|
|
|
rq->write_hint = bio->bi_write_hint;
|
2019-06-06 18:29:01 +08:00
|
|
|
blk_rq_bio_prep(rq, bio, nr_segs);
|
2020-09-16 11:53:14 +08:00
|
|
|
|
|
|
|
/* This can't fail, since GFP_NOIO includes __GFP_DIRECT_RECLAIM. */
|
|
|
|
err = blk_crypto_rq_bio_prep(rq, bio, GFP_NOIO);
|
|
|
|
WARN_ON_ONCE(err);
|
2014-05-30 01:00:11 +08:00
|
|
|
|
2020-05-27 13:24:16 +08:00
|
|
|
blk_account_io_start(rq);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2018-01-18 00:25:56 +08:00
|
|
|
static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
|
2021-10-12 19:12:24 +08:00
|
|
|
struct request *rq, bool last)
|
2015-05-09 01:51:32 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
struct blk_mq_queue_data bd = {
|
|
|
|
.rq = rq,
|
2018-11-25 01:15:46 +08:00
|
|
|
.last = last,
|
2015-05-09 01:51:32 +08:00
|
|
|
};
|
2017-06-13 01:22:46 +08:00
|
|
|
blk_status_t ret;
|
2018-01-18 00:25:56 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For OK queue, we are done. For error, caller may kill it.
|
|
|
|
* Any other error (busy), just add it to our list as we
|
|
|
|
* previously would have done.
|
|
|
|
*/
|
|
|
|
ret = q->mq_ops->queue_rq(hctx, &bd);
|
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
2018-07-10 09:03:31 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
2018-01-18 00:25:56 +08:00
|
|
|
break;
|
|
|
|
case BLK_STS_RESOURCE:
|
2018-01-31 11:04:57 +08:00
|
|
|
case BLK_STS_DEV_RESOURCE:
|
2018-07-10 09:03:31 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, true);
|
2018-01-18 00:25:56 +08:00
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
break;
|
|
|
|
default:
|
2018-07-10 09:03:31 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
2018-01-18 00:25:56 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
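The status handling in __blk_mq_issue_directly() above reduces to a small classification. The following is a user-space sketch with hypothetical `demo_` names. It assumes only what the switch statement shows: the two resource statuses mark the hardware queue busy and prepare the request for a retry, and every other status clears the busy hint.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes standing in for blk_status_t values. */
enum demo_status { DEMO_OK, DEMO_RESOURCE, DEMO_DEV_RESOURCE, DEMO_IOERR };

struct demo_outcome {
	bool mark_busy;	/* value passed to the dispatch-busy tracker */
	bool requeue;	/* request is prepared so it can be issued again */
};

static struct demo_outcome demo_classify(enum demo_status st)
{
	struct demo_outcome out = { .mark_busy = false, .requeue = false };

	switch (st) {
	case DEMO_RESOURCE:
	case DEMO_DEV_RESOURCE:
		/* Driver could not take the request right now. */
		out.mark_busy = true;
		out.requeue = true;
		break;
	case DEMO_OK:
	default:
		/* Success and hard errors both clear the busy hint. */
		break;
	}
	return out;
}
```

Note that a hard error is not retried here. As in the original, the caller decides what to do with it.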
|
|
|
|
|
2023-04-13 14:40:49 +08:00
|
|
|
static bool blk_mq_get_budget_and_tag(struct request *rq)
|
2018-01-18 00:25:56 +08:00
|
|
|
{
|
2021-01-22 10:33:12 +08:00
|
|
|
int budget_token;
|
2017-06-06 23:22:00 +08:00
|
|
|
|
2023-04-13 14:40:49 +08:00
|
|
|
budget_token = blk_mq_get_dispatch_budget(rq->q);
|
2021-01-22 10:33:12 +08:00
|
|
|
if (budget_token < 0)
|
2023-04-13 14:40:49 +08:00
|
|
|
return false;
|
2021-01-22 10:33:12 +08:00
|
|
|
blk_mq_set_rq_budget_token(rq, budget_token);
|
2018-06-25 19:31:45 +08:00
|
|
|
if (!blk_mq_get_driver_tag(rq)) {
|
2023-04-13 14:40:49 +08:00
|
|
|
blk_mq_put_dispatch_budget(rq->q, budget_token);
|
|
|
|
return false;
|
2017-11-05 02:21:12 +08:00
|
|
|
}
|
2023-04-13 14:40:49 +08:00
|
|
|
return true;
|
2019-04-05 01:08:43 +08:00
|
|
|
}
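blk_mq_get_budget_and_tag() above acquires two resources in sequence and releases the first if the second cannot be obtained. A minimal user-space sketch of that acquire-with-rollback pattern follows. The counters and all `demo_` names are hypothetical and do not reflect how budgets or tags are actually stored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical resource pool: plain counters instead of real budget/tag maps. */
struct demo_pool {
	int budget_free;
	int tags_free;
};

/* Returns a token >= 0 on success, or -1 when no budget is left. */
static int demo_get_budget(struct demo_pool *p)
{
	if (p->budget_free <= 0)
		return -1;
	p->budget_free--;
	return p->budget_free;
}

static void demo_put_budget(struct demo_pool *p, int token)
{
	(void)token;	/* the token is not needed by this simple model */
	p->budget_free++;
}

static bool demo_get_tag(struct demo_pool *p)
{
	if (p->tags_free <= 0)
		return false;
	p->tags_free--;
	return true;
}

/* Take budget first, then a tag; give the budget back if the tag fails. */
static bool demo_get_budget_and_tag(struct demo_pool *p, int *token_out)
{
	int token = demo_get_budget(p);

	if (token < 0)
		return false;
	if (!demo_get_tag(p)) {
		demo_put_budget(p, token);
		return false;
	}
	*token_out = token;
	return true;
}
```

On failure the pool is left exactly as it was found, so the caller can fall back to queueing the request without leaking budget.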
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_try_issue_directly - Try to send a request directly to device driver.
|
|
|
|
* @hctx: Pointer to the associated hardware queue.
|
|
|
|
* @rq: Pointer to request to be sent.
|
|
|
|
*
|
|
|
|
* If the device has enough resources to accept a new request now, send the
|
|
|
|
* request directly to device driver. Else, insert at hctx->dispatch queue, so
|
|
|
|
* we can try to send it again in the future. Requests inserted at this
|
|
|
|
* queue have higher priority.
|
|
|
|
*/
|
2019-04-05 01:08:43 +08:00
|
|
|
static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
|
2021-10-12 19:12:24 +08:00
|
|
|
struct request *rq)
|
2019-04-05 01:08:43 +08:00
|
|
|
{
|
2023-04-13 14:40:50 +08:00
|
|
|
blk_status_t ret;
|
|
|
|
|
|
|
|
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
|
2023-04-13 14:40:54 +08:00
|
|
|
blk_mq_insert_request(rq, 0);
|
2023-04-13 14:40:50 +08:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-05-18 13:31:01 +08:00
|
|
|
if ((rq->rq_flags & RQF_USE_SCHED) || !blk_mq_get_budget_and_tag(rq)) {
|
2023-04-13 14:40:54 +08:00
|
|
|
blk_mq_insert_request(rq, 0);
|
block: Improve performance for BLK_MQ_F_BLOCKING drivers
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.
The following caller chains have been reviewed:
blk_mq_run_hw_queue(hctx, false)
blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */
blk_execute_rq() /* may sleep */
blk_mq_run_hw_queues(q, async=false)
blk_freeze_queue_start() /* may sleep */
blk_mq_requeue_work() /* may sleep */
scsi_kick_queue()
scsi_requeue_run_queue() /* may sleep */
scsi_run_host_queues()
scsi_ioctl_reset() /* may sleep */
blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list(plug, from_schedule=false)
__blk_flush_plug(plug, from_schedule=false)
blk_add_rq_to_plug()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_plug_issue_direct()
blk_mq_flush_plug_list() /* see above */
blk_mq_dispatch_plug_list(plug, from_sched=false)
blk_mq_flush_plug_list() /* see above */
blk_mq_try_issue_directly()
blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */
blk_mq_try_issue_list_directly(hctx, list)
blk_mq_insert_requests() /* see above */
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-22 01:27:30 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, rq->cmd_flags & REQ_NOWAIT);
|
2023-04-13 14:40:50 +08:00
|
|
|
return;
|
|
|
|
}
|
2019-04-05 01:08:43 +08:00
|
|
|
|
2023-04-13 14:40:50 +08:00
|
|
|
ret = __blk_mq_issue_directly(hctx, rq, true);
|
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
|
|
|
break;
|
|
|
|
case BLK_STS_RESOURCE:
|
|
|
|
case BLK_STS_DEV_RESOURCE:
|
2023-04-13 14:40:55 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, 0);
|
2023-04-13 14:40:52 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, false);
|
2023-04-13 14:40:50 +08:00
|
|
|
break;
|
|
|
|
default:
|
2019-04-05 01:08:43 +08:00
|
|
|
blk_mq_end_request(rq, ret);
|
2023-04-13 14:40:50 +08:00
|
|
|
break;
|
|
|
|
}
|
2019-04-05 01:08:43 +08:00
|
|
|
}
|
|
|
|
|
2021-11-17 14:13:58 +08:00
|
|
|
static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
|
2019-04-05 01:08:43 +08:00
|
|
|
{
|
2023-04-13 14:40:50 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
|
|
|
|
|
|
|
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
|
2023-04-13 14:40:54 +08:00
|
|
|
blk_mq_insert_request(rq, 0);
|
2023-04-13 14:40:50 +08:00
|
|
|
return BLK_STS_OK;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!blk_mq_get_budget_and_tag(rq))
|
|
|
|
return BLK_STS_RESOURCE;
|
|
|
|
return __blk_mq_issue_directly(hctx, rq, last);
|
2017-03-23 03:01:51 +08:00
|
|
|
}
|
|
|
|
|
2023-01-18 17:37:18 +08:00
|
|
|
static void blk_mq_plug_issue_direct(struct blk_plug *plug)
|
2021-11-17 14:13:57 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = NULL;
|
|
|
|
struct request *rq;
|
|
|
|
int queued = 0;
|
2023-01-18 17:37:20 +08:00
|
|
|
blk_status_t ret = BLK_STS_OK;
|
2021-11-17 14:13:57 +08:00
|
|
|
|
|
|
|
while ((rq = rq_list_pop(&plug->mq_list))) {
|
|
|
|
bool last = rq_list_empty(plug->mq_list);
|
|
|
|
|
|
|
|
if (hctx != rq->mq_hctx) {
|
2023-01-18 17:37:19 +08:00
|
|
|
if (hctx) {
|
|
|
|
blk_mq_commit_rqs(hctx, queued, false);
|
|
|
|
queued = 0;
|
|
|
|
}
|
2021-11-17 14:13:57 +08:00
|
|
|
hctx = rq->mq_hctx;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = blk_mq_request_issue_directly(rq, last);
|
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
|
|
|
queued++;
|
|
|
|
break;
|
|
|
|
case BLK_STS_RESOURCE:
|
|
|
|
case BLK_STS_DEV_RESOURCE:
|
2023-04-13 14:40:55 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, 0);
|
2023-04-13 14:40:52 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, false);
|
2023-01-18 17:37:20 +08:00
|
|
|
goto out;
|
2021-11-17 14:13:57 +08:00
|
|
|
default:
|
|
|
|
blk_mq_end_request(rq, ret);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-01-18 17:37:20 +08:00
|
|
|
out:
|
|
|
|
if (ret != BLK_STS_OK)
|
2023-01-18 17:37:19 +08:00
|
|
|
blk_mq_commit_rqs(hctx, queued, false);
|
2021-11-17 14:13:57 +08:00
|
|
|
}
|
|
|
|
|
2021-12-21 04:59:19 +08:00
|
|
|
static void __blk_mq_flush_plug_list(struct request_queue *q,
|
|
|
|
struct blk_plug *plug)
|
|
|
|
{
|
|
|
|
if (blk_queue_quiesced(q))
|
|
|
|
return;
|
|
|
|
q->mq_ops->queue_rqs(&plug->mq_list);
|
|
|
|
}
|
|
|
|
|
2022-03-12 01:24:17 +08:00
|
|
|
static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *this_hctx = NULL;
|
|
|
|
struct blk_mq_ctx *this_ctx = NULL;
|
|
|
|
struct request *requeue_list = NULL;
|
2023-03-13 17:30:02 +08:00
|
|
|
struct request **requeue_lastp = &requeue_list;
|
2022-03-12 01:24:17 +08:00
|
|
|
unsigned int depth = 0;
|
2023-05-18 13:30:59 +08:00
|
|
|
bool is_passthrough = false;
|
2022-03-12 01:24:17 +08:00
|
|
|
LIST_HEAD(list);
|
|
|
|
|
|
|
|
do {
|
|
|
|
struct request *rq = rq_list_pop(&plug->mq_list);
|
|
|
|
|
|
|
|
if (!this_hctx) {
|
|
|
|
this_hctx = rq->mq_hctx;
|
|
|
|
this_ctx = rq->mq_ctx;
|
2023-05-18 13:30:59 +08:00
|
|
|
is_passthrough = blk_rq_is_passthrough(rq);
|
|
|
|
} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
|
|
|
|
is_passthrough != blk_rq_is_passthrough(rq)) {
|
2023-03-13 17:30:02 +08:00
|
|
|
rq_list_add_tail(&requeue_lastp, rq);
|
2022-03-12 01:24:17 +08:00
|
|
|
continue;
|
|
|
|
}
|
2023-03-13 17:30:02 +08:00
|
|
|
list_add(&rq->queuelist, &list);
|
2022-03-12 01:24:17 +08:00
|
|
|
depth++;
|
|
|
|
} while (!rq_list_empty(plug->mq_list));
|
|
|
|
|
|
|
|
plug->mq_list = requeue_list;
|
|
|
|
trace_block_unplug(this_hctx->queue, depth, !from_sched);
|
2023-04-13 14:40:42 +08:00
|
|
|
|
|
|
|
percpu_ref_get(&this_hctx->queue->q_usage_counter);
|
2023-05-18 13:30:59 +08:00
|
|
|
/* passthrough requests should never be issued to the I/O scheduler */
|
2023-06-21 21:22:08 +08:00
|
|
|
if (is_passthrough) {
|
|
|
|
spin_lock(&this_hctx->lock);
|
|
|
|
list_splice_tail_init(&list, &this_hctx->dispatch);
|
|
|
|
spin_unlock(&this_hctx->lock);
|
|
|
|
blk_mq_run_hw_queue(this_hctx, from_sched);
|
|
|
|
} else if (this_hctx->queue->elevator) {
|
2023-04-13 14:40:42 +08:00
|
|
|
this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
|
2023-04-13 14:40:56 +08:00
|
|
|
&list, 0);
|
2023-04-13 14:40:42 +08:00
|
|
|
blk_mq_run_hw_queue(this_hctx, from_sched);
|
|
|
|
} else {
|
|
|
|
blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
|
|
|
|
}
|
|
|
|
percpu_ref_put(&this_hctx->queue->q_usage_counter);
|
2022-03-12 01:24:17 +08:00
|
|
|
}
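The loop in blk_mq_dispatch_plug_list() above splits a singly linked list into one batch that matches the first request and a leftover list that keeps its order, using a pointer to the leftover list's tail link. Here is a user-space sketch of that partitioning. The single integer `key` is a hypothetical simplification of the (hctx, ctx, passthrough) comparison, and all `demo_` names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request: one key replaces the hctx/ctx/passthrough triple. */
struct demo_rq {
	int key;
	struct demo_rq *next;
};

/* Detach and return the first element, or NULL if the list is empty. */
static struct demo_rq *demo_pop(struct demo_rq **head)
{
	struct demo_rq *rq = *head;

	if (rq) {
		*head = rq->next;
		rq->next = NULL;
	}
	return rq;
}

/*
 * Move every request sharing the first request's key onto *batch and
 * leave the others on *head in their original order. Returns the
 * number of requests placed in the batch.
 */
static unsigned int demo_take_batch(struct demo_rq **head, struct demo_rq **batch)
{
	struct demo_rq *leftover = NULL;
	struct demo_rq **leftover_lastp = &leftover;
	struct demo_rq *rq;
	unsigned int depth = 0;
	int key = 0;

	*batch = NULL;
	while ((rq = demo_pop(head))) {
		if (depth == 0)
			key = rq->key;	/* first request defines the batch */
		if (rq->key != key) {
			/* Append at the tail so leftover order is preserved. */
			*leftover_lastp = rq;
			leftover_lastp = &rq->next;
			continue;
		}
		/* Head insertion, so the batch is in reverse pop order. */
		rq->next = *batch;
		*batch = rq;
		depth++;
	}
	*head = leftover;
	return depth;
}
```

Calling `demo_take_batch()` repeatedly until the list is empty corresponds to the do/while loop in blk_mq_flush_plug_list(), with each call producing one batch.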
|
|
|
|
|
2021-11-17 14:13:57 +08:00
|
|
|
void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
|
|
|
|
{
|
2021-12-03 21:48:53 +08:00
|
|
|
struct request *rq;
|
2021-11-17 14:13:57 +08:00
|
|
|
|
2023-07-14 18:11:06 +08:00
|
|
|
/*
|
|
|
|
* We may have been called recursively midway through handling
|
|
|
|
* plug->mq_list via a schedule() in the driver's queue_rq() callback.
|
|
|
|
* To avoid mq_list changing under our feet, clear rq_count early and
|
|
|
|
* bail out specifically if rq_count is 0 rather than checking
|
|
|
|
* whether the mq_list is empty.
|
|
|
|
*/
|
|
|
|
if (plug->rq_count == 0)
|
2021-11-17 14:13:57 +08:00
|
|
|
return;
|
|
|
|
plug->rq_count = 0;
|
|
|
|
|
|
|
|
if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
|
2021-12-03 21:48:53 +08:00
|
|
|
struct request_queue *q;
|
|
|
|
|
|
|
|
rq = rq_list_peek(&plug->mq_list);
|
|
|
|
q = rq->q;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Peek first request and see if we have a ->queue_rqs() hook.
|
|
|
|
* If we do, we can dispatch the whole plug list in one go. We
|
|
|
|
* already know at this point that all requests belong to the
|
|
|
|
* same queue, caller must ensure that's the case.
|
|
|
|
*/
|
2023-09-13 23:16:14 +08:00
|
|
|
if (q->mq_ops->queue_rqs) {
|
2021-12-03 21:48:53 +08:00
|
|
|
blk_mq_run_dispatch_ops(q,
|
2021-12-21 04:59:19 +08:00
|
|
|
__blk_mq_flush_plug_list(q, plug));
|
2021-12-03 21:48:53 +08:00
|
|
|
if (rq_list_empty(plug->mq_list))
|
|
|
|
return;
|
|
|
|
}
|
2021-12-06 11:33:50 +08:00
|
|
|
|
|
|
|
blk_mq_run_dispatch_ops(q,
|
2023-01-18 17:37:18 +08:00
|
|
|
blk_mq_plug_issue_direct(plug));
|
2021-11-17 14:13:57 +08:00
|
|
|
if (rq_list_empty(plug->mq_list))
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
do {
|
2022-03-12 01:24:17 +08:00
|
|
|
blk_mq_dispatch_plug_list(plug, from_schedule);
|
2021-11-17 14:13:57 +08:00
|
|
|
} while (!rq_list_empty(plug->mq_list));
|
|
|
|
}
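The comment at the top of blk_mq_flush_plug_list() describes guarding against recursive entry by clearing rq_count before any work is done. The idea can be sketched in user space as follows. The struct, the `flushes` counter and the `recurse` flag are hypothetical and exist only to make the effect observable.

```c
#include <assert.h>

/* Hypothetical plug: rq_count is the guard, flushes counts real work. */
struct demo_plug {
	unsigned int rq_count;
	unsigned int flushes;
};

/*
 * Clear the counter before doing any work. A nested call made while
 * the outer call is still running then sees rq_count == 0 and returns
 * without touching the list a second time.
 */
static void demo_flush(struct demo_plug *plug, int recurse)
{
	if (plug->rq_count == 0)
		return;
	plug->rq_count = 0;

	plug->flushes++;
	if (recurse)
		demo_flush(plug, 0);	/* models a schedule() inside queue_rq() */
}
```

Checking the counter instead of list emptiness is what makes the nested call safe, because the list may be only partly consumed when the recursion happens.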
|
|
|
|
|
2023-04-13 14:40:41 +08:00
|
|
|
static void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
|
2018-07-10 09:03:31 +08:00
|
|
|
struct list_head *list)
|
|
|
|
{
|
2020-04-07 02:13:48 +08:00
|
|
|
int queued = 0;
|
2023-01-18 17:37:21 +08:00
|
|
|
blk_status_t ret = BLK_STS_OK;
|
2020-04-07 02:13:48 +08:00
|
|
|
|
2018-07-10 09:03:31 +08:00
|
|
|
while (!list_empty(list)) {
|
|
|
|
struct request *rq = list_first_entry(list, struct request,
|
|
|
|
queuelist);
|
|
|
|
|
|
|
|
list_del_init(&rq->queuelist);
|
2019-04-05 01:08:43 +08:00
|
|
|
ret = blk_mq_request_issue_directly(rq, list_empty(list));
|
2023-01-18 17:37:25 +08:00
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
2020-04-07 02:13:48 +08:00
|
|
|
queued++;
|
2023-01-18 17:37:25 +08:00
|
|
|
break;
|
|
|
|
case BLK_STS_RESOURCE:
|
|
|
|
case BLK_STS_DEV_RESOURCE:
|
2023-04-13 14:40:55 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, 0);
|
2023-04-13 14:40:52 +08:00
|
|
|
if (list_empty(list))
|
|
|
|
blk_mq_run_hw_queue(hctx, false);
|
2023-01-18 17:37:25 +08:00
|
|
|
goto out;
|
|
|
|
default:
|
|
|
|
blk_mq_end_request(rq, ret);
|
|
|
|
break;
|
|
|
|
}
|
2018-07-10 09:03:31 +08:00
|
|
|
}
|
2018-11-28 08:02:25 +08:00
|
|
|
|
2023-01-18 17:37:25 +08:00
|
|
|
out:
|
2023-01-18 17:37:21 +08:00
|
|
|
if (ret != BLK_STS_OK)
|
|
|
|
blk_mq_commit_rqs(hctx, queued, false);
|
2018-07-10 09:03:31 +08:00
|
|
|
}
|
|
|
|
|
2021-11-11 16:51:34 +08:00
|
|
|
static bool blk_mq_attempt_bio_merge(struct request_queue *q,
|
2021-11-24 00:04:41 +08:00
|
|
|
struct bio *bio, unsigned int nr_segs)
|
2021-11-03 19:47:09 +08:00
|
|
|
{
|
|
|
|
if (!blk_queue_nomerges(q) && bio_mergeable(bio)) {
|
2021-11-24 00:04:41 +08:00
|
|
|
if (blk_attempt_plug_merge(q, bio, nr_segs))
|
2021-11-03 19:47:09 +08:00
|
|
|
return true;
|
|
|
|
if (blk_mq_sched_bio_merge(q, bio, nr_segs))
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2021-11-03 19:52:45 +08:00
|
|
|
static struct request *blk_mq_get_new_requests(struct request_queue *q,
|
|
|
|
struct blk_plug *plug,
|
2022-03-08 16:09:15 +08:00
|
|
|
struct bio *bio,
|
|
|
|
unsigned int nsegs)
|
2021-11-03 19:52:45 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.nr_tags = 1,
|
2022-01-04 21:42:23 +08:00
|
|
|
.cmd_flags = bio->bi_opf,
|
2021-11-03 19:52:45 +08:00
|
|
|
};
|
|
|
|
struct request *rq;
|
|
|
|
|
2022-03-08 16:09:15 +08:00
|
|
|
rq_qos_throttle(q, bio);
|
|
|
|
|
2021-11-03 19:52:45 +08:00
|
|
|
if (plug) {
|
|
|
|
data.nr_tags = plug->nr_ios;
|
|
|
|
plug->nr_ios = 1;
|
|
|
|
data.cached_rq = &plug->cached_rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
rq = __blk_mq_alloc_requests(&data);
|
2021-12-03 03:42:58 +08:00
|
|
|
if (rq)
|
|
|
|
return rq;
|
2021-11-03 19:52:45 +08:00
|
|
|
rq_qos_cleanup(q, bio);
|
|
|
|
if (bio->bi_opf & REQ_NOWAIT)
|
|
|
|
bio_wouldblock_error(bio);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2024-01-11 21:57:04 +08:00
|
|
|
/*
|
2024-01-24 17:26:57 +08:00
|
|
|
* Check if there is a suitable cached request and return it.
|
2024-01-11 21:57:04 +08:00
|
|
|
*/
|
2024-01-24 17:26:57 +08:00
|
|
|
static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
|
|
|
|
struct request_queue *q, blk_opf_t opf)
|
2021-11-03 19:52:45 +08:00
|
|
|
{
|
2024-01-24 17:26:57 +08:00
|
|
|
enum hctx_type type = blk_mq_get_hctx_type(opf);
|
|
|
|
struct request *rq;
|
2021-11-03 19:52:45 +08:00
|
|
|
|
2024-01-24 17:26:57 +08:00
|
|
|
if (!plug)
|
|
|
|
return NULL;
|
|
|
|
rq = rq_list_peek(&plug->cached_rq);
|
|
|
|
if (!rq || rq->q != q)
|
|
|
|
return NULL;
|
|
|
|
if (type != rq->mq_hctx->type &&
|
|
|
|
(type != HCTX_TYPE_READ || rq->mq_hctx->type != HCTX_TYPE_DEFAULT))
|
|
|
|
return NULL;
|
|
|
|
if (op_is_flush(rq->cmd_flags) != op_is_flush(opf))
|
|
|
|
return NULL;
|
|
|
|
return rq;
|
|
|
|
}
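The peek helper above accepts a cached request only if it belongs to the same queue, has a compatible hardware-context type, and agrees with the bio on whether the operation is a flush. The one allowed type mismatch is a read that reuses a request cached for the default type. A minimal userspace sketch of that predicate, assuming hypothetical `sketch_*` names and an integer queue id in place of the kernel pointers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the hardware-context types used by the check. */
enum sketch_hctx_type {
	SKETCH_TYPE_DEFAULT,
	SKETCH_TYPE_READ,
	SKETCH_TYPE_POLL,
};

/* Hypothetical stand-in for the fields of a cached request that matter here. */
struct sketch_cached_rq {
	int queue_id;               /* models rq->q */
	enum sketch_hctx_type type; /* models rq->mq_hctx->type */
	bool is_flush;              /* models op_is_flush(rq->cmd_flags) */
};

/* Decide whether a cached request may serve an incoming bio. */
static bool sketch_cached_rq_usable(const struct sketch_cached_rq *rq,
				    int queue_id,
				    enum sketch_hctx_type type,
				    bool is_flush)
{
	if (!rq || rq->queue_id != queue_id)
		return false;
	/* Types must match, except that a read may use a default-type request. */
	if (type != rq->type &&
	    (type != SKETCH_TYPE_READ || rq->type != SKETCH_TYPE_DEFAULT))
		return false;
	/* Flush and non-flush requests are never interchanged. */
	if (rq->is_flush != is_flush)
		return false;
	return true;
}
```

The exception is one-directional in this sketch, as in the code above: a default-type bio does not reuse a request cached for the read type.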
|
2022-03-08 16:09:15 +08:00
|
|
|
|
2024-01-24 17:26:57 +08:00
|
|
|
static void blk_mq_use_cached_rq(struct request *rq, struct blk_plug *plug,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
WARN_ON_ONCE(rq_list_peek(&plug->cached_rq) != rq);
|
2021-11-24 14:28:56 +08:00
|
|
|
|
2022-06-22 00:03:57 +08:00
|
|
|
/*
|
|
|
|
* If any qos ->throttle() ends up blocking, we will have flushed the
|
|
|
|
* plug and hence killed the cached_rq list as well. Pop this entry
|
|
|
|
* before we throttle.
|
|
|
|
*/
|
2021-11-24 14:28:56 +08:00
|
|
|
plug->cached_rq = rq_list_next(rq);
|
2023-11-13 11:52:31 +08:00
|
|
|
rq_qos_throttle(rq->q, bio);
|
2022-06-22 00:03:57 +08:00
|
|
|
|
2023-07-10 18:55:16 +08:00
|
|
|
blk_mq_rq_time_init(rq, 0);
|
2023-11-13 11:52:31 +08:00
|
|
|
rq->cmd_flags = bio->bi_opf;
|
2021-11-24 14:28:56 +08:00
|
|
|
INIT_LIST_HEAD(&rq->queuelist);
|
2021-11-03 19:52:45 +08:00
|
|
|
}
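The ordering in the helper above matters: the request is popped from the plug's cached list before throttling, because a throttle that blocks flushes the plug and releases whatever is still on that list. A minimal userspace sketch of why popping first keeps the chosen request valid, assuming hypothetical `sketch_*` types and a boolean `throttle_blocks` that stands in for a qos throttle that sleeps:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a cached request and the plug that owns it. */
struct sketch_rq {
	struct sketch_rq *next;
	bool freed;
};

struct sketch_plug {
	struct sketch_rq *cached;
};

/* Models a plug flush: every request still on the cached list is released. */
static void sketch_flush_plug(struct sketch_plug *plug)
{
	struct sketch_rq *rq;

	while ((rq = plug->cached) != NULL) {
		plug->cached = rq->next;
		rq->next = NULL;
		rq->freed = true;
	}
}

/* Take the head of the cached list, then throttle. */
static struct sketch_rq *sketch_use_cached(struct sketch_plug *plug,
					   bool throttle_blocks)
{
	struct sketch_rq *rq = plug->cached;

	if (!rq)
		return NULL;
	/* Pop first, so a flush triggered by throttling cannot release rq. */
	plug->cached = rq->next;
	rq->next = NULL;
	if (throttle_blocks)
		sketch_flush_plug(plug);
	return rq;
}
```

If the pop happened after the throttle step instead, the flush in this model would mark the head request as freed before the caller got to use it.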
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
2020-07-01 16:59:43 +08:00
|
|
|
* blk_mq_submit_bio - Create and send a request to block device.
|
2020-01-07 02:08:18 +08:00
|
|
|
* @bio: Bio pointer.
|
|
|
|
*
|
|
|
|
* Builds up a request structure from @bio and sends it to the device. The
|
|
|
|
* request may not be queued directly to hardware if:
|
|
|
|
* * This request can be merged with another one
|
|
|
|
* * We want to place the request on the plug list for possible future merging
|
|
|
|
* * There is an IO scheduler active at this queue
|
|
|
|
*
|
|
|
|
* It will not queue the request if there is an error with the bio, or during
|
|
|
|
* request creation.
|
|
|
|
*/
|
2021-10-12 19:12:24 +08:00
|
|
|
void blk_mq_submit_bio(struct bio *bio)
|
2014-05-23 00:40:51 +08:00
|
|
|
{
|
2021-10-14 22:03:30 +08:00
|
|
|
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
|
2024-04-08 09:41:28 +08:00
|
|
|
struct blk_plug *plug = current->plug;
|
2016-10-28 22:48:16 +08:00
|
|
|
const int is_sync = op_is_sync(bio->bi_opf);
|
2023-04-13 14:40:51 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2021-10-14 02:43:41 +08:00
|
|
|
unsigned int nr_segs = 1;
|
2024-01-24 17:26:58 +08:00
|
|
|
struct request *rq;
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
|
|
|
blk_status_t ret;
|
2014-05-23 00:40:51 +08:00
|
|
|
|
block: Introduce zone write plugging
Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.
Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.
This mechanism allows to:
- Untangle zone write ordering from block IO schedulers. This allows
removing the restriction on using mq-deadline for writing to zoned
block devices. Any block IO scheduler, including "none" can be used.
- Zone write plugging operates on BIOs instead of requests. Plugged
BIOs waiting for execution thus do not hold scheduling tags and thus
are not preventing other BIOs from executing (reads or writes to
other zones). Depending on the workload, this can significantly
improve the device use (higher queue depth operation) and
performance.
- Both blk-mq (request based) zoned devices and BIO-based zoned devices
(e.g. device mapper) can use zone write plugging. It is mandatory
for the former but optional for the latter. BIO-based drivers can
use zone write plugging to implement write ordering guarantees, or
the drivers can implement their own if needed.
- The code is less invasive in the block layer and is mostly limited to
blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
bio.c.
Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plugs structures are
managed using a per-disk hash table.
Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio() which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugs. This change enables zone write plugging by default for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by explicitly calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and not
straddling zone boundaries.
Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of BIOs and requests flagged trigger respectively calls
to the functions blk_zone_write_bio_endio() and
blk_zone_write_complete_request(). The latter function is used to
trigger submission of the next plugged BIO using the zone plug work.
blk_zone_write_bio_endio() does the same for BIO-based devices.
This ensures that at any time, at most one request (blk-mq devices) or
one BIO (BIO-based devices) is being executed for any zone. The
handling of zone write plugs using a per-zone plug spinlock maximizes
parallelism and device usage by allowing multiple zones to be written
simultaneously without lock contention.
Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.
Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bio_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.
When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.
Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plugs work keeps running
until all zone errors are handled.
To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plugs resources when a gendisk
is allocated.
In order to guarantee that the user can simultaneously write up to a
number of zones equal to a device max active zone limit or max open zone
limit, zone write plugs are allocated using a mempool sized to the
maximum of these 2 device limits. For a device that does not have
active and open zone limits, 128 is used as the default mempool size.
If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.
This commit contains contributions from Christoph Hellwig <hch@lst.de>.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-08 09:41:07 +08:00
|
|
|
/*
|
|
|
|
* If the plug has a cached request for this queue, try to use it.
|
|
|
|
*/
|
|
|
|
rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A BIO that was released from a zone write plug has already been
|
|
|
|
* through the preparation in this function, already holds a reference
|
|
|
|
* on the queue usage counter, and is the only write BIO in-flight for
|
|
|
|
* the target zone. Go straight to preparing a request for it.
|
|
|
|
*/
|
|
|
|
if (bio_zone_write_plugging(bio)) {
|
|
|
|
nr_segs = bio->__bi_nr_segments;
|
|
|
|
if (rq)
|
|
|
|
blk_queue_exit(q);
|
|
|
|
goto new_request;
|
|
|
|
}
|
|
|
|
|
2022-07-28 00:22:56 +08:00
|
|
|
bio = blk_queue_bounce(bio, q);
|
2022-06-23 15:48:33 +08:00
|
|
|
|
2024-01-24 17:26:58 +08:00
|
|
|
/*
|
|
|
|
* The cached request already holds a q_usage_counter reference and we
|
|
|
|
* don't have to acquire a new one if we use it.
|
|
|
|
*/
|
|
|
|
if (!rq) {
|
2023-11-13 11:52:31 +08:00
|
|
|
if (unlikely(bio_queue_enter(bio)))
|
2021-11-24 14:28:56 +08:00
|
|
|
return;
|
2023-11-13 11:52:31 +08:00
|
|
|
}
|
|
|
|
|
2024-01-24 17:26:57 +08:00
|
|
|
if (unlikely(bio_may_exceed_limits(bio, &q->limits))) {
|
|
|
|
bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
|
|
|
|
if (!bio)
|
2024-01-24 17:26:56 +08:00
|
|
|
goto queue_exit;
|
2023-11-13 11:52:31 +08:00
|
|
|
}
|
2024-01-24 17:26:57 +08:00
|
|
|
if (!bio_integrity_prep(bio))
|
|
|
|
goto queue_exit;
|
2023-11-13 11:52:31 +08:00
|
|
|
|
2024-01-24 17:26:56 +08:00
|
|
|
if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
|
|
|
|
goto queue_exit;
|
|
|
|
|
2024-04-08 09:41:07 +08:00
|
|
|
if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
|
|
|
|
goto queue_exit;
|
|
|
|
|
|
|
|
new_request:
|
2024-01-24 17:26:58 +08:00
|
|
|
if (!rq) {
|
|
|
|
rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
|
|
|
|
if (unlikely(!rq))
|
|
|
|
goto queue_exit;
|
|
|
|
} else {
|
|
|
|
blk_mq_use_cached_rq(rq, plug, bio);
|
2021-11-24 14:28:56 +08:00
|
|
|
}
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, with far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 03:38:14 +08:00
|
|
|
|
2020-12-04 00:21:36 +08:00
|
|
|
trace_block_getrq(bio);
|
2018-10-23 22:30:50 +08:00
|
|
|
|
2018-07-03 23:14:59 +08:00
|
|
|
rq_qos_track(q, rq, bio);
|
2014-05-23 00:40:51 +08:00
|
|
|
|
2019-07-01 23:47:30 +08:00
|
|
|
blk_mq_bio_to_request(rq, bio, nr_segs);
|
|
|
|
|
2023-03-16 02:39:02 +08:00
|
|
|
ret = blk_crypto_rq_get_keyslot(rq);
|
2020-05-14 08:37:18 +08:00
|
|
|
if (ret != BLK_STS_OK) {
|
|
|
|
bio->bi_status = ret;
|
|
|
|
bio_endio(bio);
|
|
|
|
blk_mq_free_request(rq);
|
2021-10-12 19:12:24 +08:00
|
|
|
return;
|
2020-05-14 08:37:18 +08:00
	}

	if (bio_zone_write_plugging(bio))
		blk_zone_write_plug_init_request(rq);

	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
		return;

	if (plug) {
		blk_add_rq_to_plug(plug, rq);
		return;
	}

	hctx = rq->mq_hctx;
	if ((rq->rq_flags & RQF_USE_SCHED) ||
	    (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
		blk_mq_insert_request(rq, 0);
		blk_mq_run_hw_queue(hctx, true);
	} else {
		blk_mq_run_dispatch_ops(q, blk_mq_try_issue_directly(hctx, rq));
	}
	return;

queue_exit:
	/*
	 * Don't drop the queue reference if we were trying to use a cached
	 * request and thus didn't acquire one.
	 */
	if (!rq)
		blk_queue_exit(q);
}

#ifdef CONFIG_BLK_MQ_STACKING
/**
 * blk_insert_cloned_request - Helper for stacking drivers to submit a request
 * @rq: the request being queued
 */
blk_status_t blk_insert_cloned_request(struct request *rq)
{
	struct request_queue *q = rq->q;
	unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
	unsigned int max_segments = blk_rq_get_max_segments(rq);
	blk_status_t ret;

	if (blk_rq_sectors(rq) > max_sectors) {
		/*
		 * SCSI devices do not have a good way to report whether
		 * Write Same/Zero is actually supported. If a device rejects
		 * a non-read/write command (discard, write same, etc.), the
		 * low-level device driver will set the relevant queue limit to
		 * 0 to prevent blk-lib from issuing more of the offending
		 * operations. Commands queued prior to the queue limit being
		 * reset need to be completed with BLK_STS_NOTSUPP to avoid I/O
		 * errors being propagated to upper layers.
		 */
		if (max_sectors == 0)
			return BLK_STS_NOTSUPP;

		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
			__func__, blk_rq_sectors(rq), max_sectors);
		return BLK_STS_IOERR;
	}

	/*
	 * The queue settings related to segment counting may differ from the
	 * original queue.
	 */
	rq->nr_phys_segments = blk_recalc_rq_segments(rq);
	if (rq->nr_phys_segments > max_segments) {
		printk(KERN_ERR "%s: over max segments limit. (%u > %u)\n",
			__func__, rq->nr_phys_segments, max_segments);
		return BLK_STS_IOERR;
	}

	if (q->disk && should_fail_request(q->disk->part0, blk_rq_bytes(rq)))
		return BLK_STS_IOERR;

	ret = blk_crypto_rq_get_keyslot(rq);
	if (ret != BLK_STS_OK)
		return ret;

	blk_account_io_start(rq);

	/*
	 * Since we have a scheduler attached on the top device,
	 * bypass a potential scheduler on the bottom device for
	 * insert.
	 */
	blk_mq_run_dispatch_ops(q,
			ret = blk_mq_request_issue_directly(rq, true));
	if (ret)
		blk_account_io_done(rq, blk_time_get_ns());
	return ret;
}
EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
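
/*
 * Illustrative userspace sketch, not part of blk-mq: it models only the
 * order of the limit checks in blk_insert_cloned_request() above. The
 * enum values and sketch_check_limits() are hypothetical stand-ins for
 * the BLK_STS_* codes and the queue-limit helpers; fault injection,
 * crypto keyslots and accounting are deliberately left out.
 */

```c
#include <assert.h>

/* Hypothetical stand-ins for BLK_STS_OK, BLK_STS_NOTSUPP, BLK_STS_IOERR. */
enum sketch_status { SKETCH_OK, SKETCH_NOTSUPP, SKETCH_IOERR };

/*
 * An oversized request is reported as "not supported" when the size limit
 * has been reset to 0, and as an I/O error otherwise. The segment limit is
 * only consulted once the size check has passed.
 */
enum sketch_status sketch_check_limits(unsigned int sectors,
				       unsigned int max_sectors,
				       unsigned int segments,
				       unsigned int max_segments)
{
	if (sectors > max_sectors)
		return max_sectors == 0 ? SKETCH_NOTSUPP : SKETCH_IOERR;
	if (segments > max_segments)
		return SKETCH_IOERR;
	return SKETCH_OK;
}
```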

/**
 * blk_rq_unprep_clone - Helper function to free all bios in a cloned request
 * @rq: the clone request to be cleaned up
 *
 * Description:
 *     Free all bios in @rq for a cloned request.
 */
void blk_rq_unprep_clone(struct request *rq)
{
	struct bio *bio;

	while ((bio = rq->bio) != NULL) {
		rq->bio = bio->bi_next;

		bio_put(bio);
	}
}
EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
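
/*
 * Illustrative userspace sketch, not part of blk-mq: the same
 * "detach the head, then release it" loop as blk_rq_unprep_clone(),
 * on a plain singly linked list. struct sketch_bio, struct sketch_rq
 * and sketch_unprep() are hypothetical names, and free() merely stands
 * in for bio_put(), which drops a reference rather than freeing
 * unconditionally.
 */

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_bio { struct sketch_bio *next; };
struct sketch_rq { struct sketch_bio *bio; };

/* Detach and release every node from the head; returns how many were freed. */
unsigned int sketch_unprep(struct sketch_rq *rq)
{
	struct sketch_bio *bio;
	unsigned int freed = 0;

	while ((bio = rq->bio) != NULL) {
		/* Advance the head first so the node is unreachable when freed. */
		rq->bio = bio->next;
		free(bio);
		freed++;
	}
	return freed;
}
```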

/**
 * blk_rq_prep_clone - Helper function to set up a clone request
 * @rq: the request to be set up
 * @rq_src: original request to be cloned
 * @bs: bio_set that bios for clone are allocated from
 * @gfp_mask: memory allocation mask for bio
 * @bio_ctr: setup function to be called for each clone bio.
 *           Returns %0 for success, non %0 for failure.
 * @data: private data to be passed to @bio_ctr
 *
 * Description:
 *     Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
 *     Also, pages which the original bios are pointing to are not copied
 *     and the cloned bios just point to the same pages.
 *     So cloned bios must be completed before original bios, which means
 *     the caller must complete @rq before @rq_src.
 */
int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
		      struct bio_set *bs, gfp_t gfp_mask,
		      int (*bio_ctr)(struct bio *, struct bio *, void *),
		      void *data)
{
	struct bio *bio, *bio_src;

	if (!bs)
		bs = &fs_bio_set;

	__rq_for_each_bio(bio_src, rq_src) {
		bio = bio_alloc_clone(rq->q->disk->part0, bio_src, gfp_mask,
				      bs);
		if (!bio)
			goto free_and_out;

		if (bio_ctr && bio_ctr(bio, bio_src, data))
			goto free_and_out;

		if (rq->bio) {
			rq->biotail->bi_next = bio;
			rq->biotail = bio;
		} else {
			rq->bio = rq->biotail = bio;
		}
		bio = NULL;
	}

	/* Copy attributes of the original request to the clone request. */
	rq->__sector = blk_rq_pos(rq_src);
	rq->__data_len = blk_rq_bytes(rq_src);
	if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
		rq->special_vec = rq_src->special_vec;
	}
	rq->nr_phys_segments = rq_src->nr_phys_segments;
	rq->ioprio = rq_src->ioprio;
	rq->write_hint = rq_src->write_hint;

	if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0)
		goto free_and_out;

	return 0;

free_and_out:
	if (bio)
		bio_put(bio);
	blk_rq_unprep_clone(rq);

	return -ENOMEM;
}
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
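
/*
 * Illustrative userspace sketch, not part of blk-mq: the tail-append
 * and roll-back-on-failure shape of blk_rq_prep_clone(). All names
 * (struct sk_bio, struct sk_rq, sk_prep_clone(), the fail_at
 * parameter) are hypothetical. Allocation failure is injected through
 * fail_at, and -1 stands in for -ENOMEM.
 */

```c
#include <assert.h>
#include <stdlib.h>

struct sk_bio { int id; struct sk_bio *next; };
struct sk_rq { struct sk_bio *bio, *biotail; };

/* Clone src's chain onto dst; a negative fail_at disables failure injection. */
int sk_prep_clone(struct sk_rq *dst, const struct sk_rq *src, int fail_at)
{
	const struct sk_bio *s;
	int n = 0;

	for (s = src->bio; s; s = s->next, n++) {
		struct sk_bio *b = (n == fail_at) ? NULL : malloc(sizeof(*b));

		if (!b)
			goto free_and_out;
		b->id = s->id;
		b->next = NULL;

		/* Append at the tail so the clone keeps the source order. */
		if (dst->bio) {
			dst->biotail->next = b;
			dst->biotail = b;
		} else {
			dst->bio = dst->biotail = b;
		}
	}
	return 0;

free_and_out:
	/* Undo the partial clone so the caller sees an empty request. */
	while (dst->bio) {
		struct sk_bio *b = dst->bio;

		dst->bio = b->next;
		free(b);
	}
	dst->biotail = NULL;
	return -1;
}
```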
#endif /* CONFIG_BLK_MQ_STACKING */

/*
 * Steal bios from a request and add them to a bio list.
 * The request must not have been partially completed before.
 */
void blk_steal_bios(struct bio_list *list, struct request *rq)
{
	if (rq->bio) {
		if (list->tail)
			list->tail->bi_next = rq->bio;
		else
			list->head = rq->bio;
		list->tail = rq->biotail;

		rq->bio = NULL;
		rq->biotail = NULL;
	}

	rq->__data_len = 0;
}
EXPORT_SYMBOL_GPL(blk_steal_bios);
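
/*
 * Illustrative userspace sketch, not part of blk-mq: the O(1) splice
 * that blk_steal_bios() performs, on simplified types. struct sk_node,
 * struct sk_list, struct sk_req and sk_steal() are hypothetical names.
 * The whole chain moves by relinking the list tail, so no per-bio walk
 * is needed.
 */

```c
#include <assert.h>
#include <stddef.h>

struct sk_node { struct sk_node *next; };
struct sk_list { struct sk_node *head, *tail; };
struct sk_req { struct sk_node *bio, *biotail; unsigned int data_len; };

/* Move rq's whole chain to the end of list and leave rq empty. */
void sk_steal(struct sk_list *list, struct sk_req *rq)
{
	if (rq->bio) {
		if (list->tail)
			list->tail->next = rq->bio;	/* non-empty: link after tail */
		else
			list->head = rq->bio;		/* empty: chain becomes the list */
		list->tail = rq->biotail;

		rq->bio = NULL;
		rq->biotail = NULL;
	}
	rq->data_len = 0;
}
```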

static size_t order_to_size(unsigned int order)
{
	return (size_t)PAGE_SIZE << order;
}
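
/*
 * Illustrative userspace sketch, not part of blk-mq: the arithmetic of
 * order_to_size() with an assumed 4 KiB page. SKETCH_PAGE_SIZE and
 * sketch_order_to_size() are hypothetical; the real PAGE_SIZE depends
 * on the architecture and kernel configuration. An order-n allocation
 * covers 2^n contiguous pages.
 */

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Byte size of an order-`order` allocation: page size times 2^order. */
size_t sketch_order_to_size(unsigned int order)
{
	/* Cast before shifting so the shift is carried out in size_t. */
	return (size_t)SKETCH_PAGE_SIZE << order;
}
```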

/* called before freeing request pool in @tags */
static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
				    struct blk_mq_tags *tags)
{
	struct page *page;
	unsigned long flags;

	/*
	 * There is no need to clear mapping if driver tags is not initialized
	 * or the mapping belongs to the driver tags.
	 */
	if (!drv_tags || drv_tags == tags)
		return;

	list_for_each_entry(page, &tags->page_list, lru) {
		unsigned long start = (unsigned long)page_address(page);
		unsigned long end = start + order_to_size(page->private);
		int i;

		for (i = 0; i < drv_tags->nr_tags; i++) {
			struct request *rq = drv_tags->rqs[i];
			unsigned long rq_addr = (unsigned long)rq;

			if (rq_addr >= start && rq_addr < end) {
				WARN_ON_ONCE(req_ref_read(rq) != 0);
				cmpxchg(&drv_tags->rqs[i], rq, NULL);
			}
		}
	}

	/*
	 * Wait until all pending iteration is done.
	 *
	 * Request reference is cleared and it is guaranteed to be observed
	 * after the ->lock is released.
	 */
	spin_lock_irqsave(&drv_tags->lock, flags);
	spin_unlock_irqrestore(&drv_tags->lock, flags);
}
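
/*
 * Illustrative userspace sketch, not part of blk-mq: the address-range
 * test used by blk_mq_clear_rq_mapping() to drop pointers into memory
 * that is about to be freed. sketch_clear_mapping() and its parameters
 * are hypothetical. The sketch is single-threaded, so it uses a plain
 * store where the kernel uses cmpxchg() and a lock/unlock pair to
 * synchronize with concurrent tag iteration.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clear every slot pointing into [start, start + len); returns the count. */
unsigned int sketch_clear_mapping(void **slots, size_t nr,
				  const void *start, size_t len)
{
	uintptr_t lo = (uintptr_t)start;
	uintptr_t hi = lo + len;
	unsigned int cleared = 0;

	for (size_t i = 0; i < nr; i++) {
		uintptr_t addr = (uintptr_t)slots[i];

		/* Half-open range: the first byte past the pool is not ours. */
		if (addr >= lo && addr < hi) {
			slots[i] = NULL;
			cleared++;
		}
	}
	return cleared;
}
```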

void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
		     unsigned int hctx_idx)
{
	struct blk_mq_tags *drv_tags;
	struct page *page;

	if (list_empty(&tags->page_list))
		return;

	if (blk_mq_is_shared_tags(set->flags))
		drv_tags = set->shared_tags;
	else
		drv_tags = set->tags[hctx_idx];

	if (tags->static_rqs && set->ops->exit_request) {
		int i;

		for (i = 0; i < tags->nr_tags; i++) {
			struct request *rq = tags->static_rqs[i];

			if (!rq)
				continue;
			set->ops->exit_request(set, rq, hctx_idx);
			tags->static_rqs[i] = NULL;
		}
	}

	blk_mq_clear_rq_mapping(drv_tags, tags);

	while (!list_empty(&tags->page_list)) {
		page = list_first_entry(&tags->page_list, struct page, lru);
		list_del_init(&page->lru);
		/*
		 * Remove kmemleak object previously allocated in
		 * blk_mq_alloc_rqs().
		 */
		kmemleak_free(page_address(page));
		__free_pages(page, page->private);
	}
}

void blk_mq_free_rq_map(struct blk_mq_tags *tags)
{
	kfree(tags->rqs);
	tags->rqs = NULL;
	kfree(tags->static_rqs);
	tags->static_rqs = NULL;
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
blk_mq_free_tags(tags);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2022-03-08 15:32:14 +08:00
|
|
|
static enum hctx_type hctx_idx_to_type(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < set->nr_maps; i++) {
|
|
|
|
unsigned int start = set->map[i].queue_offset;
|
|
|
|
unsigned int end = start + set->map[i].nr_queues;
|
|
|
|
|
|
|
|
if (hctx_idx >= start && hctx_idx < end)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (i >= set->nr_maps)
|
|
|
|
i = HCTX_TYPE_DEFAULT;
|
|
|
|
|
|
|
|
return i;
|
|
|
|
}
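The range lookup performed by hctx_idx_to_type() can be sketched in user space as follows. This is only an illustration: the `sketch_` types, enum values, and function name are hypothetical stand-ins for the kernel structures, and just the two map fields read by the loop are modelled.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's hctx types; all names are assumptions. */
enum sketch_hctx_type {
	SKETCH_TYPE_DEFAULT = 0,
	SKETCH_TYPE_READ,
	SKETCH_TYPE_POLL,
	SKETCH_MAX_TYPES
};

struct sketch_queue_map {
	unsigned int queue_offset;	/* first hardware queue index of this map */
	unsigned int nr_queues;		/* number of hardware queues in this map */
};

struct sketch_tag_set {
	struct sketch_queue_map map[SKETCH_MAX_TYPES];
	unsigned int nr_maps;		/* number of valid entries in map[] */
};

/*
 * Return the map type whose half-open range
 * [queue_offset, queue_offset + nr_queues) contains hctx_idx.
 * If no map matches, fall back to the default type.
 */
static enum sketch_hctx_type
sketch_idx_to_type(const struct sketch_tag_set *set, unsigned int hctx_idx)
{
	unsigned int i;

	assert(set->nr_maps <= SKETCH_MAX_TYPES);

	for (i = 0; i < set->nr_maps; i++) {
		unsigned int start = set->map[i].queue_offset;
		unsigned int end = start + set->map[i].nr_queues;

		if (hctx_idx >= start && hctx_idx < end)
			return (enum sketch_hctx_type)i;
	}

	return SKETCH_TYPE_DEFAULT;
}
```

With three maps covering queues 0-3, 4-5 and 6-7, index 5 resolves to the second map, and an index outside every range resolves to the default type.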
|
|
|
|
|
|
|
|
static int blk_mq_get_hctx_node(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx)
|
|
|
|
{
|
|
|
|
enum hctx_type type = hctx_idx_to_type(set, hctx_idx);
|
|
|
|
|
|
|
|
return blk_mq_hw_queue_to_node(&set->map[type], hctx_idx);
|
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
static struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx,
|
|
|
|
unsigned int nr_tags,
|
2021-10-05 18:23:37 +08:00
|
|
|
unsigned int reserved_tags)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2022-03-08 15:32:14 +08:00
|
|
|
int node = blk_mq_get_hctx_node(set, hctx_idx);
|
2014-04-16 04:14:00 +08:00
|
|
|
struct blk_mq_tags *tags;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-02-02 01:53:14 +08:00
|
|
|
if (node == NUMA_NO_NODE)
|
|
|
|
node = set->numa_node;
|
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
|
|
|
|
BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
|
2014-04-16 04:14:00 +08:00
|
|
|
if (!tags)
|
|
|
|
return NULL;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
treewide: kzalloc_node() -> kcalloc_node()
The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:
kzalloc_node(a * b, gfp, node)
with:
kcalloc_node(a, b, gfp, node)
as well as handling cases of:
kzalloc_node(a * b * c, gfp, node)
with:
kzalloc_node(array3_size(a, b, c), gfp, node)
as it's slightly less ugly than:
kcalloc_node(array_size(a, b), c, gfp, node)
This does, however, attempt to ignore constant size factors like:
kzalloc_node(4 * 1024, gfp, node)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc_node
+ kcalloc_node
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc_node(sizeof(THING) * C2, ...)
|
kzalloc_node(sizeof(TYPE) * C2, ...)
|
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
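One practical benefit of a two-factor allocation form is that the count and element size stay separate, so the multiplication can be checked before memory is requested. The user-space sketch below illustrates that idea only; `sketch_calloc_checked` is a hypothetical helper, not a kernel API, and it uses the C library's calloc() in place of the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate a zeroed array of n elements of the
 * given size, refusing the request when n * size would wrap around.
 * An open-coded "n * size" passed to a single-size allocator would
 * silently produce a too-small buffer in that case.
 */
static void *sketch_calloc_checked(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size does not fit in size_t */

	return calloc(n, size);
}
```

A caller would write `sketch_calloc_checked(nr_tags, sizeof(struct request *))` rather than multiplying the two values itself.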
2018-06-13 05:04:20 +08:00
|
|
|
tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
|
2016-12-06 23:31:44 +08:00
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
|
2017-02-02 01:53:14 +08:00
|
|
|
node);
|
2022-11-02 10:52:29 +08:00
|
|
|
if (!tags->rqs)
|
|
|
|
goto err_free_tags;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2018-06-13 05:04:20 +08:00
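The point of the conversion above is that an open-coded product such as `sizeof(TYPE) * COUNT` can wrap around silently, while `kcalloc_node()` and `array3_size()` check the multiplication. What follows is a minimal userspace sketch of that idea, not the kernel's implementation. `checked_array3_size()` and `checked_calloc()` are hypothetical names, and the sketch assumes GCC or Clang for `__builtin_mul_overflow`.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: a * b * c, saturating to SIZE_MAX on overflow. */
static size_t checked_array3_size(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;
	if (__builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Hypothetical helper: zeroed array allocation that refuses to wrap. */
static void *checked_calloc(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return calloc(1, bytes);
}
```

Saturating to `SIZE_MAX` means an overflowing size is passed on as a request no allocator can satisfy, so the caller sees an ordinary allocation failure instead of a buffer that is too small.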
|
|
|
tags->static_rqs = kcalloc_node(nr_tags, sizeof(struct request *),
|
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
|
|
|
|
node);
|
2022-11-02 10:52:29 +08:00
|
|
|
if (!tags->static_rqs)
|
|
|
|
goto err_free_rqs;
|
2017-01-14 05:39:30 +08:00
|
|
|
|
2017-01-12 05:29:56 +08:00
|
|
|
return tags;
|
2022-11-02 10:52:29 +08:00
|
|
|
|
|
|
|
err_free_rqs:
|
|
|
|
kfree(tags->rqs);
|
|
|
|
err_free_tags:
|
|
|
|
blk_mq_free_tags(tags);
|
|
|
|
return NULL;
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
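The four steps in the commit message above can be sketched in plain C. This is a single-threaded illustration under stated assumptions: the RCU wait and the seqcount protecting the deadline are left out, and every name (`sketch_rq`, `sketch_rq_start()`, the bit layout) is hypothetical rather than taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout: low 2 bits hold the state, the rest the generation. */
enum { RQ_IDLE = 0, RQ_IN_FLIGHT = 1, RQ_COMPLETE = 2 };
#define STATE_BITS 2
#define STATE_MASK ((UINT64_C(1) << STATE_BITS) - 1)
#define GEN_INC    (UINT64_C(1) << STATE_BITS)

struct sketch_rq {
	uint64_t gstate;         /* generation + state, written by the owner */
	uint64_t aborted_gstate; /* instance the timeout path wants to abort */
};

static void sketch_rq_init(struct sketch_rq *rq)
{
	/* Start at generation 1 so gstate cannot match the zeroed abort field. */
	rq->gstate = GEN_INC | RQ_IDLE;
	rq->aborted_gstate = 0;
}

/* Step 1: the owner marks the request in-flight and bumps the generation. */
static void sketch_rq_start(struct sketch_rq *rq)
{
	rq->gstate = ((rq->gstate & ~STATE_MASK) + GEN_INC) | RQ_IN_FLIGHT;
}

/* Step 2: the timeout path records which instance looked expired. */
static void sketch_timeout_mark(struct sketch_rq *rq, uint64_t seen_gstate)
{
	rq->aborted_gstate = seen_gstate;
}

/* Step 3: completion yields if the timeout path claimed this instance. */
static bool sketch_rq_complete(struct sketch_rq *rq)
{
	if (rq->gstate == rq->aborted_gstate)
		return false;
	rq->gstate = (rq->gstate & ~STATE_MASK) | RQ_COMPLETE;
	return true;
}

/* The owner frees the request so that it can be recycled. */
static void sketch_rq_free(struct sketch_rq *rq)
{
	rq->gstate = (rq->gstate & ~STATE_MASK) | RQ_IDLE;
}

/* Step 4: after the wait, terminate only if the instance is unchanged. */
static bool sketch_timeout_terminate(const struct sketch_rq *rq)
{
	return rq->gstate == rq->aborted_gstate;
}
```

If the request completes and is recycled between the timeout path's first look and its second pass, the generation has moved on, the comparison in `sketch_timeout_terminate()` fails, and the later instance is left alone.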
|
|
|
static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
|
|
|
|
unsigned int hctx_idx, int node)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (set->ops->init_request) {
|
|
|
|
ret = set->ops->init_request(set, rq, hctx_idx, node);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
2018-01-10 00:29:48 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
static int blk_mq_alloc_rqs(struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_tags *tags,
|
|
|
|
unsigned int hctx_idx, unsigned int depth)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
|
|
|
unsigned int i, j, entries_per_page, max_order = 4;
|
2022-03-08 15:32:14 +08:00
|
|
|
int node = blk_mq_get_hctx_node(set, hctx_idx);
|
2017-01-12 05:29:56 +08:00
|
|
|
size_t rq_size, left;
|
2017-02-02 01:53:14 +08:00
|
|
|
|
|
|
|
if (node == NUMA_NO_NODE)
|
|
|
|
node = set->numa_node;
|
2017-01-12 05:29:56 +08:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&tags->page_list);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* rq_size is the size of the request plus driver payload, rounded
|
|
|
|
* to the cacheline size
|
|
|
|
*/
|
2014-04-16 04:14:00 +08:00
|
|
|
rq_size = round_up(sizeof(struct request) + set->cmd_size,
|
2013-10-24 16:20:05 +08:00
|
|
|
cache_line_size());
|
2017-01-12 05:29:56 +08:00
|
|
|
left = rq_size * depth;
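The size computation above can be illustrated in userspace. This is a sketch under assumptions: `round_up_pow2()` and `sketch_total_bytes()` are hypothetical names, the alignment must be a power of two, and any concrete request or payload sizes passed in are made-up values, not the real `sizeof(struct request)`.

```c
#include <stddef.h>

/* Round x up to a multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t x, size_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Hypothetical helper mirroring the rq_size and left computation. */
static size_t sketch_total_bytes(size_t request_size, size_t cmd_size,
				 size_t cache_line, unsigned int depth)
{
	size_t rq_size = round_up_pow2(request_size + cmd_size, cache_line);

	return rq_size * depth;
}
```

With a 64-byte cache line, a 256-byte request plus a 44-byte driver payload rounds up from 300 to 320 bytes, so each request starts on its own cache line.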
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-01-12 05:29:56 +08:00
|
|
|
for (i = 0; i < depth; ) {
|
2013-10-24 16:20:05 +08:00
|
|
|
int this_order = max_order;
|
|
|
|
struct page *page;
|
|
|
|
int to_do;
|
|
|
|
void *p;
|
|
|
|
|
2016-05-16 23:54:47 +08:00
|
|
|
while (this_order && left < order_to_size(this_order - 1))
|
2013-10-24 16:20:05 +08:00
|
|
|
this_order--;
|
|
|
|
|
|
|
|
do {
|
2017-02-02 01:53:14 +08:00
|
|
|
page = alloc_pages_node(node,
|
2016-12-06 23:31:44 +08:00
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
|
2014-09-10 23:02:03 +08:00
|
|
|
this_order);
|
2013-10-24 16:20:05 +08:00
|
|
|
if (page)
|
|
|
|
break;
|
|
|
|
if (!this_order--)
|
|
|
|
break;
|
|
|
|
if (order_to_size(this_order) < rq_size)
|
|
|
|
break;
|
|
|
|
} while (1);
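The order selection and fallback in the loop above can be traced with a small userspace model. It is a sketch only: the page size of 4096 is an assumption, `sketch_alloc_ok()` is a stand-in that pretends every order above `max_ok` fails, and all `sketch_` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for illustration */

static size_t sketch_order_to_size(unsigned int order)
{
	return (size_t)SKETCH_PAGE_SIZE << order;
}

/* Stand-in allocator: pretends orders above max_ok are unavailable. */
static bool sketch_alloc_ok(int order, int max_ok)
{
	return order <= max_ok;
}

/* Returns the order that would be allocated, or -1 on failure. */
static int sketch_pick_order(size_t left, size_t rq_size,
			     int max_order, int max_ok)
{
	int this_order = max_order;
	bool got_page;

	/* Drop to a smaller order while the next smaller chunk still
	 * exceeds what is left to allocate. */
	while (this_order && left < sketch_order_to_size(this_order - 1))
		this_order--;

	do {
		got_page = sketch_alloc_ok(this_order, max_ok);
		if (got_page)
			break;
		if (!this_order--)
			break;
		/* Give up once a chunk could not hold even one request. */
		if (sketch_order_to_size(this_order) < rq_size)
			break;
	} while (1);

	return got_page ? this_order : -1;
}
```

The first loop avoids over-allocating for the tail of the request array, and the second trades chunk size for a better chance of success under memory fragmentation.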
|
|
|
|
|
|
|
|
if (!page)
|
2014-04-16 04:14:00 +08:00
|
|
|
goto fail;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
page->private = this_order;
|
2014-04-16 04:14:00 +08:00
|
|
|
list_add_tail(&page->lru, &tags->page_list);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
p = page_address(page);
|
2015-09-15 01:16:02 +08:00
|
|
|
/*
|
|
|
|
* Allow kmemleak to scan these pages as they contain pointers
|
|
|
|
* to additional allocations like via ops->init_request().
|
|
|
|
*/
|
2016-12-06 23:31:44 +08:00
|
|
|
kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
|
2013-10-24 16:20:05 +08:00
|
|
|
entries_per_page = order_to_size(this_order) / rq_size;
|
2017-01-12 05:29:56 +08:00
|
|
|
to_do = min(entries_per_page, depth - i);
|
2013-10-24 16:20:05 +08:00
|
|
|
left -= to_do * rq_size;
|
|
|
|
for (j = 0; j < to_do; j++) {
|
2017-01-14 05:39:30 +08:00
|
|
|
struct request *rq = p;
|
|
|
|
|
|
|
|
tags->static_rqs[i] = rq;
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dance around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycled instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
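The generation-plus-state idea in steps 1 to 4 of the commit message above can be sketched in user space. This is a minimal illustration, not the kernel implementation: the names `gstate_make`, `gstate_update`, `abort_still_valid` and the 2-bit state width are assumptions made for the example, and the seqcount and RCU parts are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request states; the low bits of the word hold the state. */
enum rq_state { RQ_IDLE = 0, RQ_IN_FLIGHT = 1, RQ_COMPLETE = 2 };

#define STATE_BITS 2
#define STATE_MASK ((UINT64_C(1) << STATE_BITS) - 1)

/* Pack a generation number and a state into one 64-bit value. */
static uint64_t gstate_make(uint64_t gen, enum rq_state st)
{
	assert((uint64_t)st <= STATE_MASK);
	return (gen << STATE_BITS) | (uint64_t)st;
}

static enum rq_state gstate_state(uint64_t g)
{
	return (enum rq_state)(g & STATE_MASK);
}

static uint64_t gstate_gen(uint64_t g)
{
	return g >> STATE_BITS;
}

/*
 * Owner-only update: every transition to in-flight bumps the generation,
 * so two recycle instances of the same request never share a value.
 */
static uint64_t gstate_update(uint64_t g, enum rq_state st)
{
	uint64_t gen = gstate_gen(g);

	if (st == RQ_IN_FLIGHT)
		gen++;
	return gstate_make(gen, st);
}

/*
 * The timeout path records the value it decided to abort. After waiting,
 * the abort is still valid only if the request has not changed state or
 * been recycled in the meantime.
 */
static bool abort_still_valid(uint64_t current, uint64_t aborted)
{
	return current == aborted;
}
```

If a request completes and is reissued between the timeout verdict and the second scan, its generation differs from the recorded one and the stale abort is dropped.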
|
|
|
if (blk_mq_init_request(set, rq, hctx_idx, node)) {
|
|
|
|
tags->static_rqs[i] = NULL;
|
|
|
|
goto fail;
|
2014-04-16 03:59:10 +08:00
|
|
|
}
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
p += rq_size;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
|
2017-01-12 05:29:56 +08:00
|
|
|
return 0;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
fail:
|
2017-01-12 05:29:56 +08:00
|
|
|
blk_mq_free_rqs(set, tags, hctx_idx);
|
|
|
|
return -ENOMEM;
|
2013-10-24 16:20:05 +08:00
|
|
|
}
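The per-page arithmetic in the allocation loop above (`entries_per_page`, `to_do`) can be tried out in isolation. The sketch below is a user-space approximation under stated assumptions: `rqs_from_chunk` and `chunks_needed` are hypothetical helpers, every chunk has the same size, and no memory is actually allocated.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: how many fixed-size requests are carved out of one
 * chunk when `remaining` requests still need a slot. Mirrors
 * to_do = min(entries_per_page, depth - i).
 */
static size_t rqs_from_chunk(size_t chunk_size, size_t rq_size, size_t remaining)
{
	assert(rq_size > 0);
	size_t entries_per_chunk = chunk_size / rq_size;

	return entries_per_chunk < remaining ? entries_per_chunk : remaining;
}

/* Count how many equally sized chunks are needed to place `depth` requests. */
static size_t chunks_needed(size_t chunk_size, size_t rq_size, size_t depth)
{
	size_t i = 0, chunks = 0;

	assert(rq_size > 0 && rq_size <= chunk_size);
	while (i < depth) {
		i += rqs_from_chunk(chunk_size, rq_size, depth - i);
		chunks++;
	}
	return chunks;
}
```

With a 4096-byte chunk and 384-byte requests, ten requests fit per chunk and the trailing 256 bytes stay unused, so a depth of 25 takes three chunks.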
|
|
|
|
|
2020-05-29 21:53:15 +08:00
|
|
|
struct rq_iter_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
bool has_rq;
|
|
|
|
};
|
|
|
|
|
2022-07-06 20:03:53 +08:00
|
|
|
static bool blk_mq_has_request(struct request *rq, void *data)
|
2020-05-29 21:53:15 +08:00
|
|
|
{
|
|
|
|
struct rq_iter_data *iter_data = data;
|
|
|
|
|
|
|
|
if (rq->mq_hctx != iter_data->hctx)
|
|
|
|
return true;
|
|
|
|
iter_data->has_rq = true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
struct blk_mq_tags *tags = hctx->sched_tags ?
|
|
|
|
hctx->sched_tags : hctx->tags;
|
|
|
|
struct rq_iter_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
};
|
|
|
|
|
|
|
|
blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
|
|
|
|
return data.has_rq;
|
|
|
|
}
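The callback pattern used by blk_mq_has_request() and blk_mq_hctx_has_requests() can be sketched in user space as follows. It assumes the iterator stops as soon as the callback returns false, which is what the return values above suggest; `struct item`, `for_each_item` and `has_item_for` are hypothetical stand-ins, not block layer APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request that belongs to some owner. */
struct item {
	int owner;
};

/* Search state passed through the opaque data pointer. */
struct find_data {
	int owner;
	bool found;
};

typedef bool (*iter_fn)(struct item *it, void *data);

/* Visit items in order until the callback returns false. */
static void for_each_item(struct item *items, size_t n, iter_fn fn, void *data)
{
	assert(fn != NULL);
	for (size_t i = 0; i < n; i++)
		if (!fn(&items[i], data))
			break;
}

/* Return true to keep iterating, false once a match has been recorded. */
static bool match_owner(struct item *it, void *data)
{
	struct find_data *fd = data;

	if (it->owner != fd->owner)
		return true;
	fd->found = true;
	return false;
}

static bool has_item_for(struct item *items, size_t n, int owner)
{
	struct find_data fd = { .owner = owner };

	for_each_item(items, n, match_owner, &fd);
	return fd.found;
}
```

Stopping at the first match keeps the check cheap when only a yes/no answer is needed.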
|
|
|
|
|
2024-03-22 10:12:44 +08:00
|
|
|
static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
|
|
|
|
unsigned int this_cpu)
|
2020-05-29 21:53:15 +08:00
|
|
|
{
|
2024-03-22 10:12:44 +08:00
|
|
|
enum hctx_type type = hctx->type;
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* hctx->cpumask has to rule out isolated CPUs, but userspace still
|
|
|
|
* might submit IOs on these isolated CPUs, so use the queue map to
|
|
|
|
* check if all CPUs mapped to this hctx are offline
|
|
|
|
*/
|
|
|
|
for_each_online_cpu(cpu) {
|
|
|
|
struct blk_mq_hw_ctx *h = blk_mq_map_queue_type(hctx->queue,
|
|
|
|
type, cpu);
|
|
|
|
|
|
|
|
if (h != hctx)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* this hctx has at least one online CPU */
|
|
|
|
if (this_cpu != cpu)
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
2020-05-29 21:53:15 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
|
|
|
|
struct blk_mq_hw_ctx, cpuhp_online);
|
|
|
|
|
2024-03-22 10:12:44 +08:00
|
|
|
if (blk_mq_hctx_has_online_cpu(hctx, cpu))
|
2020-05-29 21:53:15 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prevent new requests from being allocated on the current hctx.
|
|
|
|
*
|
|
|
|
* The smp_mb__after_atomic() pairs with the implied barrier in
|
|
|
|
* test_and_set_bit_lock in sbitmap_get(). Ensures the inactive flag is
|
|
|
|
* seen once we return from the tag allocator.
|
|
|
|
*/
|
|
|
|
set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
|
|
|
|
smp_mb__after_atomic();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to grab a reference to the queue and wait for any outstanding
|
|
|
|
* requests. If we could not grab a reference the queue has been
|
|
|
|
* frozen and there are no requests.
|
|
|
|
*/
|
|
|
|
if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
|
|
|
|
while (blk_mq_hctx_has_requests(hctx))
|
|
|
|
msleep(5);
|
|
|
|
percpu_ref_put(&hctx->queue->q_usage_counter);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
|
|
|
|
struct blk_mq_hw_ctx, cpuhp_online);
|
|
|
|
|
|
|
|
if (cpumask_test_cpu(cpu, hctx->cpumask))
|
|
|
|
clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-08-25 05:34:35 +08:00
|
|
|
/*
|
|
|
|
* 'cpu' is going away. Splice any existing rq_list entries from this
|
|
|
|
* software queue to the hw queue dispatch list, and ensure that it
|
|
|
|
* gets run.
|
|
|
|
*/
|
2016-09-22 22:05:17 +08:00
|
|
|
static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
|
2014-05-22 04:01:15 +08:00
|
|
|
{
|
2016-09-22 22:05:17 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2014-05-22 04:01:15 +08:00
|
|
|
struct blk_mq_ctx *ctx;
|
|
|
|
LIST_HEAD(tmp);
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type;
|
2014-05-22 04:01:15 +08:00
|
|
|
|
2016-09-22 22:05:17 +08:00
|
|
|
hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
|
2020-05-29 21:53:15 +08:00
|
|
|
if (!cpumask_test_cpu(cpu, hctx->cpumask))
|
|
|
|
return 0;
|
|
|
|
|
2016-08-25 05:34:35 +08:00
|
|
|
ctx = __blk_mq_get_ctx(hctx->queue, cpu);
|
2018-12-17 23:44:05 +08:00
|
|
|
type = hctx->type;
|
2014-05-22 04:01:15 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
if (!list_empty(&ctx->rq_lists[type])) {
|
|
|
|
list_splice_init(&ctx->rq_lists[type], &tmp);
|
2014-05-22 04:01:15 +08:00
|
|
|
blk_mq_hctx_clear_pending(hctx, ctx);
|
|
|
|
}
|
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
|
|
|
|
if (list_empty(&tmp))
|
2016-09-22 22:05:17 +08:00
|
|
|
return 0;
|
2014-05-22 04:01:15 +08:00
|
|
|
|
2016-08-25 05:34:35 +08:00
|
|
|
spin_lock(&hctx->lock);
|
|
|
|
list_splice_tail_init(&tmp, &hctx->dispatch);
|
|
|
|
spin_unlock(&hctx->lock);
|
2014-05-22 04:01:15 +08:00
|
|
|
|
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
2016-09-22 22:05:17 +08:00
|
|
|
return 0;
|
2014-05-22 04:01:15 +08:00
|
|
|
}
|
|
|
|
|
2016-09-22 22:05:17 +08:00
|
|
|
static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
|
2014-05-22 04:01:15 +08:00
|
|
|
{
|
2020-05-29 21:53:15 +08:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_STACKING))
|
|
|
|
cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
|
|
|
|
&hctx->cpuhp_online);
|
2016-09-22 22:05:17 +08:00
|
|
|
cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
|
|
|
|
&hctx->cpuhp_dead);
|
2014-05-22 04:01:15 +08:00
|
|
|
}
|
|
|
|
|
2021-05-11 23:22:36 +08:00
|
|
|
/*
|
|
|
|
* Before freeing the hw queue, clear the flush request reference in
|
|
|
|
* tags->rqs[] to avoid a potential UAF.
|
|
|
|
*/
|
|
|
|
static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
|
|
|
|
unsigned int queue_depth, struct request *flush_rq)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
/* The hw queue may not be mapped yet */
|
|
|
|
if (!tags)
|
|
|
|
return;
|
|
|
|
|
2021-10-15 04:39:59 +08:00
|
|
|
WARN_ON_ONCE(req_ref_read(flush_rq) != 0);
|
2021-05-11 23:22:36 +08:00
|
|
|
|
|
|
|
for (i = 0; i < queue_depth; i++)
|
|
|
|
cmpxchg(&tags->rqs[i], flush_rq, NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait until all pending iterations are done.
|
|
|
|
*
|
|
|
|
* Request reference is cleared and it is guaranteed to be observed
|
|
|
|
* after the ->lock is released.
|
|
|
|
*/
|
|
|
|
spin_lock_irqsave(&tags->lock, flags);
|
|
|
|
spin_unlock_irqrestore(&tags->lock, flags);
|
|
|
|
}
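The compare-and-exchange loop in blk_mq_clear_flush_rq_mapping() clears only those slots that still point at the flush request and leaves every other slot alone. A user-space sketch of that step, using C11 atomics in place of the kernel's cmpxchg() and a hypothetical helper name `clear_matching_slots`, might look like this. The lock/unlock pair that waits for in-progress iterators is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Clear every slot that still points at `victim` and return how many
 * slots were cleared. Slots holding any other pointer are untouched.
 */
static size_t clear_matching_slots(_Atomic(void *) *slots, size_t n, void *victim)
{
	size_t cleared = 0;

	assert(slots != NULL);
	for (size_t i = 0; i < n; i++) {
		/* On failure `expected` is overwritten, so reset it each round. */
		void *expected = victim;

		if (atomic_compare_exchange_strong(&slots[i], &expected, NULL))
			cleared++;
	}
	return cleared;
}
```

Using compare-and-exchange instead of a plain store means a slot that was concurrently reassigned to another request is not wiped by mistake.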

/* hctx->ctxs will be freed in queue's release handler */
static void blk_mq_exit_hctx(struct request_queue *q,
		struct blk_mq_tag_set *set,
		struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
{
	struct request *flush_rq = hctx->fq->flush_rq;

	if (blk_mq_hw_queue_mapped(hctx))
		blk_mq_tag_idle(hctx);

	if (blk_queue_init_done(q))
		blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
				set->queue_depth, flush_rq);
	if (set->ops->exit_request)
		set->ops->exit_request(set, flush_rq, hctx_idx);

	if (set->ops->exit_hctx)
		set->ops->exit_hctx(hctx, hctx_idx);

	blk_mq_remove_cpuhp(hctx);
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to reuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00

	xa_erase(&q->hctx_table, hctx_idx);

	spin_lock(&q->unused_hctx_lock);
	list_add(&hctx->hctx_list, &q->unused_hctx_list);
	spin_unlock(&q->unused_hctx_lock);
}
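
/*
 * Illustration only, not kernel code: blk_mq_exit_hctx() above parks the
 * hctx on q->unused_hctx_list instead of freeing it, and the commit message
 * says parked hctxs are reused when the NUMA node matches. The reuse side
 * is not shown in this section, so the sketch below is an assumption about
 * its shape. It uses a plain singly linked list and omits the locking that
 * q->unused_hctx_lock provides; all names are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>

struct hctx_sketch {
	int numa_node;
	struct hctx_sketch *next;
};

struct queue_sketch {
	struct hctx_sketch *unused;	/* head of the parked contexts */
};

/* Park a context for later reuse instead of freeing it. */
static void park_hctx(struct queue_sketch *q, struct hctx_sketch *h)
{
	h->next = q->unused;
	q->unused = h;
}

/* Unlink and return the first parked context on @node, or NULL. */
static struct hctx_sketch *reuse_hctx(struct queue_sketch *q, int node)
{
	struct hctx_sketch **pp;

	for (pp = &q->unused; *pp; pp = &(*pp)->next) {
		if ((*pp)->numa_node == node) {
			struct hctx_sketch *h = *pp;

			*pp = h->next;
			h->next = NULL;
			return h;
		}
	}
	return NULL;
}

/* Returns 1 when a matching node is reused and a non-matching one is not. */
static int demo_park_and_reuse(void)
{
	struct queue_sketch q = { .unused = NULL };
	struct hctx_sketch a = { .numa_node = 0 }, b = { .numa_node = 1 };

	park_hctx(&q, &a);
	park_hctx(&q, &b);

	return reuse_hctx(&q, 0) == &a && reuse_hctx(&q, 2) == NULL &&
	       q.unused == &b;
}
```

/*
 * Because parked contexts stay allocated until the queue itself is
 * released, a caller that still holds a pointer to one never touches
 * freed memory, which is the use-after-free the commit message describes.
 */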

static void blk_mq_exit_hw_queues(struct request_queue *q,
		struct blk_mq_tag_set *set, int nr_queue)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (i == nr_queue)
			break;
		blk_mq_exit_hctx(q, set, hctx, i);
	}
}

static int blk_mq_init_hctx(struct request_queue *q,
		struct blk_mq_tag_set *set,
		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
{
	hctx->queue_num = hctx_idx;

	if (!(hctx->flags & BLK_MQ_F_STACKING))
		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
				&hctx->cpuhp_online);
	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);

	hctx->tags = set->tags[hctx_idx];

	if (set->ops->init_hctx &&
	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
		goto unregister_cpu_notifier;

	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
				hctx->numa_node))
		goto exit_hctx;

	if (xa_insert(&q->hctx_table, hctx_idx, hctx, GFP_KERNEL))
		goto exit_flush_rq;

	return 0;

exit_flush_rq:
	if (set->ops->exit_request)
		set->ops->exit_request(set, hctx->fq->flush_rq, hctx_idx);
exit_hctx:
	if (set->ops->exit_hctx)
		set->ops->exit_hctx(hctx, hctx_idx);
unregister_cpu_notifier:
	blk_mq_remove_cpuhp(hctx);
	return -1;
}
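
/*
 * Illustration only, not kernel code: blk_mq_init_hctx() above unwinds a
 * failed setup through a chain of goto labels, so each label undoes exactly
 * the steps that had already succeeded, in reverse order. The userspace
 * sketch below records setup steps as upper-case letters and teardown
 * steps as lower-case letters; step(), mark() and init_sketch() are
 * hypothetical helpers.
 */

```c
#include <assert.h>
#include <string.h>

/* Records which setup and teardown steps ran, in order. */
static char trace[32];

static void mark(char c)
{
	size_t n = strlen(trace);

	if (n + 1 < sizeof(trace)) {
		trace[n] = c;
		trace[n + 1] = '\0';
	}
}

/* A step succeeds (and is recorded) unless its index equals @fail_at. */
static int step(int idx, int fail_at, char c)
{
	if (idx == fail_at)
		return -1;
	mark(c);
	return 0;
}

/* @fail_at selects which step fails; 0 means every step succeeds. */
static int init_sketch(int fail_at)
{
	trace[0] = '\0';

	mark('A');			/* registration, not checked here */
	if (step(1, fail_at, 'B'))	/* driver init callback */
		goto undo_a;
	if (step(2, fail_at, 'C'))	/* flush request init */
		goto undo_b;
	if (step(3, fail_at, 'D'))	/* publish in the lookup table */
		goto undo_c;
	return 0;

undo_c:
	mark('c');
undo_b:
	mark('b');
undo_a:
	mark('a');
	return -1;
}
```

/*
 * A failure in the third step leaves the trace "ABCcba": the labels fall
 * through, so later labels never need to know which earlier step failed.
 */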

static struct blk_mq_hw_ctx *
blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
		int node)
{
	struct blk_mq_hw_ctx *hctx;
	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;

	hctx = kzalloc_node(sizeof(struct blk_mq_hw_ctx), gfp, node);
	if (!hctx)
		goto fail_alloc_hctx;

	if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
		goto free_hctx;

	atomic_set(&hctx->nr_active, 0);
	if (node == NUMA_NO_NODE)
		node = set->numa_node;
	hctx->numa_node = node;

	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
	spin_lock_init(&hctx->lock);
	INIT_LIST_HEAD(&hctx->dispatch);
	hctx->queue = q;
	hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;

	INIT_LIST_HEAD(&hctx->hctx_list);

	/*
	 * Allocate space for all possible cpus to avoid allocation at
	 * runtime
	 */
	hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
			gfp, node);
	if (!hctx->ctxs)
		goto free_cpumask;

	if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
				gfp, node, false, false))
		goto free_ctxs;
	hctx->nr_ctx = 0;

	spin_lock_init(&hctx->dispatch_wait_lock);
	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);

	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
	if (!hctx->fq)
		goto free_bitmap;

	blk_mq_hctx_kobj_init(hctx);

	return hctx;

free_bitmap:
	sbitmap_free(&hctx->ctx_map);
free_ctxs:
	kfree(hctx->ctxs);
free_cpumask:
	free_cpumask_var(hctx->cpumask);
free_hctx:
	kfree(hctx);
fail_alloc_hctx:
	return NULL;
}
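
/*
 * Illustration only, not kernel code: blk_mq_alloc_hctx() above passes
 * ilog2(8) as the shift argument when it sizes ctx_map for nr_cpu_ids
 * bits. The sketch below assumes that the shift means "each bitmap word
 * holds 1 << shift bits", so the word count is the depth rounded up to a
 * whole number of words. That reading of the sbitmap layout is an
 * assumption, since the sbitmap code is not part of this section, and the
 * helper names are hypothetical.
 */

```c
#include <assert.h>

/* Floor of log2(v) for v > 0, standing in for the kernel's ilog2(). */
static unsigned int ilog2_u32(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Number of words needed for @depth bits at (1 << @shift) bits per word. */
static unsigned int sketch_map_words(unsigned int depth, unsigned int shift)
{
	unsigned int bits_per_word = 1U << shift;

	return (depth + bits_per_word - 1) / bits_per_word;
}

/* Word index that holds bit @idx. */
static unsigned int sketch_word_of(unsigned int idx, unsigned int shift)
{
	return idx >> shift;
}

/* Position of bit @idx inside its word. */
static unsigned int sketch_bit_of(unsigned int idx, unsigned int shift)
{
	return idx & ((1U << shift) - 1);
}
```

/*
 * Under that assumption, a machine with 12 possible CPUs needs two words,
 * and the software queue with index 11 maps to word 1, bit 3.
 */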
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
static void blk_mq_init_cpu_queues(struct request_queue *q,
|
|
|
|
unsigned int nr_hw_queues)
|
|
|
|
{
|
2018-10-31 00:36:06 +08:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
unsigned int i, j;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
for_each_possible_cpu(i) {
|
|
|
|
struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i);
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2018-12-17 23:44:05 +08:00
|
|
|
int k;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
__ctx->cpu = i;
|
|
|
|
spin_lock_init(&__ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
for (k = HCTX_TYPE_DEFAULT; k < HCTX_MAX_TYPES; k++)
|
|
|
|
INIT_LIST_HEAD(&__ctx->rq_lists[k]);
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
__ctx->queue = q;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set local node, IFF we have more than one hw queue. If
|
|
|
|
* not, we remain on the home node of the device
|
|
|
|
*/
|
2018-10-31 00:36:06 +08:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
|
|
|
hctx = blk_mq_map_queue_type(q, j, i);
|
|
|
|
if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
|
2020-10-19 16:20:47 +08:00
|
|
|
hctx->numa_node = cpu_to_node(i);
|
2018-10-31 00:36:06 +08:00
|
|
|
}
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx,
|
|
|
|
unsigned int depth)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
2021-10-05 18:23:35 +08:00
|
|
|
struct blk_mq_tags *tags;
|
|
|
|
int ret;
|
2017-01-12 05:29:56 +08:00
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
tags = blk_mq_alloc_rq_map(set, hctx_idx, depth, set->reserved_tags);
|
2021-10-05 18:23:35 +08:00
|
|
|
if (!tags)
|
|
|
|
return NULL;
|
2017-01-12 05:29:56 +08:00
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
ret = blk_mq_alloc_rqs(set, tags, hctx_idx, depth);
|
|
|
|
if (ret) {
|
2021-10-05 18:23:37 +08:00
|
|
|
blk_mq_free_rq_map(tags);
|
2021-10-05 18:23:35 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
2017-01-12 05:29:56 +08:00
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
return tags;
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
static bool __blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
|
|
|
|
int hctx_idx)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
|
|
|
set->tags[hctx_idx] = set->shared_tags;
|
2020-08-19 23:20:22 +08:00
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
return true;
|
2017-01-17 21:03:22 +08:00
|
|
|
}
|
2021-10-05 18:23:37 +08:00
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs(set, hctx_idx,
|
|
|
|
set->queue_depth);
|
|
|
|
|
|
|
|
return set->tags[hctx_idx];
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:36 +08:00
|
|
|
void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_tags *tags,
|
|
|
|
unsigned int hctx_idx)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
2021-10-05 18:23:36 +08:00
|
|
|
if (tags) {
|
|
|
|
blk_mq_free_rqs(set, tags, hctx_idx);
|
2021-10-05 18:23:37 +08:00
|
|
|
blk_mq_free_rq_map(tags);
|
2017-01-17 21:03:22 +08:00
|
|
|
}
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
static void __blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx)
|
|
|
|
{
|
2021-10-05 18:23:39 +08:00
|
|
|
if (!blk_mq_is_shared_tags(set->flags))
|
2021-10-05 18:23:37 +08:00
|
|
|
blk_mq_free_map_and_rqs(set, set->tags[hctx_idx], hctx_idx);
|
|
|
|
|
|
|
|
set->tags[hctx_idx] = NULL;
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
2017-06-26 18:20:57 +08:00
|
|
|
static void blk_mq_map_swqueue(struct request_queue *q)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned int j, hctx_idx;
|
|
|
|
unsigned long i;
|
2013-10-24 16:20:05 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct blk_mq_ctx *ctx;
|
2015-04-21 10:00:20 +08:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2014-04-10 00:18:23 +08:00
|
|
|
cpumask_clear(hctx->cpumask);
|
2013-10-24 16:20:05 +08:00
|
|
|
hctx->nr_ctx = 0;
|
2018-05-18 22:32:30 +08:00
|
|
|
hctx->dispatch_from = NULL;
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-06-26 18:20:57 +08:00
|
|
|
* Map software to hardware queues.
|
2018-04-25 04:01:44 +08:00
|
|
|
*
|
|
|
|
* If the cpu isn't present, the cpu is mapped to the first hctx.
|
2013-10-24 16:20:05 +08:00
|
|
|
*/
|
2018-01-12 10:53:06 +08:00
|
|
|
for_each_possible_cpu(i) {
|
2018-04-25 04:01:44 +08:00
|
|
|
|
2016-03-19 18:30:33 +08:00
|
|
|
ctx = per_cpu_ptr(q->queue_ctx, i);
|
2018-10-31 00:36:06 +08:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
2019-01-24 18:25:33 +08:00
|
|
|
if (!set->map[j].nr_queues) {
|
|
|
|
ctx->hctxs[j] = blk_mq_map_queue_type(q,
|
|
|
|
HCTX_TYPE_DEFAULT, i);
|
2018-12-18 01:28:56 +08:00
|
|
|
continue;
|
2019-01-24 18:25:33 +08:00
|
|
|
}
|
block: alloc map and request for new hardware queue
Allocate a new map and requests for each new hardware queue when the
hardware queue count increases. Before this patch, a warning was shown
for each new hardware queue, but that is not enough: these hctxs have
no maps or requests, so when a bio is mapped to one of these hardware
queues, getting a request from that hctx triggers a kernel panic.
Test environment:
* A NVMe disk supports 128 io queues
* 96 cpus in system
A corner case that always triggers this panic: 96 io queues are
allocated for the HCTX_TYPE_DEFAULT type, and the corresponding kernel
log is: nvme nvme0: 96/0/0 default/read/poll queues. If we now set nvme write
queues to 96, nvme allocates the other 32 queues for read, but
blk_mq_update_nr_hw_queues does not allocate maps and requests for these
newly added io queues. So when a process reads the nvme disk, getting a
request from these hardware contexts triggers a kernel panic.
Reproduce script:
nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
echo $nr > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller
dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
[ 8040.805626] ------------[ cut here ]------------
[ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfnetlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
[ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
[ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
[ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
[ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
[ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
[ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
[ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
[ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
[ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
[ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8040.805647] PKRU: 55555554
[ 8040.805647] Call Trace:
[ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
[ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
[ 8040.805651] process_one_work+0x1a7/0x370
[ 8040.805652] worker_thread+0x1c9/0x380
[ 8040.805653] ? max_active_store+0x80/0x80
[ 8040.805655] kthread+0x112/0x130
[ 8040.805656] ? __kthread_parkme+0x70/0x70
[ 8040.805657] ret_from_fork+0x35/0x40
[ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
[ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 8229.365165] #PF: supervisor read access in kernel mode
[ 8229.365178] #PF: error_code(0x0000) - not-present page
[ 8229.365191] PGD 0 P4D 0
[ 8229.365201] Oops: 0000 [#1] SMP PTI
[ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
[ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
[ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
[ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
[ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
[ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
[ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
[ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
[ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
[ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
[ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8229.365476] PKRU: 55555554
[ 8229.365484] Call Trace:
[ 8229.365498] ? finish_wait+0x80/0x80
[ 8229.365512] blk_mq_get_request+0xcb/0x3f0
[ 8229.365525] blk_mq_make_request+0x143/0x5d0
[ 8229.365538] generic_make_request+0xcf/0x310
[ 8229.365553] ? scan_shadow_nodes+0x30/0x30
[ 8229.365564] submit_bio+0x3c/0x150
[ 8229.365576] mpage_readpages+0x163/0x1a0
[ 8229.365588] ? blkdev_direct_IO+0x490/0x490
[ 8229.365601] read_pages+0x6b/0x190
[ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
[ 8229.365626] ondemand_readahead+0x182/0x2f0
[ 8229.365639] generic_file_buffered_read+0x590/0xab0
[ 8229.365655] new_sync_read+0x12a/0x1c0
[ 8229.365666] vfs_read+0x8a/0x140
[ 8229.365676] ksys_read+0x59/0xd0
[ 8229.365688] do_syscall_64+0x55/0x1d0
[ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07 21:04:08 +08:00
|
|
|
hctx_idx = set->map[j].mq_map[i];
|
|
|
|
/* unmapped hw queue can be remapped after CPU topo changed */
|
|
|
|
if (!set->tags[hctx_idx] &&
|
2021-10-05 18:23:35 +08:00
|
|
|
!__blk_mq_alloc_map_and_rqs(set, hctx_idx)) {
|
2020-05-07 21:04:08 +08:00
|
|
|
/*
|
|
|
|
* If tags initialization fails for some hctx,
|
|
|
|
* that hctx won't be brought online. In this
|
|
|
|
* case, remap the current ctx to hctx[0] which
|
|
|
|
* is guaranteed to always have tags allocated
|
|
|
|
*/
|
|
|
|
set->map[j].mq_map[i] = 0;
|
|
|
|
}
|
2018-12-18 01:28:56 +08:00
|
|
|
|
2018-10-31 00:36:06 +08:00
|
|
|
hctx = blk_mq_map_queue_type(q, j, i);
|
2019-01-24 18:25:32 +08:00
|
|
|
ctx->hctxs[j] = hctx;
|
2018-10-31 00:36:06 +08:00
|
|
|
/*
|
|
|
|
* If the CPU is already set in the mask, then we've
|
|
|
|
* mapped this one already. This can happen if
|
|
|
|
* devices share queues across queue maps.
|
|
|
|
*/
|
|
|
|
if (cpumask_test_cpu(i, hctx->cpumask))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
cpumask_set_cpu(i, hctx->cpumask);
|
|
|
|
hctx->type = j;
|
|
|
|
ctx->index_hw[hctx->type] = hctx->nr_ctx;
|
|
|
|
hctx->ctxs[hctx->nr_ctx++] = ctx;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the nr_ctx type overflows, we have exceeded the
|
|
|
|
* amount of sw queues we can support.
|
|
|
|
*/
|
|
|
|
BUG_ON(!hctx->nr_ctx);
|
|
|
|
}
|
2019-01-24 18:25:33 +08:00
|
|
|
|
|
|
|
for (; j < HCTX_MAX_TYPES; j++)
|
|
|
|
ctx->hctxs[j] = blk_mq_map_queue_type(q,
|
|
|
|
HCTX_TYPE_DEFAULT, i);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-05-08 00:26:44 +08:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2024-03-22 10:12:44 +08:00
|
|
|
int cpu;
|
|
|
|
|
2018-04-25 04:01:44 +08:00
|
|
|
/*
|
|
|
|
* If no software queues are mapped to this hardware queue,
|
|
|
|
* disable it and free the request entries.
|
|
|
|
*/
|
|
|
|
if (!hctx->nr_ctx) {
|
|
|
|
/* Never unmap queue 0. We need it as a
|
|
|
|
* fallback in case a new remap fails
|
|
|
|
* allocation
|
|
|
|
*/
|
2021-10-05 18:23:37 +08:00
|
|
|
if (i)
|
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
2018-04-25 04:01:44 +08:00
|
|
|
|
|
|
|
hctx->tags = NULL;
|
|
|
|
continue;
|
|
|
|
}
|
2014-05-22 04:01:15 +08:00
|
|
|
|
2015-04-21 10:00:20 +08:00
|
|
|
hctx->tags = set->tags[i];
|
|
|
|
WARN_ON(!hctx->tags);
|
|
|
|
|
2015-04-16 01:39:29 +08:00
|
|
|
/*
|
|
|
|
* Set the map size to the number of mapped software queues.
|
|
|
|
* This is more accurate and more efficient than looping
|
|
|
|
* over all possibly mapped software queues.
|
|
|
|
*/
|
2016-09-17 22:38:44 +08:00
|
|
|
sbitmap_resize(&hctx->ctx_map, hctx->nr_ctx);
|
2015-04-16 01:39:29 +08:00
|
|
|
|
2024-03-22 10:12:44 +08:00
|
|
|
/*
|
|
|
|
* Rule out isolated CPUs from hctx->cpumask to avoid
|
|
|
|
* running block kworker on isolated CPUs
|
|
|
|
*/
|
|
|
|
for_each_cpu(cpu, hctx->cpumask) {
|
|
|
|
if (cpu_is_isolated(cpu))
|
|
|
|
cpumask_clear_cpu(cpu, hctx->cpumask);
|
|
|
|
}
|
|
|
|
|
2014-05-22 04:01:15 +08:00
|
|
|
/*
|
|
|
|
* Initialize batch roundrobin counts
|
|
|
|
*/
|
2018-04-08 17:48:10 +08:00
|
|
|
hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
|
2014-05-08 00:26:44 +08:00
|
|
|
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
|
|
|
|
}
|
2013-10-24 16:20:05 +08:00
|
|
|
}
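The software-to-hardware mapping loop above can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel code: it models a single queue map (the kernel iterates over set->nr_maps), and the names model_hctx, model_map_swqueue, NR_CPUS, NR_HWQ and has_tags are hypothetical stand-ins for the kernel structures and for the result of tag allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 8 /* hypothetical CPU count for the model */
#define NR_HWQ  4 /* hypothetical hardware queue count */

/* Hypothetical, simplified stand-in for struct blk_mq_hw_ctx. */
struct model_hctx {
	unsigned int nr_ctx;   /* number of software queues mapped here */
	bool cpumask[NR_CPUS]; /* which CPUs feed this hardware queue */
	int ctxs[NR_CPUS];     /* CPU ids of the mapped software queues */
};

/*
 * Map every CPU to a hardware queue. mq_map[cpu] holds the requested
 * queue; has_tags[q] says whether tag allocation for queue q succeeded.
 * A CPU whose queue has no tags is remapped to queue 0, which the model
 * assumes always has tags. index_hw[cpu] receives the CPU's slot within
 * the queue it ends up on.
 */
void model_map_swqueue(unsigned int mq_map[NR_CPUS],
		       const bool has_tags[NR_HWQ],
		       struct model_hctx hctx[NR_HWQ],
		       unsigned int index_hw[NR_CPUS])
{
	/* Reset all hardware queues before remapping. */
	for (size_t q = 0; q < NR_HWQ; q++) {
		hctx[q].nr_ctx = 0;
		for (size_t c = 0; c < NR_CPUS; c++)
			hctx[q].cpumask[c] = false;
	}

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct model_hctx *h;

		/* Fall back to queue 0 when the target has no tags. */
		if (!has_tags[mq_map[cpu]])
			mq_map[cpu] = 0;

		h = &hctx[mq_map[cpu]];
		h->cpumask[cpu] = true;
		index_hw[cpu] = h->nr_ctx;
		h->ctxs[h->nr_ctx++] = (int)cpu;

		/* The counter must not wrap around to zero. */
		assert(h->nr_ctx != 0);
	}
}
```

With a map of {0, 1, 2, 3, 0, 1, 2, 3} and no tags on queue 2, CPUs 2 and 6 end up on queue 0 in this model, and queue 2 is left with nr_ctx == 0, which is the state the second loop in the function treats as "no software queues mapped".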
|
|
|
|
|
2017-06-21 07:56:13 +08:00
|
|
|
/*
|
|
|
|
* Caller needs to ensure that we're either frozen/quiesced, or that
|
|
|
|
* the queue isn't live yet.
|
|
|
|
*/
|
2015-11-03 23:40:06 +08:00
|
|
|
static void queue_set_hctx_shared(struct request_queue *q, bool shared)
|
2014-05-14 05:10:52 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2014-05-14 05:10:52 +08:00
|
|
|
|
2015-11-03 23:40:06 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2021-07-31 14:21:30 +08:00
|
|
|
if (shared) {
|
2020-08-19 23:20:19 +08:00
|
|
|
hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
|
2021-07-31 14:21:30 +08:00
|
|
|
} else {
|
|
|
|
blk_mq_tag_idle(hctx);
|
2020-08-19 23:20:19 +08:00
|
|
|
hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
|
2021-07-31 14:21:30 +08:00
|
|
|
}
|
2015-11-03 23:40:06 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-08-19 23:20:20 +08:00
|
|
|
static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
|
|
|
|
bool shared)
|
2015-11-03 23:40:06 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2014-05-14 05:10:52 +08:00
|
|
|
|
2017-04-08 02:16:49 +08:00
|
|
|
lockdep_assert_held(&set->tag_list_lock);
|
|
|
|
|
2014-05-14 05:10:52 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_freeze_queue(q);
|
2015-11-03 23:40:06 +08:00
|
|
|
queue_set_hctx_shared(q, shared);
|
2014-05-14 05:10:52 +08:00
|
|
|
blk_mq_unfreeze_queue(q);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_del_queue_tag_set(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
2020-07-28 21:29:51 +08:00
|
|
|
list_del(&q->tag_set_list);
|
2015-11-03 23:40:06 +08:00
|
|
|
if (list_is_singular(&set->tag_list)) {
|
|
|
|
/* just transitioned to unshared */
|
2020-08-19 23:20:19 +08:00
|
|
|
set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
|
2015-11-03 23:40:06 +08:00
|
|
|
/* update existing queue */
|
2020-08-19 23:20:20 +08:00
|
|
|
blk_mq_update_tag_set_shared(set, false);
|
2015-11-03 23:40:06 +08:00
|
|
|
}
|
2014-05-14 05:10:52 +08:00
|
|
|
mutex_unlock(&set->tag_list_lock);
|
2018-06-11 04:38:24 +08:00
|
|
|
INIT_LIST_HEAD(&q->tag_set_list);
|
2014-05-14 05:10:52 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
2015-11-03 23:40:06 +08:00
|
|
|
|
2017-11-11 13:05:12 +08:00
|
|
|
/*
|
|
|
|
* Check to see if we're transitioning to shared (from 1 to 2 queues).
|
|
|
|
*/
|
|
|
|
if (!list_empty(&set->tag_list) &&
|
2020-08-19 23:20:19 +08:00
|
|
|
!(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
|
|
|
|
set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
|
2015-11-03 23:40:06 +08:00
|
|
|
/* update existing queue */
|
2020-08-19 23:20:20 +08:00
|
|
|
blk_mq_update_tag_set_shared(set, true);
|
2015-11-03 23:40:06 +08:00
|
|
|
}
|
2020-08-19 23:20:19 +08:00
|
|
|
if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
|
2015-11-03 23:40:06 +08:00
|
|
|
queue_set_hctx_shared(q, true);
|
2020-07-28 21:29:51 +08:00
|
|
|
list_add_tail(&q->tag_set_list, &set->tag_list);
|
2015-11-03 23:40:06 +08:00
|
|
|
|
2014-05-14 05:10:52 +08:00
|
|
|
mutex_unlock(&set->tag_list_lock);
|
|
|
|
}
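The two helpers above flip BLK_MQ_F_TAG_QUEUE_SHARED exactly when the number of queues attached to a tag set crosses between one and two. A minimal user-space sketch of that state machine follows; `toy_tag_set`, `toy_add_queue` and `toy_del_queue` are hypothetical names invented for illustration, and the sketch leaves out the locking, queue freezing and per-hctx flag updates that the kernel code performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct blk_mq_tag_set: a queue count plus the shared flag. */
struct toy_tag_set {
	int nr_queues;	/* number of queues currently attached */
	bool shared;	/* models BLK_MQ_F_TAG_QUEUE_SHARED */
};

/* Models blk_mq_add_queue_tag_set(): attaching a second queue sets the flag. */
static void toy_add_queue(struct toy_tag_set *set)
{
	if (set->nr_queues > 0 && !set->shared)
		set->shared = true;
	set->nr_queues++;
}

/* Models blk_mq_del_queue_tag_set(): dropping back to one queue clears the flag. */
static void toy_del_queue(struct toy_tag_set *set)
{
	assert(set->nr_queues > 0);
	set->nr_queues--;
	if (set->nr_queues == 1)
		set->shared = false;
}
```

Adding a third queue leaves the flag untouched, and removing one of three queues keeps it set; only the transition to or from a single queue changes it.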
|
|
|
|
|
2018-11-20 09:44:35 +08:00
|
|
|
/* All allocations will be freed in release handler of q->mq_kobj */
|
|
|
|
static int blk_mq_alloc_ctxs(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_ctxs *ctxs;
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
ctxs = kzalloc(sizeof(*ctxs), GFP_KERNEL);
|
|
|
|
if (!ctxs)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ctxs->queue_ctx = alloc_percpu(struct blk_mq_ctx);
|
|
|
|
if (!ctxs->queue_ctx)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
struct blk_mq_ctx *ctx = per_cpu_ptr(ctxs->queue_ctx, cpu);
|
|
|
|
ctx->ctxs = ctxs;
|
|
|
|
}
|
|
|
|
|
|
|
|
q->mq_kobj = &ctxs->kobj;
|
|
|
|
q->queue_ctx = ctxs->queue_ctx;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
fail:
|
|
|
|
kfree(ctxs);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2015-01-29 20:17:27 +08:00
|
|
|
/*
|
|
|
|
* This is the actual release handler for mq, but it is called from the
|
|
|
|
* request queue's release handler to avoid use-after-free. q->mq_kobj
|
|
|
|
* arguably should not have been introduced, but the ctx/kctx kobjects
|
|
|
|
* cannot be grouped without it.
|
|
|
|
*/
|
|
|
|
void blk_mq_release(struct request_queue *q)
|
|
|
|
{
|
blk-mq: always free hctx after request queue is freed
In the normal queue cleanup path, hctx is released after the request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
the number of hw queues shrinks. This can easily cause use-after-free,
because one implicit rule is that it is safe to call almost all block
layer APIs while the request queue is alive; an hctx may be retrieved
by one API and then freed by blk_mq_update_nr_hw_queues(),
which finally triggers use-after-free.
Fix this issue by always freeing hctx after releasing the request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to reuse these hctxs if the NUMA
node matches.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx, *next;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2015-01-29 20:17:27 +08:00
|
|
|
|
2019-04-30 09:52:27 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
|
|
|
|
|
|
|
|
/* all hctx are in .unused_hctx_list now */
|
|
|
|
list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
|
|
|
|
list_del_init(&hctx->hctx_list);
|
2017-02-22 18:14:01 +08:00
|
|
|
kobject_put(&hctx->kobj);
|
2015-06-04 22:25:04 +08:00
|
|
|
}
|
2015-01-29 20:17:27 +08:00
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
xa_destroy(&q->hctx_table);
|
2015-01-29 20:17:27 +08:00
|
|
|
|
2017-02-22 18:14:00 +08:00
|
|
|
/*
|
|
|
|
* release .mq_kobj and sw queue's kobject now because
|
|
|
|
* both share lifetime with request queue.
|
|
|
|
*/
|
|
|
|
blk_mq_sysfs_deinit(q);
|
2015-01-29 20:17:27 +08:00
|
|
|
}
|
|
|
|
|
2024-02-13 15:34:19 +08:00
|
|
|
struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set,
|
|
|
|
struct queue_limits *lim, void *queuedata)
|
2015-03-13 11:56:02 +08:00
|
|
|
{
|
2024-02-13 15:34:19 +08:00
|
|
|
struct queue_limits default_lim = { };
|
2021-06-02 14:53:17 +08:00
|
|
|
struct request_queue *q;
|
|
|
|
int ret;
|
2015-03-13 11:56:02 +08:00
|
|
|
|
2024-02-13 15:34:19 +08:00
|
|
|
q = blk_alloc_queue(lim ? lim : &default_lim, set->numa_node);
|
2024-02-13 15:34:18 +08:00
|
|
|
if (IS_ERR(q))
|
|
|
|
return q;
|
2021-06-02 14:53:17 +08:00
|
|
|
q->queuedata = queuedata;
|
|
|
|
ret = blk_mq_init_allocated_queue(set, q);
|
|
|
|
if (ret) {
|
2022-06-19 14:05:51 +08:00
|
|
|
blk_put_queue(q);
|
2021-06-02 14:53:17 +08:00
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
2015-03-13 11:56:02 +08:00
|
|
|
return q;
|
|
|
|
}
|
2024-02-13 15:34:19 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_queue);
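blk_mq_alloc_queue() reports failure by encoding a negative errno in the returned pointer rather than returning NULL, which is why callers test the result with IS_ERR(). The sketch below re-creates that idiom in user space, assuming the usual convention that the top 4095 addresses are reserved for error codes; every `toy_` name is hypothetical, and `fail_init` only simulates blk_mq_init_allocated_queue() failing.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True when the pointer lies in the reserved error range. */
static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical stand-in for struct request_queue. */
struct toy_queue {
	void *queuedata;
};

/* Allocate, attach the driver data, then undo the allocation if init fails. */
static struct toy_queue *toy_alloc_queue(void *queuedata, int fail_init)
{
	struct toy_queue *q = calloc(1, sizeof(*q));

	if (!q)
		return toy_err_ptr(-ENOMEM);
	q->queuedata = queuedata;
	if (fail_init) {
		free(q);	/* plays the role of blk_put_queue() */
		return toy_err_ptr(-EINVAL);
	}
	return q;
}
```

A caller checks `toy_is_err(q)` before dereferencing the result and uses `toy_ptr_err(q)` to recover the error code, in the same way __blk_mq_alloc_disk() checks IS_ERR(q) above.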
|
2015-03-13 11:56:02 +08:00
|
|
|
|
2022-06-19 14:05:51 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_destroy_queue - shut down a request queue
|
|
|
|
* @q: request queue to shut down
|
|
|
|
*
|
2024-02-13 15:34:19 +08:00
|
|
|
* This shuts down a request queue allocated by blk_mq_alloc_queue(). All future
|
2023-01-31 05:12:33 +08:00
|
|
|
* requests will be failed with -ENODEV. The caller is responsible for dropping
|
2024-02-13 15:34:19 +08:00
|
|
|
* the reference from blk_mq_alloc_queue() by calling blk_put_queue().
|
2022-06-19 14:05:51 +08:00
|
|
|
*
|
|
|
|
* Context: can sleep
|
|
|
|
*/
|
|
|
|
void blk_mq_destroy_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
WARN_ON_ONCE(!queue_is_mq(q));
|
|
|
|
WARN_ON_ONCE(blk_queue_registered(q));
|
|
|
|
|
|
|
|
might_sleep();
|
|
|
|
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_DYING, q);
|
|
|
|
blk_queue_start_drain(q);
|
2022-10-30 16:32:12 +08:00
|
|
|
blk_mq_freeze_queue_wait(q);
|
2022-06-19 14:05:51 +08:00
|
|
|
|
|
|
|
blk_sync_queue(q);
|
|
|
|
blk_mq_cancel_work_sync(q);
|
|
|
|
blk_mq_exit_queue(q);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_destroy_queue);
|
|
|
|
|
2024-02-13 15:34:20 +08:00
|
|
|
struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set,
|
|
|
|
struct queue_limits *lim, void *queuedata,
|
2021-08-16 21:19:05 +08:00
|
|
|
struct lock_class_key *lkclass)
|
2018-10-15 22:40:37 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2021-06-02 14:53:18 +08:00
|
|
|
struct gendisk *disk;
|
2018-10-15 22:40:37 +08:00
|
|
|
|
2024-02-13 15:34:20 +08:00
|
|
|
q = blk_mq_alloc_queue(set, lim, queuedata);
|
2021-06-02 14:53:18 +08:00
|
|
|
if (IS_ERR(q))
|
|
|
|
return ERR_CAST(q);
|
2018-10-15 22:40:37 +08:00
|
|
|
|
2021-08-16 21:19:08 +08:00
|
|
|
disk = __alloc_disk_node(q, set->numa_node, lkclass);
|
2021-06-02 14:53:18 +08:00
|
|
|
if (!disk) {
|
2022-07-20 21:05:40 +08:00
|
|
|
blk_mq_destroy_queue(q);
|
2022-10-18 21:57:17 +08:00
|
|
|
blk_put_queue(q);
|
2021-06-02 14:53:18 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2018-10-15 22:40:37 +08:00
|
|
|
}
|
2022-06-19 14:05:51 +08:00
|
|
|
set_bit(GD_OWNS_QUEUE, &disk->state);
|
2021-06-02 14:53:18 +08:00
|
|
|
return disk;
|
2018-10-15 22:40:37 +08:00
|
|
|
}
|
2021-06-02 14:53:18 +08:00
|
|
|
EXPORT_SYMBOL(__blk_mq_alloc_disk);
|
2018-10-15 22:40:37 +08:00
|
|
|
|
2022-06-19 14:05:51 +08:00
|
|
|
struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
|
|
|
|
struct lock_class_key *lkclass)
|
|
|
|
{
|
2022-11-22 15:27:53 +08:00
|
|
|
struct gendisk *disk;
|
|
|
|
|
2022-06-19 14:05:51 +08:00
|
|
|
if (!blk_get_queue(q))
|
|
|
|
return NULL;
|
2022-11-22 15:27:53 +08:00
|
|
|
disk = __alloc_disk_node(q, NUMA_NO_NODE, lkclass);
|
|
|
|
if (!disk)
|
|
|
|
blk_put_queue(q);
|
|
|
|
return disk;
|
2022-06-19 14:05:51 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_disk_for_queue);
|
|
|
|
|
2018-10-12 18:07:27 +08:00
|
|
|
static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
|
|
|
|
struct blk_mq_tag_set *set, struct request_queue *q,
|
|
|
|
int hctx_idx, int node)
|
|
|
|
{
|
2019-04-30 09:52:27 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = NULL, *tmp;
|
2018-10-12 18:07:27 +08:00
|
|
|
|
2019-04-30 09:52:27 +08:00
|
|
|
/* reuse dead hctx first */
|
|
|
|
spin_lock(&q->unused_hctx_lock);
|
|
|
|
list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
|
|
|
|
if (tmp->numa_node == node) {
|
|
|
|
hctx = tmp;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (hctx)
|
|
|
|
list_del_init(&hctx->hctx_list);
|
|
|
|
spin_unlock(&q->unused_hctx_lock);
|
|
|
|
|
|
|
|
if (!hctx)
|
|
|
|
hctx = blk_mq_alloc_hctx(q, set, node);
|
2018-10-12 18:07:27 +08:00
|
|
|
if (!hctx)
|
2019-04-30 09:52:26 +08:00
|
|
|
goto fail;
|
2018-10-12 18:07:27 +08:00
|
|
|
|
2019-04-30 09:52:26 +08:00
|
|
|
if (blk_mq_init_hctx(q, set, hctx, hctx_idx))
|
|
|
|
goto free_hctx;
|
2018-10-12 18:07:27 +08:00
|
|
|
|
|
|
|
return hctx;
|
2019-04-30 09:52:26 +08:00
|
|
|
|
|
|
|
free_hctx:
|
|
|
|
kobject_put(&hctx->kobj);
|
|
|
|
fail:
|
|
|
|
return NULL;
|
2018-10-12 18:07:27 +08:00
|
|
|
}
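blk_mq_alloc_and_init_hctx() first looks for a parked hctx on the requested NUMA node and allocates a fresh one only when none matches. The following user-space sketch shows that reuse-before-allocate lookup on a singly linked list; the `toy_` types and functions are hypothetical, and the spinlock that protects q->unused_hctx_list in the kernel is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct blk_mq_hw_ctx. */
struct toy_hctx {
	int numa_node;
	struct toy_hctx *next;
};

/* Unlink and return the first parked entry on 'node', or NULL if none matches. */
static struct toy_hctx *toy_take_unused(struct toy_hctx **head, int node)
{
	struct toy_hctx **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if ((*pp)->numa_node == node) {
			struct toy_hctx *found = *pp;

			*pp = found->next;
			found->next = NULL;
			return found;
		}
	}
	return NULL;
}

/* Reuse a parked entry when possible, otherwise allocate a new one. */
static struct toy_hctx *toy_get_hctx(struct toy_hctx **head, int node)
{
	struct toy_hctx *hctx = toy_take_unused(head, node);

	if (!hctx) {
		hctx = calloc(1, sizeof(*hctx));
		if (hctx)
			hctx->numa_node = node;
	}
	return hctx;
}
```

Matching on the node keeps a reused context's memory local to the CPUs that will drive it, which appears to be the reason the kernel code compares tmp->numa_node with node before reusing an entry.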
|
|
|
|
|
2015-12-18 08:08:14 +08:00
|
|
|
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2022-03-08 15:32:19 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
unsigned long i, j;
|
2019-10-26 00:50:09 +08:00
|
|
|
|
2018-01-06 16:27:40 +08:00
|
|
|
/* protect against switching io scheduler */
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
2014-04-16 04:14:00 +08:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++) {
|
2022-03-08 15:32:15 +08:00
|
|
|
int old_node;
|
2022-03-08 15:32:14 +08:00
|
|
|
int node = blk_mq_get_hctx_node(set, i);
|
2022-03-08 15:32:19 +08:00
|
|
|
struct blk_mq_hw_ctx *old_hctx = xa_load(&q->hctx_table, i);
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2022-03-08 15:32:15 +08:00
|
|
|
if (old_hctx) {
|
|
|
|
old_node = old_hctx->numa_node;
|
|
|
|
blk_mq_exit_hctx(q, set, old_hctx, i);
|
|
|
|
}
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
if (!blk_mq_alloc_and_init_hctx(set, q, i, node)) {
|
2022-03-08 15:32:15 +08:00
|
|
|
if (!old_hctx)
|
2018-10-12 18:07:27 +08:00
|
|
|
break;
|
2022-03-08 15:32:15 +08:00
|
|
|
pr_warn("Allocate new hctx on node %d fails, fallback to previous one on node %d\n",
|
|
|
|
node, old_node);
|
2022-03-08 15:32:19 +08:00
|
|
|
hctx = blk_mq_alloc_and_init_hctx(set, q, i, old_node);
|
|
|
|
WARN_ON_ONCE(!hctx);
|
2015-12-18 08:08:14 +08:00
|
|
|
}
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increase nr_hw_queues, we may fail due to a
shortage of memory or some other reason; blk_mq_realloc_hw_ctxs then stops
and some entries in q->queue_hw_ctx are left NULL. However,
because the queue map has already been updated with the new nr_hw_queues,
some CPUs are mapped to a hw queue that has just hit the allocation failure,
so blk_mq_map_queue could return NULL. This causes a panic in the
following blk_mq_map_swqueue.
To fix it, when increasing nr_hw_queues fails, fall back to the previous
nr_hw_queues and post a warning. At the same time, a driver's .map_queues
usually uses completion irq affinity to map hw queues to CPUs, so falling
back to the previous nr_hw_queues would leave some CPUs without a mapping
to a hw queue; use the default blk_mq_map_queues to do that instead.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
/*
|
|
|
|
* If increasing nr_hw_queues failed, free the newly allocated
|
|
|
|
* hctxs and keep the previous q->nr_hw_queues.
|
|
|
|
*/
|
|
|
|
if (i != set->nr_hw_queues) {
|
|
|
|
j = q->nr_hw_queues;
|
|
|
|
} else {
|
|
|
|
j = i;
|
|
|
|
q->nr_hw_queues = set->nr_hw_queues;
|
|
|
|
}
|
2018-10-12 18:07:27 +08:00
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
xa_for_each_start(&q->hctx_table, j, hctx, j)
|
|
|
|
blk_mq_exit_hctx(q, set, hctx, j);
|
2018-01-06 16:27:40 +08:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
2015-12-18 08:08:14 +08:00
|
|
|
}
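The tail of blk_mq_realloc_hw_ctxs() decides how many hardware queues survive an update: a failure on a slot that already had an hctx falls back to the previous NUMA node and the loop continues, while a failure on a brand-new slot stops the loop and keeps the old count. The sketch below models only that counting logic; `toy_realloc`, `TOY_NO_FAILURE` and the `fail_at` parameter are hypothetical, the sketch assumes the fallback allocation on the old node succeeds, and no real allocation happens.

```c
#include <assert.h>

#define TOY_NO_FAILURE ((unsigned long)-1)

/*
 * Returns the queue count after the update. 'fail_at' is the index whose
 * allocation fails (TOY_NO_FAILURE for none). '*first_torn_down' receives
 * the first index whose hctx would be exited afterwards.
 */
static unsigned long toy_realloc(unsigned long old_nr, unsigned long new_nr,
				 unsigned long fail_at,
				 unsigned long *first_torn_down)
{
	unsigned long i, nr;

	for (i = 0; i < new_nr; i++) {
		/*
		 * Slots below old_nr can fall back to their previous node,
		 * so only a failure on a brand-new slot stops the loop.
		 */
		if (i == fail_at && i >= old_nr)
			break;
	}

	/* Stopping early means the increase failed: keep the old count. */
	nr = (i != new_nr) ? old_nr : new_nr;
	*first_torn_down = nr;
	return nr;
}
```

Growing from 2 to 4 queues with a failure at index 3 therefore leaves the count at 2, and everything from index 2 upward is torn down again.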
|
|
|
|
|
2022-03-08 15:32:16 +08:00
|
|
|
static void blk_mq_update_poll_flag(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
|
|
|
|
if (set->nr_maps > HCTX_TYPE_POLL &&
|
|
|
|
set->map[HCTX_TYPE_POLL].nr_queues)
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_POLL, q);
|
|
|
|
else
|
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
|
|
|
|
}
|
|
|
|
|
2021-06-02 14:53:17 +08:00
|
|
|
int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
2015-12-18 08:08:14 +08:00
|
|
|
{
|
2016-02-12 15:27:00 +08:00
|
|
|
/* mark the queue as mq asap */
|
|
|
|
q->mq_ops = set->ops;
|
|
|
|
|
2018-11-20 09:44:35 +08:00
|
|
|
if (blk_mq_alloc_ctxs(q))
|
2023-03-21 03:49:26 +08:00
|
|
|
goto err_exit;
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2017-02-22 18:13:59 +08:00
|
|
|
/* init q->mq_kobj and sw queues' kobjects */
|
|
|
|
blk_mq_sysfs_init(q);
|
|
|
|
|
2019-04-30 09:52:27 +08:00
|
|
|
INIT_LIST_HEAD(&q->unused_hctx_list);
|
|
|
|
spin_lock_init(&q->unused_hctx_lock);
|
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
xa_init(&q->hctx_table);
|
|
|
|
|
2015-12-18 08:08:14 +08:00
|
|
|
blk_mq_realloc_hw_ctxs(set, q);
|
|
|
|
if (!q->nr_hw_queues)
|
|
|
|
goto err_hctxs;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2015-10-30 20:57:30 +08:00
|
|
|
INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
|
2015-07-16 19:53:22 +08:00
|
|
|
blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2018-10-17 04:23:06 +08:00
|
|
|
q->tag_set = set;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2013-11-20 00:25:07 +08:00
|
|
|
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
|
2022-03-08 15:32:16 +08:00
|
|
|
blk_mq_update_poll_flag(q);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-09-15 01:28:30 +08:00
|
|
|
INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
|
2023-05-19 12:40:50 +08:00
|
|
|
INIT_LIST_HEAD(&q->flush_list);
|
2014-05-28 22:08:02 +08:00
|
|
|
INIT_LIST_HEAD(&q->requeue_list);
|
|
|
|
spin_lock_init(&q->requeue_lock);
|
|
|
|
|
2014-05-21 05:17:27 +08:00
|
|
|
q->nr_requests = set->queue_depth;
|
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
|
2014-05-14 05:10:52 +08:00
|
|
|
blk_mq_add_queue_tag_set(set, q);
|
2017-06-26 18:20:57 +08:00
|
|
|
blk_mq_map_swqueue(q);
|
2021-06-02 14:53:17 +08:00
|
|
|
return 0;
|
2014-02-11 00:29:00 +08:00
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
err_hctxs:
|
blk-mq: Fix kmemleak in blk_mq_init_allocated_queue
There is a kmemleak caused by modprobe null_blk.ko
unreferenced object 0xffff8881acb1f000 (size 1024):
comm "modprobe", pid 836, jiffies 4294971190 (age 27.068s)
hex dump (first 32 bytes):
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
ff ff ff ff ff ff ff ff 00 53 99 9e ff ff ff ff .........S......
backtrace:
[<000000004a10c249>] kmalloc_node_trace+0x22/0x60
[<00000000648f7950>] blk_mq_alloc_and_init_hctx+0x289/0x350
[<00000000af06de0e>] blk_mq_realloc_hw_ctxs+0x2fe/0x3d0
[<00000000e00c1872>] blk_mq_init_allocated_queue+0x48c/0x1440
[<00000000d16b4e68>] __blk_mq_alloc_disk+0xc8/0x1c0
[<00000000d10c98c3>] 0xffffffffc450d69d
[<00000000b9299f48>] 0xffffffffc4538392
[<0000000061c39ed6>] do_one_initcall+0xd0/0x4f0
[<00000000b389383b>] do_init_module+0x1a4/0x680
[<0000000087cf3542>] load_module+0x6249/0x7110
[<00000000beba61b8>] __do_sys_finit_module+0x140/0x200
[<00000000fdcfff51>] do_syscall_64+0x35/0x80
[<000000003c0f1f71>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
That is because q->mq_ops is set to NULL before blk_release_queue is
called.
blk_mq_init_queue_data
  blk_mq_init_allocated_queue
    blk_mq_realloc_hw_ctxs
      for (i = 0; i < set->nr_hw_queues; i++) {
        old_hctx = xa_load(&q->hctx_table, i);
        if (!blk_mq_alloc_and_init_hctx(.., i, ..))    [1]
          if (!old_hctx)
            break;
      xa_for_each_start(&q->hctx_table, j, hctx, j)
        blk_mq_exit_hctx(q, set, hctx, j);             [2]
    if (!q->nr_hw_queues)                              [3]
      goto err_hctxs;
  err_exit:
    q->mq_ops = NULL;                                  [4]
  blk_put_queue
    blk_release_queue
      if (queue_is_mq(q))                              [5]
        blk_mq_release(q);
[1]: blk_mq_alloc_and_init_hctx failed at i != 0.
[2]: The hctxs allocated by [1] are moved to q->unused_hctx_list and
will be cleaned up in blk_mq_release.
[3]: q->nr_hw_queues is 0.
[4]: Set q->mq_ops to NULL.
[5]: queue_is_mq returns false due to [4]. And blk_mq_release
will not be called. The hctxs in q->unused_hctx_list are leaked.
To fix it, call blk_mq_release in the exception path.
Fixes: 2f8f1336a48b ("blk-mq: always free hctx after request queue is freed")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221031031242.94107-1-chenjun102@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
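The leak walked through in steps [1] to [5] of the commit message above can be modeled in user space. The following is only a sketch under stated assumptions: every toy_* name is hypothetical and none of it is kernel code. It shows why clearing the "is mq" marker before the final release skips the cleanup of parked hctxs, and how releasing them in the error path first avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hardware context parked for later freeing. */
struct toy_hctx {
	struct toy_hctx *next;
};

/* Hypothetical stand-in for struct request_queue. */
struct toy_queue {
	const void *mq_ops;            /* non-NULL means "this is an mq queue" */
	struct toy_hctx *unused_hctxs; /* models q->unused_hctx_list */
};

static int toy_live_hctxs; /* allocations that have not been freed yet */

/* Models step [2]: an hctx ends up on the unused list. */
static void toy_park_hctx(struct toy_queue *q)
{
	struct toy_hctx *h = malloc(sizeof(*h));

	if (!h)
		return;
	h->next = q->unused_hctxs;
	q->unused_hctxs = h;
	toy_live_hctxs++;
}

/* Models blk_mq_release(): frees everything parked on the unused list. */
static void toy_mq_release(struct toy_queue *q)
{
	while (q->unused_hctxs) {
		struct toy_hctx *h = q->unused_hctxs;

		q->unused_hctxs = h->next;
		free(h);
		toy_live_hctxs--;
	}
	assert(q->unused_hctxs == NULL);
}

/* Models step [5]: the final release only cleans up mq queues. */
static void toy_release_queue(struct toy_queue *q)
{
	if (q->mq_ops)
		toy_mq_release(q);
}

/* Runs the failing init path and returns the number of leaked hctxs. */
static int toy_init_fail(bool release_in_error_path)
{
	static const int toy_ops;
	struct toy_queue q = { .mq_ops = &toy_ops, .unused_hctxs = NULL };

	toy_live_hctxs = 0;
	toy_park_hctx(&q);              /* step [2] */
	if (release_in_error_path)
		toy_mq_release(&q);     /* the fix: release before err_exit */
	q.mq_ops = NULL;                /* step [4] */
	toy_release_queue(&q);          /* step [5]: sees a non-mq queue */
	return toy_live_hctxs;
}
```

Calling toy_init_fail(false) models the original error path, and toy_init_fail(true) models the path with the release added before the mq marker is cleared.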
2022-10-31 11:12:42 +08:00
|
|
|
blk_mq_release(q);
|
2016-05-26 14:23:27 +08:00
|
|
|
err_exit:
|
|
|
|
q->mq_ops = NULL;
|
2021-06-02 14:53:17 +08:00
|
|
|
return -ENOMEM;
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2015-03-13 11:56:02 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_init_allocated_queue);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
blk-mq: free hw queue's resource in hctx's release handler
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.
However, that commit introduces another issue. Before 45a9c9d909b2,
we were allowed to run the queue while it was being cleaned up, as long
as the queue's kobj refcount was held. After that commit, the queue can't
be run during cleanup; otherwise an oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
We have invented ways for addressing this kind of issue before, such as:
8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
But these still can't cover all cases; recently James reported another
issue of this kind:
https://marc.info/?l=linux-scsi&m=155389088124782&w=2
This issue can be quite hard to address with the previous approaches, given
that scsi_run_queue() may run requeues for other LUNs.
Fix the above issue by freeing hctx's resources in its release handler. This
is safe because tags aren't needed for freeing such hctx resources.
This approach follows the typical design pattern wrt. kobject's release handler.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
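The release-handler pattern this commit message refers to can be sketched in user space as follows. This is a minimal model, assuming a hand-rolled reference count in place of the kernel's kobject; every toy_* name is hypothetical. The point it illustrates is that an object's own resources stay valid until the last reference is dropped, because they are freed only in the release callback.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted base object, standing in for a kobject. */
struct toy_obj {
	int refcount;
	void (*release)(struct toy_obj *obj);
};

/* Hypothetical hardware context that owns a per-context array. */
struct toy_hctx {
	struct toy_obj obj; /* must stay the first member for the cast below */
	int *ctxs;
};

static int toy_released_count; /* how many times the release handler ran */

/* Release handler: the only place where the hctx resources are freed. */
static void toy_hctx_release(struct toy_obj *obj)
{
	struct toy_hctx *hctx = (struct toy_hctx *)obj;

	free(hctx->ctxs);
	free(hctx);
	toy_released_count++;
}

/* Allocates an hctx holding one initial reference; NULL on failure. */
static struct toy_hctx *toy_hctx_alloc(size_t nr_ctx)
{
	struct toy_hctx *hctx = calloc(1, sizeof(*hctx));

	if (!hctx)
		return NULL;
	hctx->ctxs = calloc(nr_ctx, sizeof(*hctx->ctxs));
	if (!hctx->ctxs) {
		free(hctx);
		return NULL;
	}
	hctx->obj.refcount = 1;
	hctx->obj.release = toy_hctx_release;
	return hctx;
}

static void toy_get(struct toy_obj *obj)
{
	obj->refcount++;
}

/* Drops one reference; the release handler runs when the last one goes. */
static void toy_put(struct toy_obj *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount == 0)
		obj->release(obj);
}
```

In this model, a caller that still holds a reference (for example, code running the queue) can keep touching hctx->ctxs after the teardown path has dropped its own reference; the memory goes away only on the final toy_put(). The sketch is single-threaded; real kernel code uses atomic reference counting.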
2019-04-30 09:52:25 +08:00
|
|
|
/* tags can _not_ be used after returning from blk_mq_exit_queue */
|
|
|
|
void blk_mq_exit_queue(struct request_queue *q)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-05-14 01:15:29 +08:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-05-14 01:15:29 +08:00
|
|
|
/* Checks hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED. */
|
2014-05-27 23:35:13 +08:00
|
|
|
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
|
2021-05-14 01:15:29 +08:00
|
|
|
/* May clear BLK_MQ_F_TAG_QUEUE_SHARED in hctx->flags. */
|
|
|
|
blk_mq_del_queue_tag_set(q);
|
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2014-09-10 23:02:03 +08:00
|
|
|
static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
|
|
|
set->shared_tags = blk_mq_alloc_map_and_rqs(set,
|
2021-10-05 18:23:37 +08:00
|
|
|
BLK_MQ_NO_HCTX_IDX,
|
|
|
|
set->queue_depth);
|
2021-10-05 18:23:39 +08:00
|
|
|
if (!set->shared_tags)
|
2021-10-05 18:23:37 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
We found that blk_mq_alloc_rq_maps() takes considerable time in kernel space
when testing nvme device hot-plugging. The test and analysis are as below.
Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;
set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);
2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
}
Testing nvme hot-plugging with the above debug code, we found that allocating
rqs for all 16 hw queues with depth 1023 costs more than 3ms in total in
kernel space without being scheduled out; each hw queue costs about 140-250us.
The cost increases with the number of hw queues and the queue depth. In an
extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will retry with
"queue_depth >>= 1", and even more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1
So use this patch to call cond_resched() after each hw queue init, so that
other threads do not get stuck. __blk_mq_alloc_rq_maps() is not executed in
atomic context, so it is safe to call cond_resched() there.
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-26 10:39:47 +08:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++) {
|
2021-10-05 18:23:35 +08:00
|
|
|
if (!__blk_mq_alloc_map_and_rqs(set, i))
|
2014-09-10 23:02:03 +08:00
|
|
|
goto out_unwind;
|
2020-09-26 10:39:47 +08:00
|
|
|
cond_resched();
|
|
|
|
}
|
2014-09-10 23:02:03 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_unwind:
|
|
|
|
while (--i >= 0)
|
2021-10-05 18:23:37 +08:00
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
|
|
|
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
|
|
|
blk_mq_free_map_and_rqs(set, set->shared_tags,
|
2021-10-05 18:23:37 +08:00
|
|
|
BLK_MQ_NO_HCTX_IDX);
|
2021-10-05 18:23:36 +08:00
|
|
|
}
|
2014-09-10 23:02:03 +08:00
|
|
|
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate the request maps associated with this tag_set. Note that this
|
|
|
|
* may reduce the depth asked for, if memory is tight. set->queue_depth
|
|
|
|
* will be updated to reflect the allocated depth.
|
|
|
|
*/
|
2021-10-05 18:23:35 +08:00
|
|
|
static int blk_mq_alloc_set_map_and_rqs(struct blk_mq_tag_set *set)
|
2014-09-10 23:02:03 +08:00
|
|
|
{
|
|
|
|
unsigned int depth;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
depth = set->queue_depth;
|
|
|
|
do {
|
|
|
|
err = __blk_mq_alloc_rq_maps(set);
|
|
|
|
if (!err)
|
|
|
|
break;
|
|
|
|
|
|
|
|
set->queue_depth >>= 1;
|
|
|
|
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
} while (set->queue_depth);
|
|
|
|
|
|
|
|
if (!set->queue_depth || err) {
|
|
|
|
pr_err("blk-mq: failed to allocate request map\n");
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (depth != set->queue_depth)
|
|
|
|
pr_info("blk-mq: reduced tag depth (%u -> %u)\n",
|
|
|
|
depth, set->queue_depth);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
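The depth-halving retry in blk_mq_alloc_set_map_and_rqs() can be illustrated with a small user-space sketch. This is not the kernel code: `struct depth_sketch_set`, `SKETCH_TAG_MIN`, the `budget` field and `sketch_alloc_all()` are hypothetical stand-ins (the real allocation is done by __blk_mq_alloc_rq_maps()), chosen only so that the control flow of the loop can be exercised outside the kernel.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for BLK_MQ_TAG_MIN; the kernel value is not assumed here. */
#define SKETCH_TAG_MIN 1u

/* Hypothetical, reduced view of a tag set. */
struct depth_sketch_set {
	unsigned int queue_depth;
	unsigned int reserved_tags;
	unsigned int nr_hw_queues;
	unsigned int budget;	/* assumption: total requests that can be afforded */
};

/* Pretend allocation: succeeds only if depth * nr_hw_queues fits in the budget. */
static int sketch_alloc_all(const struct depth_sketch_set *set)
{
	unsigned long long need =
		(unsigned long long)set->queue_depth * set->nr_hw_queues;

	return need <= set->budget ? 0 : -ENOMEM;
}

/* Same loop shape as the function above: halve the depth until it fits. */
static int sketch_alloc_with_fallback(struct depth_sketch_set *set)
{
	int err;

	assert(set->nr_hw_queues > 0);

	do {
		err = sketch_alloc_all(set);
		if (!err)
			break;

		set->queue_depth >>= 1;
		if (set->queue_depth < set->reserved_tags + SKETCH_TAG_MIN) {
			err = -ENOMEM;
			break;
		}
	} while (set->queue_depth);

	if (!set->queue_depth || err)
		return -ENOMEM;
	return 0;
}
```

Under these assumptions, a set that asks for depth 1023 on 16 queues with a budget of 5000 requests fails at 1023 and 511 and settles at 255, which is the situation the "reduced tag depth" message reports.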
|
|
|
|
|
2022-08-16 01:00:43 +08:00
|
|
|
static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
|
2017-04-07 22:53:11 +08:00
|
|
|
{
|
2020-03-10 12:26:17 +08:00
|
|
|
/*
|
|
|
|
* blk_mq_map_queues() and multiple .map_queues() implementations
|
|
|
|
* expect that set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the
|
|
|
|
* number of hardware queues.
|
|
|
|
*/
|
|
|
|
if (set->nr_maps == 1)
|
|
|
|
set->map[HCTX_TYPE_DEFAULT].nr_queues = set->nr_hw_queues;
|
|
|
|
|
2024-02-28 12:08:57 +08:00
|
|
|
if (set->ops->map_queues) {
|
2018-10-31 00:36:06 +08:00
|
|
|
int i;
|
|
|
|
|
2018-01-06 16:27:39 +08:00
|
|
|
/*
|
|
|
|
* transport .map_queues is usually done in the following
|
|
|
|
* way:
|
|
|
|
*
|
|
|
|
* for (queue = 0; queue < set->nr_hw_queues; queue++) {
|
|
|
|
* mask = get_cpu_mask(queue)
|
|
|
|
* for_each_cpu(cpu, mask)
|
2018-10-31 00:36:06 +08:00
|
|
|
* set->map[x].mq_map[cpu] = queue;
|
2018-01-06 16:27:39 +08:00
|
|
|
* }
|
|
|
|
*
|
|
|
|
* When we need to remap, the table has to be cleared for
|
|
|
|
* killing stale mapping since one CPU may not be mapped
|
|
|
|
* to any hw queue.
|
|
|
|
*/
|
2018-10-31 00:36:06 +08:00
|
|
|
for (i = 0; i < set->nr_maps; i++)
|
|
|
|
blk_mq_clear_mq_map(&set->map[i]);
|
2018-01-06 16:27:39 +08:00
|
|
|
|
2022-08-16 01:00:43 +08:00
|
|
|
set->ops->map_queues(set);
|
2018-10-31 00:36:06 +08:00
|
|
|
} else {
|
|
|
|
BUG_ON(set->nr_maps > 1);
|
2022-08-16 01:00:43 +08:00
|
|
|
blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
|
2018-10-31 00:36:06 +08:00
|
|
|
}
|
2017-04-07 22:53:11 +08:00
|
|
|
}
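The comment in blk_mq_update_queue_map() explains why the table is cleared before a remap: a driver's .map_queues may leave some CPUs unassigned, and those entries would otherwise keep a stale queue index. A minimal user-space sketch of that effect follows. `SKETCH_NR_CPUS`, `sketch_clear_map()` and `sketch_map_queues()` are hypothetical names, the bitmask representation of CPU masks is a simplification, and the sketch assumes that clearing resets every entry to queue 0.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4u	/* assumption: a tiny fixed CPU count */

/* Assumed behaviour of clearing: every CPU points at queue 0 again. */
static void sketch_clear_map(unsigned int *mq_map)
{
	unsigned int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		mq_map[cpu] = 0;
}

/*
 * Hypothetical transport mapping in the style the comment describes:
 * queue q serves exactly the CPUs whose bit is set in masks[q].
 * CPUs that appear in no mask are left untouched.
 */
static void sketch_map_queues(unsigned int *mq_map,
			      const unsigned int *masks,
			      unsigned int nr_queues)
{
	unsigned int queue, cpu;

	assert(mq_map != NULL && masks != NULL);

	for (queue = 0; queue < nr_queues; queue++)
		for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
			if (masks[queue] & (1u << cpu))
				mq_map[cpu] = queue;
}
```

If the table previously mapped CPUs 2 and 3 to queues 2 and 3 and the new layout has only two queues covering CPUs 0 and 1, remapping without a clear leaves CPUs 2 and 3 pointing at queues that no longer exist; clearing first sends them to queue 0.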
|
|
|
|
|
2019-10-26 00:50:10 +08:00
|
|
|
static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
|
2022-11-09 18:08:11 +08:00
|
|
|
int new_nr_hw_queues)
|
2019-10-26 00:50:10 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_tags **new_tags;
|
2023-08-21 17:56:00 +08:00
|
|
|
int i;
|
2019-10-26 00:50:10 +08:00
|
|
|
|
2023-09-08 08:57:02 +08:00
|
|
|
if (set->nr_hw_queues >= new_nr_hw_queues)
|
2022-11-22 16:49:17 +08:00
|
|
|
goto done;
|
2019-10-26 00:50:10 +08:00
|
|
|
|
|
|
|
new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
|
|
|
|
GFP_KERNEL, set->numa_node);
|
|
|
|
if (!new_tags)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
if (set->tags)
|
2022-11-09 18:08:11 +08:00
|
|
|
memcpy(new_tags, set->tags, set->nr_hw_queues *
|
2019-10-26 00:50:10 +08:00
|
|
|
sizeof(*set->tags));
|
|
|
|
kfree(set->tags);
|
|
|
|
set->tags = new_tags;
|
2023-08-21 17:56:02 +08:00
|
|
|
|
|
|
|
for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
|
|
|
|
if (!__blk_mq_alloc_map_and_rqs(set, i)) {
|
|
|
|
while (--i >= set->nr_hw_queues)
|
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
|
2022-11-22 16:49:17 +08:00
|
|
|
done:
|
2019-10-26 00:50:10 +08:00
|
|
|
set->nr_hw_queues = new_nr_hw_queues;
|
|
|
|
return 0;
|
|
|
|
}
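The grow-and-unwind pattern used by blk_mq_realloc_tag_set_tags() can be tried in user space with the sketch below. It is an illustration under stated assumptions, not the kernel implementation: `struct grow_sketch_set`, `struct sketch_tags`, `sketch_alloc_one()` and the `fail_at` fault-injection field are hypothetical, and malloc/calloc/free stand in for the kernel allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct sketch_tags { int idx; };

struct grow_sketch_set {
	struct sketch_tags **tags;
	int nr_hw_queues;
	int fail_at;	/* hypothetical fault injection: index that fails, or -1 */
};

static struct sketch_tags *sketch_alloc_one(struct grow_sketch_set *set, int i)
{
	struct sketch_tags *t;

	if (i == set->fail_at)
		return NULL;
	t = malloc(sizeof(*t));
	if (t)
		t->idx = i;
	return t;
}

static int sketch_grow(struct grow_sketch_set *set, int new_nr)
{
	struct sketch_tags **new_tags;
	int i;

	assert(new_nr > 0);

	/* Shrinking (or no change) needs no new pointer array. */
	if (set->nr_hw_queues >= new_nr)
		goto done;

	new_tags = calloc((size_t)new_nr, sizeof(*new_tags));
	if (!new_tags)
		return -ENOMEM;

	/* Carry the existing entries over, then swap the arrays. */
	if (set->tags)
		memcpy(new_tags, set->tags,
		       (size_t)set->nr_hw_queues * sizeof(*set->tags));
	free(set->tags);
	set->tags = new_tags;

	for (i = set->nr_hw_queues; i < new_nr; i++) {
		set->tags[i] = sketch_alloc_one(set, i);
		if (!set->tags[i]) {
			/* Undo only the entries added by this call. */
			while (--i >= set->nr_hw_queues) {
				free(set->tags[i]);
				set->tags[i] = NULL;
			}
			return -ENOMEM;
		}
	}

done:
	set->nr_hw_queues = new_nr;
	return 0;
}
```

On failure the old entries and the old count are kept, so a caller can continue with the previous number of queues. The sketch does not free entries beyond the new count when shrinking; it only mirrors the control flow shown above.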
|
|
|
|
|
2014-06-06 05:21:56 +08:00
|
|
|
/*
|
|
|
|
* Alloc a tag set to be associated with one or more request queues.
|
|
|
|
* May fail with EINVAL for various error conditions. May adjust the
|
2018-06-30 21:12:41 +08:00
|
|
|
* requested depth down, if it's too large. In that case, the set
|
2014-06-06 05:21:56 +08:00
|
|
|
* value will be stored in set->queue_depth.
|
|
|
|
*/
|
2014-04-16 04:14:00 +08:00
|
|
|
int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
|
|
|
|
{
|
2018-10-31 00:36:06 +08:00
|
|
|
int i, ret;
|
2016-09-14 22:18:55 +08:00
|
|
|
|
2014-10-30 21:45:11 +08:00
|
|
|
BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
|
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
if (!set->nr_hw_queues)
|
|
|
|
return -EINVAL;
|
2014-06-06 05:21:56 +08:00
|
|
|
if (!set->queue_depth)
|
2014-04-16 04:14:00 +08:00
|
|
|
return -EINVAL;
|
|
|
|
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2016-09-14 22:18:54 +08:00
|
|
|
if (!set->ops->queue_rq)
|
2014-04-16 04:14:00 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2017-10-14 17:22:29 +08:00
|
|
|
if (!set->ops->get_budget ^ !set->ops->put_budget)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2014-06-06 05:21:56 +08:00
|
|
|
if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
|
|
|
|
pr_info("blk-mq: reduced tag depth to %u\n",
|
|
|
|
BLK_MQ_MAX_DEPTH);
|
|
|
|
set->queue_depth = BLK_MQ_MAX_DEPTH;
|
|
|
|
}
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2018-10-31 00:36:06 +08:00
|
|
|
if (!set->nr_maps)
|
|
|
|
set->nr_maps = 1;
|
|
|
|
else if (set->nr_maps > HCTX_MAX_TYPES)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2014-12-01 08:00:58 +08:00
|
|
|
/*
|
|
|
|
* If a crashdump is active, then we are potentially in a very
|
2024-02-28 12:08:57 +08:00
|
|
|
* memory constrained environment. Limit us to 64 tags to prevent
|
|
|
|
* using too much memory.
|
2014-12-01 08:00:58 +08:00
|
|
|
*/
|
2024-02-28 12:08:57 +08:00
|
|
|
if (is_kdump_kernel())
|
2014-12-01 08:00:58 +08:00
|
|
|
set->queue_depth = min(64U, set->queue_depth);
|
2024-02-28 12:08:57 +08:00
|
|
|
|
2015-12-18 08:08:14 +08:00
|
|
|
/*
|
2018-10-30 03:25:27 +08:00
|
|
|
* There is no use for more h/w queues than cpus if we just have
|
|
|
|
* a single map
|
2015-12-18 08:08:14 +08:00
|
|
|
*/
|
2018-10-30 03:25:27 +08:00
|
|
|
if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
|
2015-12-18 08:08:14 +08:00
|
|
|
set->nr_hw_queues = nr_cpu_ids;
|
2014-12-01 08:00:58 +08:00
|
|
|
|
2022-11-01 23:00:47 +08:00
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING) {
|
|
|
|
set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
|
|
|
|
if (!set->srcu)
|
|
|
|
return -ENOMEM;
|
|
|
|
ret = init_srcu_struct(set->srcu);
|
|
|
|
if (ret)
|
|
|
|
goto out_free_srcu;
|
|
|
|
}
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2016-09-14 22:18:55 +08:00
|
|
|
ret = -ENOMEM;
|
2022-11-09 18:08:10 +08:00
|
|
|
set->tags = kcalloc_node(set->nr_hw_queues,
|
|
|
|
sizeof(struct blk_mq_tags *), GFP_KERNEL,
|
|
|
|
set->numa_node);
|
|
|
|
if (!set->tags)
|
2022-11-01 23:00:47 +08:00
|
|
|
goto out_cleanup_srcu;
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2018-10-31 00:36:06 +08:00
|
|
|
for (i = 0; i < set->nr_maps; i++) {
|
|
|
|
set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
|
2018-12-17 18:42:45 +08:00
|
|
|
sizeof(set->map[i].mq_map[0]),
|
2018-10-31 00:36:06 +08:00
|
|
|
GFP_KERNEL, set->numa_node);
|
|
|
|
if (!set->map[i].mq_map)
|
|
|
|
goto out_free_mq_map;
|
2024-02-28 12:08:57 +08:00
|
|
|
set->map[i].nr_queues = set->nr_hw_queues;
|
2018-10-31 00:36:06 +08:00
|
|
|
}
|
2016-09-14 22:18:53 +08:00
|
|
|
|
2022-08-16 01:00:43 +08:00
|
|
|
blk_mq_update_queue_map(set);
|
2016-09-14 22:18:55 +08:00
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
ret = blk_mq_alloc_set_map_and_rqs(set);
|
2016-09-14 22:18:55 +08:00
|
|
|
if (ret)
|
2016-09-14 22:18:53 +08:00
|
|
|
goto out_free_mq_map;
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2014-05-14 05:10:52 +08:00
|
|
|
mutex_init(&set->tag_list_lock);
|
|
|
|
INIT_LIST_HEAD(&set->tag_list);
|
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
return 0;
|
2016-09-14 22:18:53 +08:00
|
|
|
|
|
|
|
out_free_mq_map:
|
2018-10-31 00:36:06 +08:00
|
|
|
for (i = 0; i < set->nr_maps; i++) {
|
|
|
|
kfree(set->map[i].mq_map);
|
|
|
|
set->map[i].mq_map = NULL;
|
|
|
|
}
|
2014-09-03 00:38:44 +08:00
|
|
|
kfree(set->tags);
|
|
|
|
set->tags = NULL;
|
2022-11-01 23:00:47 +08:00
|
|
|
out_cleanup_srcu:
|
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING)
|
|
|
|
cleanup_srcu_struct(set->srcu);
|
|
|
|
out_free_srcu:
|
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING)
|
|
|
|
kfree(set->srcu);
|
2016-09-14 22:18:55 +08:00
|
|
|
return ret;
|
2014-04-16 04:14:00 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_tag_set);
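One of the validation checks in blk_mq_alloc_tag_set() is `!set->ops->get_budget ^ !set->ops->put_budget`, which rejects a driver that provides only one of the two callbacks. The sketch below isolates that idiom. `struct sketch_ops` and the helper names are hypothetical, and the callback signatures are simplified rather than taken from struct blk_mq_ops.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified callback table. */
struct sketch_ops {
	int  (*get_budget)(void);
	void (*put_budget)(void);
};

/*
 * '!' turns each pointer into 0 or 1, so the XOR is 1 exactly when one
 * callback is set and the other is missing. Both set or both NULL pass.
 */
static int sketch_check_budget_ops(const struct sketch_ops *ops)
{
	assert(ops != NULL);

	if (!ops->get_budget ^ !ops->put_budget)
		return -EINVAL;
	return 0;
}

/* Trivial callbacks used only to populate the table in examples. */
static int sketch_get(void)
{
	return 0;
}

static void sketch_put(void)
{
}
```

Writing the test as an XOR of the negated pointers keeps the "both or neither" rule in a single condition instead of two separate comparisons.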
|
|
|
|
|
2021-06-02 14:53:16 +08:00
|
|
|
/* allocate and initialize a tagset for a simple single-queue device */
|
|
|
|
int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
|
|
|
|
const struct blk_mq_ops *ops, unsigned int queue_depth,
|
|
|
|
unsigned int set_flags)
|
|
|
|
{
|
|
|
|
memset(set, 0, sizeof(*set));
|
|
|
|
set->ops = ops;
|
|
|
|
set->nr_hw_queues = 1;
|
|
|
|
set->nr_maps = 1;
|
|
|
|
set->queue_depth = queue_depth;
|
|
|
|
set->numa_node = NUMA_NO_NODE;
|
|
|
|
set->flags = set_flags;
|
|
|
|
return blk_mq_alloc_tag_set(set);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_alloc_sq_tag_set);
|
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
|
|
|
|
{
|
2018-10-31 00:36:06 +08:00
|
|
|
int i, j;
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2019-10-26 00:50:10 +08:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++)
|
2021-10-05 18:23:37 +08:00
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
2014-05-22 04:01:15 +08:00
|
|
|
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
|
|
|
blk_mq_free_map_and_rqs(set, set->shared_tags,
|
2021-10-05 18:23:37 +08:00
|
|
|
BLK_MQ_NO_HCTX_IDX);
|
|
|
|
}
|
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queues are
required to be mapped to those of the blk-mq hctxs; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 23:20:24 +08:00
|
|
|
|
2018-10-31 00:36:06 +08:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
|
|
|
kfree(set->map[j].mq_map);
|
|
|
|
set->map[j].mq_map = NULL;
|
|
|
|
}
|
2016-09-14 22:18:53 +08:00
|
|
|
|
2014-04-24 00:07:34 +08:00
|
|
|
kfree(set->tags);
|
2014-09-03 00:38:44 +08:00
|
|
|
set->tags = NULL;
|
2022-11-01 23:00:47 +08:00
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING) {
|
|
|
|
cleanup_srcu_struct(set->srcu);
|
|
|
|
kfree(set->srcu);
|
|
|
|
}
|
2014-04-16 04:14:00 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_free_tag_set);
|
|
|
|
|
2014-05-21 01:49:02 +08:00
|
|
|
int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
|
|
|
|
{
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
int ret;
|
|
|
|
unsigned long i;
|
2014-05-21 01:49:02 +08:00
|
|
|
|
2017-01-17 21:03:22 +08:00
|
|
|
if (!set)
|
2014-05-21 01:49:02 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-02-09 00:14:05 +08:00
|
|
|
if (q->nr_requests == nr)
|
|
|
|
return 0;
|
|
|
|
|
2017-01-20 01:59:07 +08:00
|
|
|
blk_mq_freeze_queue(q);
|
2018-01-06 16:27:38 +08:00
|
|
|
blk_mq_quiesce_queue(q);
|
2017-01-20 01:59:07 +08:00
|
|
|
|
2014-05-21 01:49:02 +08:00
|
|
|
ret = 0;
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2016-02-19 05:56:35 +08:00
|
|
|
if (!hctx->tags)
|
|
|
|
continue;
|
2017-01-17 21:03:22 +08:00
|
|
|
/*
|
|
|
|
* If we're using an MQ scheduler, just update the scheduler
|
|
|
|
* queue depth. This is similar to what the old code would do.
|
|
|
|
*/
|
2021-10-05 18:23:29 +08:00
|
|
|
if (hctx->sched_tags) {
|
2017-01-20 01:59:07 +08:00
|
|
|
ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
|
2021-10-05 18:23:29 +08:00
|
|
|
nr, true);
|
|
|
|
} else {
|
|
|
|
ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
|
|
|
|
false);
|
2017-01-20 01:59:07 +08:00
|
|
|
}
|
2014-05-21 01:49:02 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
2019-01-19 01:34:16 +08:00
|
|
|
if (q->elevator && q->elevator->type->ops.depth_updated)
|
|
|
|
q->elevator->type->ops.depth_updated(hctx);
|
2014-05-21 01:49:02 +08:00
|
|
|
}
|
blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is obtained first, much contention
is introduced because a driver tag may not be available after we have obtained
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with possibly swapping in a new sbitmap for the old one if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-13 20:00:58 +08:00
|
|
|
if (!ret) {
|
2014-05-21 01:49:02 +08:00
|
|
|
q->nr_requests = nr;
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
2021-10-05 18:23:28 +08:00
|
|
|
if (q->elevator)
|
2021-10-05 18:23:39 +08:00
|
|
|
blk_mq_tag_update_sched_shared_tags(q);
|
2021-10-05 18:23:28 +08:00
|
|
|
else
|
2021-10-05 18:23:39 +08:00
|
|
|
blk_mq_tag_resize_shared_tags(set, nr);
|
2021-10-05 18:23:28 +08:00
|
|
|
}
|
2021-05-13 20:00:58 +08:00
|
|
|
}
|
2014-05-21 01:49:02 +08:00
|
|
|
|
2018-01-06 16:27:38 +08:00
|
|
|
blk_mq_unquiesce_queue(q);
|
2017-01-20 01:59:07 +08:00
|
|
|
blk_mq_unfreeze_queue(q);
|
|
|
|
|
2014-05-21 01:49:02 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-08-21 15:15:03 +08:00
|
|
|
/*
|
|
|
|
* request_queue and elevator_type pair.
|
|
|
|
* It is just used by __blk_mq_update_nr_hw_queues to cache
|
|
|
|
* the elevator_type associated with a request_queue.
|
|
|
|
*/
|
|
|
|
struct blk_mq_qe_pair {
|
|
|
|
struct list_head node;
|
|
|
|
struct request_queue *q;
|
|
|
|
struct elevator_type *type;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cache the elevator_type in qe pair list and switch the
|
|
|
|
* io scheduler to 'none'
|
|
|
|
*/
|
|
|
|
static bool blk_mq_elv_switch_none(struct list_head *head,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_qe_pair *qe;
|
|
|
|
|
|
|
|
qe = kmalloc(sizeof(*qe), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
|
|
|
|
if (!qe)
|
|
|
|
return false;
|
|
|
|
|
2022-06-16 09:43:59 +08:00
|
|
|
/* q->elevator needs protection from ->sysfs_lock */
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
|
|
|
|
2023-06-16 21:23:54 +08:00
|
|
|
/* the check has to be done with holding sysfs_lock */
|
|
|
|
if (!q->elevator) {
|
|
|
|
kfree(qe);
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
|
2018-08-21 15:15:03 +08:00
|
|
|
INIT_LIST_HEAD(&qe->node);
|
|
|
|
qe->q = q;
|
|
|
|
qe->type = q->elevator->type;
|
2022-10-20 14:48:16 +08:00
|
|
|
/* keep a reference to the elevator module as we'll switch back */
|
|
|
|
__elevator_get(qe->type);
|
2018-08-21 15:15:03 +08:00
|
|
|
list_add(&qe->node, head);
|
2022-10-30 18:07:14 +08:00
|
|
|
elevator_disable(q);
|
2023-06-16 21:23:54 +08:00
|
|
|
unlock:
|
2018-08-21 15:15:03 +08:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2022-03-31 17:12:18 +08:00
|
|
|
static struct blk_mq_qe_pair *blk_lookup_qe_pair(struct list_head *head,
|
|
|
|
struct request_queue *q)
|
2018-08-21 15:15:03 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_qe_pair *qe;
|
|
|
|
|
|
|
|
list_for_each_entry(qe, head, node)
|
2022-03-31 17:12:18 +08:00
|
|
|
if (qe->q == q)
|
|
|
|
return qe;
|
2018-08-21 15:15:03 +08:00
|
|
|
|
2022-03-31 17:12:18 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
2018-08-21 15:15:03 +08:00
|
|
|
|
2022-03-31 17:12:18 +08:00
|
|
|
static void blk_mq_elv_switch_back(struct list_head *head,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_qe_pair *qe;
|
|
|
|
struct elevator_type *t;
|
|
|
|
|
|
|
|
qe = blk_lookup_qe_pair(head, q);
|
|
|
|
if (!qe)
|
|
|
|
return;
|
|
|
|
t = qe->type;
|
2018-08-21 15:15:03 +08:00
|
|
|
list_del(&qe->node);
|
|
|
|
kfree(qe);
|
|
|
|
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
2022-09-27 23:56:52 +08:00
|
|
|
elevator_switch(q, t);
|
block: fix up elevator_type refcounting
The current reference management logic of io scheduler modules contains
refcnt problems. For example, blk_mq_init_sched may fail before or after
the calling of e->ops.init_sched. If it fails before the calling, it does
nothing to the reference to the io scheduler module. But if it fails after
the calling, it releases the reference by calling kobject_put(&eq->kobj).
As the callers of blk_mq_init_sched can't know exactly where the failure
happens, they can't handle the reference to the io scheduler module
properly: releasing the reference on failure results in double-release if
blk_mq_init_sched has released it, and not releasing the reference results
in ghost reference if blk_mq_init_sched did not release it either.
The same problem also exists in io schedulers' init_sched implementations.
We can address the problem by adding releasing statements to the error
handling procedures of blk_mq_init_sched and init_sched implementations.
But that is counterintuitive and requires modifications to existing io
schedulers.
Instead, We make elevator_alloc get the io scheduler module references
that will be released by elevator_release. And then, we match each
elevator_get with an elevator_put. Therefore, each reference to an io
scheduler module explicitly has its own getter and releaser, and we no
longer need to worry about the refcnt problems.
The bugs and the patch can be validated with tools here:
https://github.com/nickyc975/linux_elv_refcnt_bug.git
[hch: split out a few bits into separate patches, use a non-try
module_get in elevator_alloc]
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20 14:48:19 +08:00
|
|
|
/* drop the reference acquired in blk_mq_elv_switch_none */
|
|
|
|
elevator_put(t);
|
2018-08-21 15:15:03 +08:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
}
|
|
|
|
|
2017-05-31 02:39:11 +08:00
|
|
|
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
|
|
|
|
int nr_hw_queues)
|
2015-12-18 08:08:14 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2018-08-21 15:15:03 +08:00
|
|
|
LIST_HEAD(head);
|
2023-09-08 08:57:02 +08:00
|
|
|
int prev_nr_hw_queues = set->nr_hw_queues;
|
|
|
|
int i;
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2017-04-08 02:16:49 +08:00
|
|
|
lockdep_assert_held(&set->tag_list_lock);
|
|
|
|
|
2018-10-30 03:25:27 +08:00
|
|
|
if (set->nr_maps == 1 && nr_hw_queues > nr_cpu_ids)
|
2015-12-18 08:08:14 +08:00
|
|
|
nr_hw_queues = nr_cpu_ids;
|
2020-06-17 14:18:37 +08:00
|
|
|
if (nr_hw_queues < 1)
|
|
|
|
return;
|
|
|
|
if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
|
2015-12-18 08:08:14 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_freeze_queue(q);
|
2018-08-21 15:15:03 +08:00
|
|
|
/*
|
|
|
|
* Switch IO scheduler to 'none', cleaning up the data associated
|
|
|
|
* with the previous scheduler. We will switch back once we are done
|
|
|
|
* updating the new sw to hw queue mappings.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
if (!blk_mq_elv_switch_none(&head, q))
|
|
|
|
goto switch_back;
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2018-10-12 18:07:25 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_debugfs_unregister_hctxs(q);
|
2022-06-29 01:18:49 +08:00
|
|
|
blk_mq_sysfs_unregister_hctxs(q);
|
2018-10-12 18:07:25 +08:00
|
|
|
}
|
|
|
|
|
2022-11-09 18:08:11 +08:00
|
|
|
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
|
2019-10-26 00:50:10 +08:00
|
|
|
goto reregister;
|
|
|
|
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increase nr_hw_queues, we may fail due to a shortage of
memory or some other reason; blk_mq_realloc_hw_ctxs then stops and
some entries in q->queue_hw_ctx are left NULL. However, because the
queue map has already been updated with the new nr_hw_queues, some cpus
are mapped to a hw queue whose allocation just failed, so
blk_mq_map_queue could return NULL. This will cause a panic in the
following blk_mq_map_swqueue.
To fix it, when increasing nr_hw_queues fails, fall back to the previous
nr_hw_queues and post a warning. At the same time, a driver's .map_queues
usually uses completion irq affinity to map hw queues to cpus, so falling
back to the previous nr_hw_queues would leave some cpus without a mapping
to a hw queue; use the default blk_mq_map_queues to do the mapping instead.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
fallback:
|
2020-05-13 08:44:05 +08:00
|
|
|
blk_mq_update_queue_map(set);
|
2015-12-18 08:08:14 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_realloc_hw_ctxs(set, q);
|
2022-03-08 15:32:16 +08:00
|
|
|
blk_mq_update_poll_flag(q);
|
2018-10-12 18:07:28 +08:00
|
|
|
if (q->nr_hw_queues != set->nr_hw_queues) {
|
2021-11-08 15:40:19 +08:00
|
|
|
int i = prev_nr_hw_queues;
|
|
|
|
|
2018-10-12 18:07:28 +08:00
|
|
|
pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
|
|
|
|
nr_hw_queues, prev_nr_hw_queues);
|
2021-11-08 15:40:19 +08:00
|
|
|
for (; i < set->nr_hw_queues; i++)
|
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
|
|
|
|
2018-10-12 18:07:28 +08:00
|
|
|
set->nr_hw_queues = prev_nr_hw_queues;
|
|
|
|
goto fallback;
|
|
|
|
}
|
2018-10-12 18:07:25 +08:00
|
|
|
blk_mq_map_swqueue(q);
|
|
|
|
}
|
|
|
|
|
2019-10-26 00:50:10 +08:00
|
|
|
reregister:
|
2018-10-12 18:07:25 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
2022-06-29 01:18:49 +08:00
|
|
|
blk_mq_sysfs_register_hctxs(q);
|
2018-10-12 18:07:25 +08:00
|
|
|
blk_mq_debugfs_register_hctxs(q);
|
2015-12-18 08:08:14 +08:00
|
|
|
}
|
|
|
|
|
2018-08-21 15:15:03 +08:00
|
|
|
switch_back:
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_elv_switch_back(&head, q);
|
|
|
|
|
2015-12-18 08:08:14 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_unfreeze_queue(q);
|
2023-09-08 08:57:02 +08:00
|
|
|
|
|
|
|
/* Free the excess tags when nr_hw_queues shrink. */
|
|
|
|
for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
|
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
2015-12-18 08:08:14 +08:00
|
|
|
}
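The grow-with-fallback and shrink-with-cleanup steps in __blk_mq_update_nr_hw_queues() above can be sketched in userspace C. This is a simplified illustration, not kernel code: struct queue_set, queue_set_resize(), and the two allocator helpers are hypothetical names, malloc() stands in for hctx allocation, and the queue freezing, elevator switching, and CPU-to-queue remapping that the real function performs are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tag set; not the kernel's struct blk_mq_tag_set. */
struct queue_set {
	int nr_queues;	/* queues currently in use */
	void **ctx;	/* one context pointer per queue slot */
	int capacity;	/* number of slots in ctx[] */
};

/* Hypothetical allocation hook, so callers can inject failures. */
typedef void *(*alloc_fn)(int index);

/* Example allocator that succeeds (as long as malloc does). */
static void *alloc_always(int index)
{
	(void)index;
	return malloc(1);
}

/* Example allocator that fails for slot 3 and above. */
static void *alloc_fail_from_three(int index)
{
	return index >= 3 ? NULL : malloc(1);
}

/*
 * Resize the set to new_nr queues.
 * Growing: if any allocation fails, free what was allocated beyond the
 * previous count and keep the previous count.
 * Shrinking: free the excess slots.
 * Returns the queue count in effect afterwards.
 */
static int queue_set_resize(struct queue_set *set, int new_nr, alloc_fn alloc)
{
	int prev = set->nr_queues;
	int i;

	assert(new_nr > 0 && new_nr <= set->capacity);

	for (i = prev; i < new_nr; i++) {
		set->ctx[i] = alloc(i);
		if (!set->ctx[i])
			break;
	}

	if (i < new_nr) {
		/* Growth failed part-way: undo it and fall back to prev. */
		while (--i >= prev) {
			free(set->ctx[i]);
			set->ctx[i] = NULL;
		}
		return set->nr_queues;
	}

	/* Shrink case: release slots that are no longer needed. */
	for (i = new_nr; i < prev; i++) {
		free(set->ctx[i]);
		set->ctx[i] = NULL;
	}

	set->nr_queues = new_nr;
	return new_nr;
}
```

The return value lets a caller notice that a requested increase did not take effect, which is the role the pr_warn() plays in the kernel path.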
|
2017-05-31 02:39:11 +08:00
|
|
|
|
|
|
|
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
|
|
|
|
{
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
|
|
|
__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
|
|
|
|
mutex_unlock(&set->tag_list_lock);
|
|
|
|
}
|
2015-12-18 08:08:14 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
|
|
|
|
|
2023-06-13 03:03:42 +08:00
|
|
|
static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct io_comp_batch *iob, unsigned int flags)
|
2016-11-04 23:34:34 +08:00
|
|
|
{
|
2021-10-12 19:12:16 +08:00
|
|
|
long state = get_current_state();
|
|
|
|
int ret;
|
2016-11-04 23:34:34 +08:00
|
|
|
|
2018-11-14 12:32:10 +08:00
|
|
|
do {
|
2021-10-12 23:24:29 +08:00
|
|
|
ret = q->mq_ops->poll(hctx, iob);
|
2016-11-04 23:34:34 +08:00
|
|
|
if (ret > 0) {
|
2018-11-16 23:37:34 +08:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2018-11-07 04:30:55 +08:00
|
|
|
return ret;
|
2016-11-04 23:34:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (signal_pending_state(state, current))
|
2018-11-16 23:37:34 +08:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2021-06-11 16:28:12 +08:00
|
|
|
if (task_is_running(current))
|
2018-11-07 04:30:55 +08:00
|
|
|
return 1;
|
2021-10-12 19:12:16 +08:00
|
|
|
|
2021-10-12 19:12:19 +08:00
|
|
|
if (ret < 0 || (flags & BLK_POLL_ONESHOT))
|
2016-11-04 23:34:34 +08:00
|
|
|
break;
|
|
|
|
cpu_relax();
|
2018-11-14 12:32:10 +08:00
|
|
|
} while (!need_resched());
|
2016-11-04 23:34:34 +08:00
|
|
|
|
2018-02-13 23:48:12 +08:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2018-11-07 04:30:55 +08:00
|
|
|
return 0;
|
2016-11-04 23:34:34 +08:00
|
|
|
}
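The loop in blk_hctx_poll() above follows a common pattern: call the driver's poll hook, return as soon as it reports completions, stop on error or when the caller asked for a single pass, and otherwise keep spinning until it is time to yield. A minimal userspace sketch, assuming a hypothetical poll_until_done() helper, a hypothetical POLL_ONESHOT flag in the role of BLK_POLL_ONESHOT, and a spin budget in place of need_resched(); task-state and signal handling are omitted:

```c
#include <assert.h>

/* Hypothetical flag; plays the role of BLK_POLL_ONESHOT in this sketch. */
#define POLL_ONESHOT 0x1u

/* Hypothetical poll hook: > 0 completions found, 0 none yet, < 0 error. */
typedef int (*poll_fn)(void *ctx);

/* Example hook: reports one completion once the counter reaches zero. */
static int countdown_poll(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return 0;
	}
	return 1;
}

/*
 * Poll until something completes, an error occurs, a single pass was
 * requested, or the spin budget runs out (a stand-in for need_resched()).
 * Returns the number of completions found, or 0 if none were found.
 */
static int poll_until_done(poll_fn poll, void *ctx, unsigned int flags,
			   int max_spins)
{
	int spins = 0;

	assert(max_spins > 0);

	do {
		int ret = poll(ctx);

		if (ret > 0)
			return ret;
		if (ret < 0 || (flags & POLL_ONESHOT))
			break;
	} while (++spins < max_spins);

	return 0;
}
```

Passing POLL_ONESHOT makes the helper call the hook exactly once, which suits callers that poll opportunistically and do not want to spin.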
|
2018-11-26 23:21:49 +08:00
|
|
|
|
2023-06-13 03:03:42 +08:00
|
|
|
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie,
|
|
|
|
struct io_comp_batch *iob, unsigned int flags)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = xa_load(&q->hctx_table, cookie);
|
|
|
|
|
|
|
|
return blk_hctx_poll(q, hctx, iob, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
|
|
|
|
unsigned int poll_flags)
|
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!blk_rq_is_poll(rq))
|
|
|
|
return 0;
|
|
|
|
if (!percpu_ref_tryget(&q->q_usage_counter))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = blk_hctx_poll(q, rq->mq_hctx, iob, poll_flags);
|
|
|
|
blk_queue_exit(q);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_poll);
|
|
|
|
|
2018-11-01 07:01:22 +08:00
|
|
|
unsigned int blk_mq_rq_cpu(struct request *rq)
|
|
|
|
{
|
|
|
|
return rq->mq_ctx->cpu;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_rq_cpu);
|
|
|
|
|
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
To avoid slowing down queue destruction, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling the
dispatch work is delayed to blk_release_queue().
However, this approach has caused a kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before blk_release_queue() runs,
which is expected too, since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.
Fix the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():
1) when disk_release() runs, the disk has been closed and any sync
dispatch activity has finished, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.
2) in blk_cleanup_queue(), we only care about passthrough requests, and
a passthrough request is always explicitly allocated and freed by
its caller, so once the queue is frozen, all sync dispatch activity
for passthrough requests has finished, and it is enough to just cancel
dispatch work to avoid any further dispatch activity.
[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-16 09:43:43 +08:00
|
|
|
void blk_mq_cancel_work_sync(struct request_queue *q)
|
|
|
|
{
|
2022-10-30 17:47:30 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
unsigned long i;
|
2021-11-16 09:43:43 +08:00
|
|
|
|
2022-10-30 17:47:30 +08:00
|
|
|
cancel_delayed_work_sync(&q->requeue_work);
|
2021-11-16 09:43:43 +08:00
|
|
|
|
2022-10-30 17:47:30 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
cancel_delayed_work_sync(&hctx->run_work);
|
2021-11-16 09:43:43 +08:00
|
|
|
}
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
static int __init blk_mq_init(void)
|
|
|
|
{
|
2020-06-11 14:44:41 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for_each_possible_cpu(i)
|
2021-01-24 04:10:27 +08:00
|
|
|
init_llist_head(&per_cpu(blk_cpu_done, i));
|
2023-07-17 12:00:55 +08:00
|
|
|
for_each_possible_cpu(i)
|
|
|
|
INIT_CSD(&per_cpu(blk_cpu_csd, i),
|
|
|
|
__blk_mq_complete_request_remote, NULL);
|
2020-06-11 14:44:41 +08:00
|
|
|
open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
|
|
|
|
|
|
|
|
cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
|
|
|
|
"block/softirq:dead", NULL,
|
|
|
|
blk_softirq_cpu_dead);
|
2016-09-22 22:05:17 +08:00
|
|
|
cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
|
|
|
|
blk_mq_hctx_notify_dead);
|
2020-05-29 21:53:15 +08:00
|
|
|
cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
|
|
|
|
blk_mq_hctx_notify_online,
|
|
|
|
blk_mq_hctx_notify_offline);
|
2013-10-24 16:20:05 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
subsys_initcall(blk_mq_init);
|