2019-05-01 02:42:43 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2014-05-29 00:15:41 +08:00
|
|
|
/*
|
|
|
|
* Block multiqueue core code
|
|
|
|
*
|
|
|
|
* Copyright (C) 2013-2014 Jens Axboe
|
|
|
|
* Copyright (C) 2013-2014 Christoph Hellwig
|
|
|
|
*/
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/backing-dev.h>
|
|
|
|
#include <linux/bio.h>
|
|
|
|
#include <linux/blkdev.h>
|
2021-09-20 20:33:27 +08:00
|
|
|
#include <linux/blk-integrity.h>
|
2015-09-15 01:16:02 +08:00
|
|
|
#include <linux/kmemleak.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/workqueue.h>
|
|
|
|
#include <linux/smp.h>
|
2021-09-20 20:33:13 +08:00
|
|
|
#include <linux/interrupt.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
#include <linux/llist.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/cache.h>
|
|
|
|
#include <linux/sched/sysctl.h>
|
2017-02-01 23:36:40 +08:00
|
|
|
#include <linux/sched/topology.h>
|
2017-02-03 02:15:33 +08:00
|
|
|
#include <linux/sched/signal.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
#include <linux/delay.h>
|
2014-09-17 22:27:03 +08:00
|
|
|
#include <linux/crash_dump.h>
|
2016-08-25 22:07:30 +08:00
|
|
|
#include <linux/prefetch.h>
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
|
|
|
#include <linux/blk-crypto.h>
|
2021-11-24 02:53:12 +08:00
|
|
|
#include <linux/part_stat.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
#include <trace/events/block.h>
|
|
|
|
|
|
|
|
#include <linux/blk-mq.h>
|
2019-09-16 23:44:29 +08:00
|
|
|
#include <linux/t10-pi.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
#include "blk.h"
|
|
|
|
#include "blk-mq.h"
|
2017-05-04 22:17:21 +08:00
|
|
|
#include "blk-mq-debugfs.h"
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
#include "blk-mq-tag.h"
|
2018-09-27 05:01:10 +08:00
|
|
|
#include "blk-pm.h"
|
2016-11-08 12:32:37 +08:00
|
|
|
#include "blk-stat.h"
|
2017-01-17 21:03:22 +08:00
|
|
|
#include "blk-mq-sched.h"
|
2018-07-03 23:14:59 +08:00
|
|
|
#include "blk-rq-qos.h"
|
2022-06-23 15:48:32 +08:00
|
|
|
#include "blk-ioprio.h"
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
|
2020-06-11 14:44:41 +08:00
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
static void blk_mq_poll_stats_start(struct request_queue *q);
|
|
|
|
static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
|
|
|
|
|
2017-04-07 20:24:03 +08:00
|
|
|
static int blk_mq_poll_stats_bkt(const struct request *rq)
|
|
|
|
{
|
2019-05-21 15:59:03 +08:00
|
|
|
int ddir, sectors, bucket;
|
2017-04-07 20:24:03 +08:00
|
|
|
|
2017-04-21 21:55:42 +08:00
|
|
|
ddir = rq_data_dir(rq);
|
2019-05-21 15:59:03 +08:00
|
|
|
sectors = blk_rq_stats_sectors(rq);
|
2017-04-07 20:24:03 +08:00
|
|
|
|
2019-05-21 15:59:03 +08:00
|
|
|
bucket = ddir + 2 * ilog2(sectors);
|
2017-04-07 20:24:03 +08:00
|
|
|
|
|
|
|
if (bucket < 0)
|
|
|
|
return -1;
|
|
|
|
else if (bucket >= BLK_MQ_POLL_STATS_BKTS)
|
|
|
|
return ddir + BLK_MQ_POLL_STATS_BKTS - 2;
|
|
|
|
|
|
|
|
return bucket;
|
|
|
|
}
|
|
|
|
|
2021-10-12 19:12:24 +08:00
|
|
|
#define BLK_QC_T_SHIFT 16
|
|
|
|
#define BLK_QC_T_INTERNAL (1U << 31)
|
|
|
|
|
2021-10-12 19:12:15 +08:00
|
|
|
static inline struct blk_mq_hw_ctx *blk_qc_to_hctx(struct request_queue *q,
|
|
|
|
blk_qc_t qc)
|
|
|
|
{
|
2022-03-08 15:32:19 +08:00
|
|
|
return xa_load(&q->hctx_table,
|
|
|
|
(qc & ~BLK_QC_T_INTERNAL) >> BLK_QC_T_SHIFT);
|
2021-10-12 19:12:15 +08:00
|
|
|
}
|
|
|
|
|
2021-10-12 19:12:16 +08:00
|
|
|
static inline struct request *blk_qc_to_rq(struct blk_mq_hw_ctx *hctx,
|
|
|
|
blk_qc_t qc)
|
|
|
|
{
|
2021-10-12 19:12:17 +08:00
|
|
|
unsigned int tag = qc & ((1U << BLK_QC_T_SHIFT) - 1);
|
|
|
|
|
|
|
|
if (qc & BLK_QC_T_INTERNAL)
|
|
|
|
return blk_mq_tag_to_rq(hctx->sched_tags, tag);
|
|
|
|
return blk_mq_tag_to_rq(hctx->tags, tag);
|
2021-10-12 19:12:16 +08:00
|
|
|
}
|
|
|
|
|
2021-10-12 19:12:24 +08:00
|
|
|
static inline blk_qc_t blk_rq_to_qc(struct request *rq)
|
|
|
|
{
|
|
|
|
return (rq->mq_hctx->queue_num << BLK_QC_T_SHIFT) |
|
|
|
|
(rq->tag != -1 ?
|
|
|
|
rq->tag : (rq->internal_tag | BLK_QC_T_INTERNAL));
|
|
|
|
}
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
2019-03-24 17:57:08 +08:00
|
|
|
* Check if any of the ctx, dispatch list or elevator
|
|
|
|
* have pending work in this hardware queue.
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
*/
|
2017-11-11 00:13:21 +08:00
|
|
|
static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2017-11-11 00:13:21 +08:00
|
|
|
return !list_empty_careful(&hctx->dispatch) ||
|
|
|
|
sbitmap_any_bit_set(&hctx->ctx_map) ||
|
2017-01-17 21:03:22 +08:00
|
|
|
blk_mq_sched_has_work(hctx);
|
2014-05-19 23:23:55 +08:00
|
|
|
}
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* Mark this ctx as having pending work in this hardware queue
|
|
|
|
*/
|
|
|
|
static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *ctx)
|
|
|
|
{
|
2018-10-30 03:13:29 +08:00
|
|
|
const int bit = ctx->index_hw[hctx->type];
|
|
|
|
|
|
|
|
if (!sbitmap_test_bit(&hctx->ctx_map, bit))
|
|
|
|
sbitmap_set_bit(&hctx->ctx_map, bit);
|
2014-05-19 23:23:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *ctx)
|
|
|
|
{
|
2018-10-30 03:13:29 +08:00
|
|
|
const int bit = ctx->index_hw[hctx->type];
|
|
|
|
|
|
|
|
sbitmap_clear_bit(&hctx->ctx_map, bit);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2017-08-09 07:51:45 +08:00
|
|
|
struct mq_inflight {
|
2020-11-24 16:36:54 +08:00
|
|
|
struct block_device *part;
|
2019-10-01 02:55:34 +08:00
|
|
|
unsigned int inflight[2];
|
2017-08-09 07:51:45 +08:00
|
|
|
};
|
|
|
|
|
2022-07-06 20:03:53 +08:00
|
|
|
static bool blk_mq_check_inflight(struct request *rq, void *priv)
|
2017-08-09 07:51:45 +08:00
|
|
|
{
|
|
|
|
struct mq_inflight *mi = priv;
|
|
|
|
|
2022-05-30 14:40:59 +08:00
|
|
|
if (rq->part && blk_do_io_stat(rq) &&
|
|
|
|
(!mi->part->bd_partno || rq->part == mi->part) &&
|
2020-12-02 19:11:45 +08:00
|
|
|
blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
|
2019-10-01 02:55:33 +08:00
|
|
|
mi->inflight[rq_data_dir(rq)]++;
|
2018-11-09 01:24:07 +08:00
|
|
|
|
|
|
|
return true;
|
2017-08-09 07:51:45 +08:00
|
|
|
}
|
|
|
|
|
2020-11-24 16:36:54 +08:00
|
|
|
unsigned int blk_mq_in_flight(struct request_queue *q,
|
|
|
|
struct block_device *part)
|
2017-08-09 07:51:45 +08:00
|
|
|
{
|
2019-10-01 02:55:34 +08:00
|
|
|
struct mq_inflight mi = { .part = part };
|
2017-08-09 07:51:45 +08:00
|
|
|
|
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
|
2018-12-07 00:41:21 +08:00
|
|
|
|
2019-10-01 02:55:34 +08:00
|
|
|
return mi.inflight[0] + mi.inflight[1];
|
2018-04-26 15:21:59 +08:00
|
|
|
}
|
|
|
|
|
2020-11-24 16:36:54 +08:00
|
|
|
void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
|
|
|
|
unsigned int inflight[2])
|
2018-04-26 15:21:59 +08:00
|
|
|
{
|
2019-10-01 02:55:34 +08:00
|
|
|
struct mq_inflight mi = { .part = part };
|
2018-04-26 15:21:59 +08:00
|
|
|
|
2019-10-01 02:55:33 +08:00
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
|
2019-10-01 02:55:34 +08:00
|
|
|
inflight[0] = mi.inflight[0];
|
|
|
|
inflight[1] = mi.inflight[1];
|
2018-04-26 15:21:59 +08:00
|
|
|
}
|
|
|
|
|
2017-03-27 20:06:57 +08:00
|
|
|
void blk_freeze_queue_start(struct request_queue *q)
|
2013-12-26 21:31:35 +08:00
|
|
|
{
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_lock(&q->mq_freeze_lock);
|
|
|
|
if (++q->mq_freeze_depth == 1) {
|
2015-10-22 01:20:12 +08:00
|
|
|
percpu_ref_kill(&q->q_usage_counter);
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
2018-11-16 03:22:51 +08:00
|
|
|
if (queue_is_mq(q))
|
2017-11-10 02:49:53 +08:00
|
|
|
blk_mq_run_hw_queues(q, false);
|
2019-05-21 11:25:55 +08:00
|
|
|
} else {
|
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
2014-08-16 20:02:24 +08:00
|
|
|
}
|
2014-11-05 02:52:27 +08:00
|
|
|
}
|
2017-03-27 20:06:57 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
|
2014-11-05 02:52:27 +08:00
|
|
|
|
2017-03-02 03:22:10 +08:00
|
|
|
void blk_mq_freeze_queue_wait(struct request_queue *q)
|
2014-11-05 02:52:27 +08:00
|
|
|
{
|
2015-10-22 01:20:12 +08:00
|
|
|
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
|
2013-12-26 21:31:35 +08:00
|
|
|
}
|
2017-03-02 03:22:10 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
|
2013-12-26 21:31:35 +08:00
|
|
|
|
2017-03-02 03:22:11 +08:00
|
|
|
int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
|
|
|
|
unsigned long timeout)
|
|
|
|
{
|
|
|
|
return wait_event_timeout(q->mq_freeze_wq,
|
|
|
|
percpu_ref_is_zero(&q->q_usage_counter),
|
|
|
|
timeout);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
|
2013-12-26 21:31:35 +08:00
|
|
|
|
2014-11-05 02:52:27 +08:00
|
|
|
/*
|
|
|
|
* Guarantee no request is in use, so we can change any data structure of
|
|
|
|
* the queue afterward.
|
|
|
|
*/
|
2015-10-22 01:20:12 +08:00
|
|
|
void blk_freeze_queue(struct request_queue *q)
|
2014-11-05 02:52:27 +08:00
|
|
|
{
|
2015-10-22 01:20:12 +08:00
|
|
|
/*
|
|
|
|
* In the !blk_mq case we are only calling this to kill the
|
|
|
|
* q_usage_counter, otherwise this increases the freeze depth
|
|
|
|
* and waits for it to return to zero. For this reason there is
|
|
|
|
* no blk_unfreeze_queue(), and blk_freeze_queue() is not
|
|
|
|
* exported to drivers as the only user for unfreeze is blk_mq.
|
|
|
|
*/
|
2017-03-27 20:06:57 +08:00
|
|
|
blk_freeze_queue_start(q);
|
2014-11-05 02:52:27 +08:00
|
|
|
blk_mq_freeze_queue_wait(q);
|
|
|
|
}
|
2015-10-22 01:20:12 +08:00
|
|
|
|
|
|
|
void blk_mq_freeze_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* ...just an alias to keep freeze and unfreeze actions balanced
|
|
|
|
* in the blk_mq_* namespace
|
|
|
|
*/
|
|
|
|
blk_freeze_queue(q);
|
|
|
|
}
|
2015-01-03 06:05:12 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
|
2014-11-05 02:52:27 +08:00
|
|
|
|
2021-09-29 15:12:41 +08:00
|
|
|
void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_lock(&q->mq_freeze_lock);
|
2021-09-29 15:12:41 +08:00
|
|
|
if (force_atomic)
|
|
|
|
q->q_usage_counter.data->force_atomic = true;
|
2019-05-21 11:25:55 +08:00
|
|
|
q->mq_freeze_depth--;
|
|
|
|
WARN_ON_ONCE(q->mq_freeze_depth < 0);
|
|
|
|
if (!q->mq_freeze_depth) {
|
2018-09-27 05:01:08 +08:00
|
|
|
percpu_ref_resurrect(&q->q_usage_counter);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
wake_up_all(&q->mq_freeze_wq);
|
2014-07-02 00:34:38 +08:00
|
|
|
}
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2021-09-29 15:12:41 +08:00
|
|
|
|
|
|
|
void blk_mq_unfreeze_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
__blk_mq_unfreeze_queue(q, false);
|
|
|
|
}
|
2014-12-20 08:54:14 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-06-22 01:55:47 +08:00
|
|
|
/*
|
|
|
|
* FIXME: replace the scsi_internal_device_*block_nowait() calls in the
|
|
|
|
* mpt3sas driver such that this function can be removed.
|
|
|
|
*/
|
|
|
|
void blk_mq_quiesce_queue_nowait(struct request_queue *q)
|
|
|
|
{
|
2021-10-14 16:17:10 +08:00
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&q->queue_lock, flags);
|
|
|
|
if (!q->quiesce_depth++)
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
|
|
|
|
spin_unlock_irqrestore(&q->queue_lock, flags);
|
2017-06-22 01:55:47 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
|
|
|
|
|
2016-11-03 00:09:51 +08:00
|
|
|
/**
|
2021-11-09 15:11:41 +08:00
|
|
|
* blk_mq_wait_quiesce_done() - wait until in-progress quiesce is done
|
2022-11-01 23:00:48 +08:00
|
|
|
* @set: tag_set to wait on
|
2016-11-03 00:09:51 +08:00
|
|
|
*
|
2021-11-09 15:11:41 +08:00
|
|
|
* Note: it is driver's responsibility for making sure that quiesce has
|
2022-11-01 23:00:48 +08:00
|
|
|
* been started on or more of the request_queues of the tag_set. This
|
|
|
|
* function only waits for the quiesce on those request_queues that had
|
|
|
|
* the quiesce flag set using blk_mq_quiesce_queue_nowait.
|
2016-11-03 00:09:51 +08:00
|
|
|
*/
|
2022-11-01 23:00:48 +08:00
|
|
|
void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
|
2016-11-03 00:09:51 +08:00
|
|
|
{
|
2022-11-01 23:00:48 +08:00
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING)
|
|
|
|
synchronize_srcu(set->srcu);
|
2021-12-03 21:15:32 +08:00
|
|
|
else
|
2016-11-03 00:09:51 +08:00
|
|
|
synchronize_rcu();
|
|
|
|
}
|
2021-11-09 15:11:41 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_wait_quiesce_done);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
|
|
|
|
* @q: request queue.
|
|
|
|
*
|
|
|
|
* Note: this function does not prevent that the struct request end_io()
|
|
|
|
* callback function is invoked. Once this function is returned, we make
|
|
|
|
* sure no dispatch can happen until the queue is unquiesced via
|
|
|
|
* blk_mq_unquiesce_queue().
|
|
|
|
*/
|
|
|
|
void blk_mq_quiesce_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
blk_mq_quiesce_queue_nowait(q);
|
2022-11-01 23:00:46 +08:00
|
|
|
/* nothing to wait for non-mq queues */
|
|
|
|
if (queue_is_mq(q))
|
2022-11-01 23:00:48 +08:00
|
|
|
blk_mq_wait_quiesce_done(q->tag_set);
|
2021-11-09 15:11:41 +08:00
|
|
|
}
|
2016-11-03 00:09:51 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
|
|
|
|
|
2017-06-06 23:22:03 +08:00
|
|
|
/*
|
|
|
|
* blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
|
|
|
|
* @q: request queue.
|
|
|
|
*
|
|
|
|
* This function recovers queue into the state before quiescing
|
|
|
|
* which is done by blk_mq_quiesce_queue.
|
|
|
|
*/
|
|
|
|
void blk_mq_unquiesce_queue(struct request_queue *q)
|
|
|
|
{
|
2021-10-14 16:17:10 +08:00
|
|
|
unsigned long flags;
|
|
|
|
bool run_queue = false;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&q->queue_lock, flags);
|
|
|
|
if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
|
|
|
|
;
|
|
|
|
} else if (!--q->quiesce_depth) {
|
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
|
|
|
|
run_queue = true;
|
|
|
|
}
|
|
|
|
spin_unlock_irqrestore(&q->queue_lock, flags);
|
2017-06-19 04:24:27 +08:00
|
|
|
|
2017-06-06 23:22:08 +08:00
|
|
|
/* dispatch requests which are inserted during quiescing */
|
2021-10-14 16:17:10 +08:00
|
|
|
if (run_queue)
|
|
|
|
blk_mq_run_hw_queues(q, true);
|
2017-06-06 23:22:03 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
|
|
|
|
|
2022-11-01 23:00:49 +08:00
|
|
|
void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
|
|
|
|
{
|
|
|
|
struct request_queue *q;
|
|
|
|
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
if (!blk_queue_skip_tagset_quiesce(q))
|
|
|
|
blk_mq_quiesce_queue_nowait(q);
|
|
|
|
}
|
|
|
|
blk_mq_wait_quiesce_done(set);
|
|
|
|
mutex_unlock(&set->tag_list_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);
|
|
|
|
|
|
|
|
void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
|
|
|
|
{
|
|
|
|
struct request_queue *q;
|
|
|
|
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
if (!blk_queue_skip_tagset_quiesce(q))
|
|
|
|
blk_mq_unquiesce_queue(q);
|
|
|
|
}
|
|
|
|
mutex_unlock(&set->tag_list_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
|
|
|
|
|
2014-12-23 05:04:42 +08:00
|
|
|
void blk_mq_wake_waiters(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2014-12-23 05:04:42 +08:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
if (blk_mq_hw_queue_mapped(hctx))
|
|
|
|
blk_mq_tag_wakeup_all(hctx->tags, true);
|
|
|
|
}
|
|
|
|
|
2021-11-17 14:13:59 +08:00
|
|
|
void blk_rq_init(struct request_queue *q, struct request *rq)
|
|
|
|
{
|
|
|
|
memset(rq, 0, sizeof(*rq));
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&rq->queuelist);
|
|
|
|
rq->q = q;
|
|
|
|
rq->__sector = (sector_t) -1;
|
|
|
|
INIT_HLIST_NODE(&rq->hash);
|
|
|
|
RB_CLEAR_NODE(&rq->rb_node);
|
|
|
|
rq->tag = BLK_MQ_NO_TAG;
|
|
|
|
rq->internal_tag = BLK_MQ_NO_TAG;
|
|
|
|
rq->start_time_ns = ktime_get_ns();
|
|
|
|
rq->part = NULL;
|
|
|
|
blk_crypto_rq_set_defaults(rq);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_rq_init);
|
|
|
|
|
2017-06-17 00:15:27 +08:00
|
|
|
static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
|
2021-10-19 23:32:58 +08:00
|
|
|
struct blk_mq_tags *tags, unsigned int tag, u64 alloc_time_ns)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-10-19 04:37:28 +08:00
|
|
|
struct blk_mq_ctx *ctx = data->ctx;
|
|
|
|
struct blk_mq_hw_ctx *hctx = data->hctx;
|
|
|
|
struct request_queue *q = data->q;
|
2017-06-17 00:15:27 +08:00
|
|
|
struct request *rq = tags->static_rqs[tag];
|
2017-06-21 02:15:43 +08:00
|
|
|
|
2021-10-19 23:33:00 +08:00
|
|
|
rq->q = q;
|
|
|
|
rq->mq_ctx = ctx;
|
|
|
|
rq->mq_hctx = hctx;
|
|
|
|
rq->cmd_flags = data->cmd_flags;
|
|
|
|
|
|
|
|
if (data->flags & BLK_MQ_REQ_PM)
|
|
|
|
data->rq_flags |= RQF_PM;
|
|
|
|
if (blk_queue_io_stat(q))
|
|
|
|
data->rq_flags |= RQF_IO_STAT;
|
|
|
|
rq->rq_flags = data->rq_flags;
|
|
|
|
|
2021-10-19 23:32:57 +08:00
|
|
|
if (!(data->rq_flags & RQF_ELV)) {
|
2017-06-17 00:15:27 +08:00
|
|
|
rq->tag = tag;
|
2020-05-29 21:53:12 +08:00
|
|
|
rq->internal_tag = BLK_MQ_NO_TAG;
|
2021-10-19 23:32:57 +08:00
|
|
|
} else {
|
|
|
|
rq->tag = BLK_MQ_NO_TAG;
|
|
|
|
rq->internal_tag = tag;
|
2017-06-17 00:15:27 +08:00
|
|
|
}
|
2021-10-19 23:33:00 +08:00
|
|
|
rq->timeout = 0;
|
2017-06-17 00:15:27 +08:00
|
|
|
|
2021-10-19 04:37:27 +08:00
|
|
|
if (blk_mq_need_time_stamp(rq))
|
|
|
|
rq->start_time_ns = ktime_get_ns();
|
|
|
|
else
|
|
|
|
rq->start_time_ns = 0;
|
2014-05-06 18:12:45 +08:00
|
|
|
rq->part = NULL;
|
2019-08-29 06:05:57 +08:00
|
|
|
#ifdef CONFIG_BLK_RQ_ALLOC_TIME
|
|
|
|
rq->alloc_time_ns = alloc_time_ns;
|
|
|
|
#endif
|
2018-05-09 17:08:50 +08:00
|
|
|
rq->io_start_time_ns = 0;
|
2019-05-21 15:59:03 +08:00
|
|
|
rq->stats_sectors = 0;
|
2014-05-06 18:12:45 +08:00
|
|
|
rq->nr_phys_segments = 0;
|
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
rq->nr_integrity_segments = 0;
|
|
|
|
#endif
|
|
|
|
rq->end_io = NULL;
|
|
|
|
rq->end_io_data = NULL;
|
|
|
|
|
2021-10-19 04:37:27 +08:00
|
|
|
blk_crypto_rq_set_defaults(rq);
|
|
|
|
INIT_LIST_HEAD(&rq->queuelist);
|
|
|
|
/* tag was already set */
|
|
|
|
WRITE_ONCE(rq->deadline, 0);
|
2021-10-15 04:39:59 +08:00
|
|
|
req_ref_set(rq, 1);
|
2020-05-29 21:53:10 +08:00
|
|
|
|
2021-10-19 04:37:27 +08:00
|
|
|
if (rq->rq_flags & RQF_ELV) {
|
2020-05-29 21:53:10 +08:00
|
|
|
struct elevator_queue *e = data->q->elevator;
|
|
|
|
|
2021-10-19 04:37:27 +08:00
|
|
|
INIT_HLIST_NODE(&rq->hash);
|
|
|
|
RB_CLEAR_NODE(&rq->rb_node);
|
|
|
|
|
|
|
|
if (!op_is_flush(data->cmd_flags) &&
|
|
|
|
e->type->ops.prepare_request) {
|
2020-05-29 21:53:10 +08:00
|
|
|
e->type->ops.prepare_request(rq);
|
|
|
|
rq->rq_flags |= RQF_ELVPRIV;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-06-17 00:15:27 +08:00
|
|
|
return rq;
|
2014-05-28 02:59:47 +08:00
|
|
|
}
|
|
|
|
|
2021-10-10 03:10:39 +08:00
|
|
|
static inline struct request *
|
|
|
|
__blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data,
|
|
|
|
u64 alloc_time_ns)
|
|
|
|
{
|
|
|
|
unsigned int tag, tag_offset;
|
2021-10-19 23:32:58 +08:00
|
|
|
struct blk_mq_tags *tags;
|
2021-10-10 03:10:39 +08:00
|
|
|
struct request *rq;
|
2021-10-19 23:32:58 +08:00
|
|
|
unsigned long tag_mask;
|
2021-10-10 03:10:39 +08:00
|
|
|
int i, nr = 0;
|
|
|
|
|
2021-10-19 23:32:58 +08:00
|
|
|
tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
|
|
|
|
if (unlikely(!tag_mask))
|
2021-10-10 03:10:39 +08:00
|
|
|
return NULL;
|
|
|
|
|
2021-10-19 23:32:58 +08:00
|
|
|
tags = blk_mq_tags_from_data(data);
|
|
|
|
for (i = 0; tag_mask; i++) {
|
|
|
|
if (!(tag_mask & (1UL << i)))
|
2021-10-10 03:10:39 +08:00
|
|
|
continue;
|
|
|
|
tag = tag_offset + i;
|
2021-11-01 20:56:09 +08:00
|
|
|
prefetch(tags->static_rqs[tag]);
|
2021-10-19 23:32:58 +08:00
|
|
|
tag_mask &= ~(1UL << i);
|
|
|
|
rq = blk_mq_rq_ctx_init(data, tags, tag, alloc_time_ns);
|
2021-10-13 21:58:52 +08:00
|
|
|
rq_list_add(data->cached_rq, rq);
|
2021-11-03 19:49:07 +08:00
|
|
|
nr++;
|
2021-10-10 03:10:39 +08:00
|
|
|
}
|
2021-11-03 19:49:07 +08:00
|
|
|
/* caller already holds a reference, add for remainder */
|
|
|
|
percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
|
2021-10-10 03:10:39 +08:00
|
|
|
data->nr_tags -= nr;
|
|
|
|
|
2021-10-13 21:58:52 +08:00
|
|
|
return rq_list_pop(data->cached_rq);
|
2021-10-10 03:10:39 +08:00
|
|
|
}
|
|
|
|
|
2021-10-12 18:40:44 +08:00
|
|
|
static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
|
2017-06-17 00:15:19 +08:00
|
|
|
{
|
2020-05-29 21:53:09 +08:00
|
|
|
struct request_queue *q = data->q;
|
2019-08-29 06:05:57 +08:00
|
|
|
u64 alloc_time_ns = 0;
|
2021-10-06 20:34:11 +08:00
|
|
|
struct request *rq;
|
2020-05-29 21:53:13 +08:00
|
|
|
unsigned int tag;
|
2017-06-17 00:15:19 +08:00
|
|
|
|
2019-08-29 06:05:57 +08:00
|
|
|
/* alloc_time includes depth and tag waits */
|
|
|
|
if (blk_queue_rq_alloc_time(q))
|
|
|
|
alloc_time_ns = ktime_get_ns();
|
|
|
|
|
2018-10-30 03:11:38 +08:00
|
|
|
if (data->cmd_flags & REQ_NOWAIT)
|
2017-06-20 20:05:46 +08:00
|
|
|
data->flags |= BLK_MQ_REQ_NOWAIT;
|
2017-06-17 00:15:19 +08:00
|
|
|
|
2021-11-02 22:34:09 +08:00
|
|
|
if (q->elevator) {
|
|
|
|
struct elevator_queue *e = q->elevator;
|
|
|
|
|
|
|
|
data->rq_flags |= RQF_ELV;
|
|
|
|
|
2017-06-17 00:15:19 +08:00
|
|
|
/*
|
2021-04-15 11:39:20 +08:00
|
|
|
* Flush/passthrough requests are special and go directly to the
|
2018-05-10 03:28:50 +08:00
|
|
|
* dispatch list. Don't include reserved tags in the
|
|
|
|
* limiting, as it isn't useful.
|
2017-06-17 00:15:19 +08:00
|
|
|
*/
|
2018-10-30 03:11:38 +08:00
|
|
|
if (!op_is_flush(data->cmd_flags) &&
|
2021-04-15 11:39:20 +08:00
|
|
|
!blk_op_is_passthrough(data->cmd_flags) &&
|
2018-10-30 03:11:38 +08:00
|
|
|
e->type->ops.limit_depth &&
|
2018-05-10 03:28:50 +08:00
|
|
|
!(data->flags & BLK_MQ_REQ_RESERVED))
|
2018-10-30 03:11:38 +08:00
|
|
|
e->type->ops.limit_depth(data->cmd_flags, data);
|
2017-06-17 00:15:19 +08:00
|
|
|
}
|
|
|
|
|
2020-05-29 21:53:15 +08:00
|
|
|
retry:
|
2020-05-29 21:53:13 +08:00
|
|
|
data->ctx = blk_mq_get_ctx(q);
|
|
|
|
data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
|
2021-11-02 22:34:09 +08:00
|
|
|
if (!(data->rq_flags & RQF_ELV))
|
2020-05-29 21:53:13 +08:00
|
|
|
blk_mq_tag_busy(data->hctx);
|
|
|
|
|
2022-07-06 20:03:50 +08:00
|
|
|
if (data->flags & BLK_MQ_REQ_RESERVED)
|
|
|
|
data->rq_flags |= RQF_RESV;
|
|
|
|
|
2021-10-10 03:10:39 +08:00
|
|
|
/*
|
|
|
|
* Try batched alloc if we want more than 1 tag.
|
|
|
|
*/
|
|
|
|
if (data->nr_tags > 1) {
|
|
|
|
rq = __blk_mq_alloc_requests_batch(data, alloc_time_ns);
|
|
|
|
if (rq)
|
|
|
|
return rq;
|
|
|
|
data->nr_tags = 1;
|
|
|
|
}
|
|
|
|
|
2020-05-29 21:53:15 +08:00
|
|
|
/*
|
|
|
|
* Waiting allocations only fail because of an inactive hctx. In that
|
|
|
|
* case just retry the hctx assignment and tag allocation as CPU hotplug
|
|
|
|
* should have migrated us to an online CPU by now.
|
|
|
|
*/
|
2017-06-17 00:15:27 +08:00
|
|
|
tag = blk_mq_get_tag(data);
|
2020-05-29 21:53:15 +08:00
|
|
|
if (tag == BLK_MQ_NO_TAG) {
|
|
|
|
if (data->flags & BLK_MQ_REQ_NOWAIT)
|
|
|
|
return NULL;
|
|
|
|
/*
|
2021-10-10 03:10:39 +08:00
|
|
|
* Give up the CPU and sleep for a random short time to
|
|
|
|
* ensure that thread using a realtime scheduling class
|
|
|
|
* are migrated off the CPU, and thus off the hctx that
|
|
|
|
* is going away.
|
2020-05-29 21:53:15 +08:00
|
|
|
*/
|
|
|
|
msleep(3);
|
|
|
|
goto retry;
|
|
|
|
}
|
2021-10-06 20:34:11 +08:00
|
|
|
|
2021-10-19 23:32:58 +08:00
|
|
|
return blk_mq_rq_ctx_init(data, blk_mq_tags_from_data(data), tag,
|
|
|
|
alloc_time_ns);
|
2017-06-17 00:15:19 +08:00
|
|
|
}
|
|
|
|
|
2022-09-21 22:22:09 +08:00
|
|
|
static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
|
|
|
|
struct blk_plug *plug,
|
|
|
|
blk_opf_t opf,
|
|
|
|
blk_mq_req_flags_t flags)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2020-05-29 21:53:09 +08:00
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.flags = flags,
|
2022-07-15 02:06:32 +08:00
|
|
|
.cmd_flags = opf,
|
2022-09-21 22:22:09 +08:00
|
|
|
.nr_tags = plug->nr_ios,
|
|
|
|
.cached_rq = &plug->cached_rq,
|
2020-05-29 21:53:09 +08:00
|
|
|
};
|
2017-01-17 21:03:22 +08:00
|
|
|
struct request *rq;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2022-09-21 22:22:09 +08:00
|
|
|
if (blk_queue_enter(q, flags))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
plug->nr_ios = 1;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-10-12 18:40:44 +08:00
|
|
|
rq = __blk_mq_alloc_requests(&data);
|
2022-09-21 22:22:09 +08:00
|
|
|
if (unlikely(!rq))
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct request *blk_mq_alloc_cached_request(struct request_queue *q,
|
|
|
|
blk_opf_t opf,
|
|
|
|
blk_mq_req_flags_t flags)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = current->plug;
|
|
|
|
struct request *rq;
|
|
|
|
|
|
|
|
if (!plug)
|
|
|
|
return NULL;
|
2022-11-02 10:52:30 +08:00
|
|
|
|
2022-09-21 22:22:09 +08:00
|
|
|
if (rq_list_empty(plug->cached_rq)) {
|
|
|
|
if (plug->nr_ios == 1)
|
|
|
|
return NULL;
|
|
|
|
rq = blk_mq_rq_cache_fill(q, plug, opf, flags);
|
2022-11-02 10:52:30 +08:00
|
|
|
if (!rq)
|
|
|
|
return NULL;
|
|
|
|
} else {
|
|
|
|
rq = rq_list_peek(&plug->cached_rq);
|
|
|
|
if (!rq || rq->q != q)
|
|
|
|
return NULL;
|
2022-09-21 22:22:09 +08:00
|
|
|
|
2022-11-02 10:52:30 +08:00
|
|
|
if (blk_mq_get_hctx_type(opf) != rq->mq_hctx->type)
|
|
|
|
return NULL;
|
|
|
|
if (op_is_flush(rq->cmd_flags) != op_is_flush(opf))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
plug->cached_rq = rq_list_next(rq);
|
|
|
|
}
|
2022-09-21 22:22:09 +08:00
|
|
|
|
|
|
|
rq->cmd_flags = opf;
|
|
|
|
INIT_LIST_HEAD(&rq->queuelist);
|
|
|
|
return rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
|
|
|
|
blk_mq_req_flags_t flags)
|
|
|
|
{
|
|
|
|
struct request *rq;
|
|
|
|
|
|
|
|
rq = blk_mq_alloc_cached_request(q, opf, flags);
|
|
|
|
if (!rq) {
|
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.flags = flags,
|
|
|
|
.cmd_flags = opf,
|
|
|
|
.nr_tags = 1,
|
|
|
|
};
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = blk_queue_enter(q, flags);
|
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
|
|
|
rq = __blk_mq_alloc_requests(&data);
|
|
|
|
if (!rq)
|
|
|
|
goto out_queue_exit;
|
|
|
|
}
|
2016-07-19 17:31:50 +08:00
|
|
|
rq->__data_len = 0;
|
|
|
|
rq->__sector = (sector_t) -1;
|
|
|
|
rq->bio = rq->biotail = NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
return rq;
|
2020-05-17 02:27:58 +08:00
|
|
|
out_queue_exit:
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return ERR_PTR(-EWOULDBLOCK);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-05-09 23:36:49 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_request);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-06-21 02:15:39 +08:00
|
|
|
struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
|
2022-07-15 02:06:32 +08:00
|
|
|
blk_opf_t opf, blk_mq_req_flags_t flags, unsigned int hctx_idx)
|
2016-06-13 22:45:21 +08:00
|
|
|
{
|
2020-05-29 21:53:09 +08:00
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.flags = flags,
|
2022-07-15 02:06:32 +08:00
|
|
|
.cmd_flags = opf,
|
2021-10-06 20:34:11 +08:00
|
|
|
.nr_tags = 1,
|
2020-05-29 21:53:09 +08:00
|
|
|
};
|
2020-05-29 21:53:13 +08:00
|
|
|
u64 alloc_time_ns = 0;
|
2022-10-26 18:35:13 +08:00
|
|
|
struct request *rq;
|
2017-02-28 02:28:27 +08:00
|
|
|
unsigned int cpu;
|
2020-05-29 21:53:13 +08:00
|
|
|
unsigned int tag;
|
2016-06-13 22:45:21 +08:00
|
|
|
int ret;
|
|
|
|
|
2020-05-29 21:53:13 +08:00
|
|
|
/* alloc_time includes depth and tag waits */
|
|
|
|
if (blk_queue_rq_alloc_time(q))
|
|
|
|
alloc_time_ns = ktime_get_ns();
|
|
|
|
|
2016-06-13 22:45:21 +08:00
|
|
|
/*
|
|
|
|
* If the tag allocator sleeps we could get an allocation for a
|
|
|
|
* different hardware context. No need to complicate the low level
|
|
|
|
* allocator for this for the rare use case of a command tied to
|
|
|
|
* a specific queue.
|
|
|
|
*/
|
2023-01-18 17:37:13 +08:00
|
|
|
if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)) ||
|
|
|
|
WARN_ON_ONCE(!(flags & BLK_MQ_REQ_RESERVED)))
|
2016-06-13 22:45:21 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
|
|
|
if (hctx_idx >= q->nr_hw_queues)
|
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
|
2017-11-10 02:49:58 +08:00
|
|
|
ret = blk_queue_enter(q, flags);
|
2016-06-13 22:45:21 +08:00
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
2016-09-24 00:25:48 +08:00
|
|
|
/*
|
|
|
|
* Check if the hardware context is actually mapped to anything.
|
|
|
|
* If not tell the caller that it should skip this queue.
|
|
|
|
*/
|
2020-05-17 02:27:58 +08:00
|
|
|
ret = -EXDEV;
|
2022-03-08 15:32:19 +08:00
|
|
|
data.hctx = xa_load(&q->hctx_table, hctx_idx);
|
2020-05-29 21:53:09 +08:00
|
|
|
if (!blk_mq_hw_queue_mapped(data.hctx))
|
2020-05-17 02:27:58 +08:00
|
|
|
goto out_queue_exit;
|
2020-05-29 21:53:09 +08:00
|
|
|
cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
|
2022-06-16 05:00:04 +08:00
|
|
|
if (cpu >= nr_cpu_ids)
|
|
|
|
goto out_queue_exit;
|
2020-05-29 21:53:09 +08:00
|
|
|
data.ctx = __blk_mq_get_ctx(q, cpu);
|
2016-06-13 22:45:21 +08:00
|
|
|
|
2020-06-29 23:08:34 +08:00
|
|
|
if (!q->elevator)
|
2020-05-29 21:53:13 +08:00
|
|
|
blk_mq_tag_busy(data.hctx);
|
2021-11-02 22:34:09 +08:00
|
|
|
else
|
|
|
|
data.rq_flags |= RQF_ELV;
|
2020-05-29 21:53:13 +08:00
|
|
|
|
2022-07-06 20:03:50 +08:00
|
|
|
if (flags & BLK_MQ_REQ_RESERVED)
|
|
|
|
data.rq_flags |= RQF_RESV;
|
|
|
|
|
2020-05-17 02:27:58 +08:00
|
|
|
ret = -EWOULDBLOCK;
|
2020-05-29 21:53:13 +08:00
|
|
|
tag = blk_mq_get_tag(&data);
|
|
|
|
if (tag == BLK_MQ_NO_TAG)
|
2020-05-17 02:27:58 +08:00
|
|
|
goto out_queue_exit;
|
2022-10-26 18:35:13 +08:00
|
|
|
rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag,
|
2021-10-19 23:32:58 +08:00
|
|
|
alloc_time_ns);
|
2022-10-26 18:35:13 +08:00
|
|
|
rq->__data_len = 0;
|
|
|
|
rq->__sector = (sector_t) -1;
|
|
|
|
rq->bio = rq->biotail = NULL;
|
|
|
|
return rq;
|
2020-05-29 21:53:13 +08:00
|
|
|
|
2020-05-17 02:27:58 +08:00
|
|
|
out_queue_exit:
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return ERR_PTR(ret);
|
2016-06-13 22:45:21 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
|
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
static void __blk_mq_free_request(struct request *rq)
|
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
2018-10-30 05:06:13 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2018-05-29 21:52:28 +08:00
|
|
|
const int sched_tag = rq->internal_tag;
|
|
|
|
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
|
|
|
blk_crypto_free_request(rq);
|
2018-09-27 05:01:10 +08:00
|
|
|
blk_pm_mark_last_busy(rq);
|
2018-10-30 05:06:13 +08:00
|
|
|
rq->mq_hctx = NULL;
|
2020-05-29 21:53:12 +08:00
|
|
|
if (rq->tag != BLK_MQ_NO_TAG)
|
2020-02-26 20:10:15 +08:00
|
|
|
blk_mq_put_tag(hctx->tags, ctx, rq->tag);
|
2020-05-29 21:53:12 +08:00
|
|
|
if (sched_tag != BLK_MQ_NO_TAG)
|
2020-02-26 20:10:15 +08:00
|
|
|
blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
|
2018-05-29 21:52:28 +08:00
|
|
|
blk_mq_sched_restart(hctx);
|
|
|
|
blk_queue_exit(q);
|
|
|
|
}
|
|
|
|
|
2017-06-17 00:15:22 +08:00
|
|
|
void blk_mq_free_request(struct request *rq)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
2018-10-30 05:06:13 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2017-06-17 00:15:22 +08:00
|
|
|
|
2021-11-26 19:58:11 +08:00
|
|
|
if ((rq->rq_flags & RQF_ELVPRIV) &&
|
|
|
|
q->elevator->type->ops.finish_request)
|
|
|
|
q->elevator->type->ops.finish_request(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-10-20 21:12:13 +08:00
|
|
|
if (rq->rq_flags & RQF_MQ_INFLIGHT)
|
2020-08-19 23:20:26 +08:00
|
|
|
__blk_mq_dec_active_requests(hctx);
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 03:38:14 +08:00
|
|
|
|
2017-09-30 16:08:24 +08:00
|
|
|
if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
|
2021-08-16 21:46:24 +08:00
|
|
|
laptop_io_completion(q->disk->bdi);
|
2017-09-30 16:08:24 +08:00
|
|
|
|
2018-07-03 23:32:35 +08:00
|
|
|
rq_qos_done(q, rq);
|
2014-05-14 05:10:52 +08:00
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
2021-10-15 04:39:59 +08:00
|
|
|
if (req_ref_put_and_test(rq))
|
2018-05-29 21:52:28 +08:00
|
|
|
__blk_mq_free_request(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-11-18 01:40:48 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_free_request);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-10-06 20:34:11 +08:00
|
|
|
void blk_mq_free_plug_rqs(struct blk_plug *plug)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-10-13 21:58:52 +08:00
|
|
|
struct request *rq;
|
2018-11-29 01:50:07 +08:00
|
|
|
|
2021-11-03 19:49:07 +08:00
|
|
|
while ((rq = rq_list_pop(&plug->cached_rq)) != NULL)
|
2021-10-06 20:34:11 +08:00
|
|
|
blk_mq_free_request(rq);
|
|
|
|
}
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
|
|
|
|
2021-11-17 14:14:02 +08:00
|
|
|
void blk_dump_rq_flags(struct request *rq, char *msg)
|
|
|
|
{
|
|
|
|
printk(KERN_INFO "%s: dev %s: flags=%llx\n", msg,
|
2021-11-26 20:18:00 +08:00
|
|
|
rq->q->disk ? rq->q->disk->disk_name : "?",
|
2022-07-15 02:06:32 +08:00
|
|
|
(__force unsigned long long) rq->cmd_flags);
|
2021-11-17 14:14:02 +08:00
|
|
|
|
|
|
|
printk(KERN_INFO " sector %llu, nr/cnr %u/%u\n",
|
|
|
|
(unsigned long long)blk_rq_pos(rq),
|
|
|
|
blk_rq_sectors(rq), blk_rq_cur_sectors(rq));
|
|
|
|
printk(KERN_INFO " bio %p, biotail %p, len %u\n",
|
|
|
|
rq->bio, rq->biotail, blk_rq_bytes(rq));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_dump_rq_flags);
|
|
|
|
|
2021-10-14 23:17:01 +08:00
|
|
|
static void req_bio_endio(struct request *rq, struct bio *bio,
|
|
|
|
unsigned int nbytes, blk_status_t error)
|
|
|
|
{
|
2021-10-20 05:24:12 +08:00
|
|
|
if (unlikely(error)) {
|
2021-10-14 23:17:01 +08:00
|
|
|
bio->bi_status = error;
|
2021-10-20 05:24:12 +08:00
|
|
|
} else if (req_op(rq) == REQ_OP_ZONE_APPEND) {
|
2021-10-14 23:17:01 +08:00
|
|
|
/*
|
|
|
|
* Partial zone append completions cannot be supported as the
|
|
|
|
* BIO fragments may end up not being written sequentially.
|
|
|
|
*/
|
2021-10-22 23:01:44 +08:00
|
|
|
if (bio->bi_iter.bi_size != nbytes)
|
2021-10-14 23:17:01 +08:00
|
|
|
bio->bi_status = BLK_STS_IOERR;
|
|
|
|
else
|
|
|
|
bio->bi_iter.bi_sector = rq->__sector;
|
|
|
|
}
|
|
|
|
|
2021-10-20 05:24:12 +08:00
|
|
|
bio_advance(bio, nbytes);
|
|
|
|
|
|
|
|
if (unlikely(rq->rq_flags & RQF_QUIET))
|
|
|
|
bio_set_flag(bio, BIO_QUIET);
|
2021-10-14 23:17:01 +08:00
|
|
|
/* don't actually finish bio if it's part of flush sequence */
|
|
|
|
if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
|
|
|
|
bio_endio(bio);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_account_io_completion(struct request *req, unsigned int bytes)
|
|
|
|
{
|
|
|
|
if (req->part && blk_do_io_stat(req)) {
|
|
|
|
const int sgrp = op_stat_group(req_op(req));
|
|
|
|
|
|
|
|
part_stat_lock();
|
|
|
|
part_stat_add(req->part, sectors[sgrp], bytes >> 9);
|
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-11-17 14:14:03 +08:00
|
|
|
static void blk_print_req_error(struct request *req, blk_status_t status)
|
|
|
|
{
|
|
|
|
printk_ratelimited(KERN_ERR
|
|
|
|
"%s error, dev %s, sector %llu op 0x%x:(%s) flags 0x%x "
|
|
|
|
"phys_seg %u prio class %u\n",
|
|
|
|
blk_status_to_str(status),
|
2021-11-26 20:18:00 +08:00
|
|
|
req->q->disk ? req->q->disk->disk_name : "?",
|
2022-07-15 02:06:32 +08:00
|
|
|
blk_rq_pos(req), (__force u32)req_op(req),
|
|
|
|
blk_op_str(req_op(req)),
|
|
|
|
(__force u32)(req->cmd_flags & ~REQ_OP_MASK),
|
2021-11-17 14:14:03 +08:00
|
|
|
req->nr_phys_segments,
|
|
|
|
IOPRIO_PRIO_CLASS(req->ioprio));
|
|
|
|
}
|
|
|
|
|
2021-12-02 06:01:51 +08:00
|
|
|
/*
|
|
|
|
* Fully end IO on a request. Does not support partial completions, or
|
|
|
|
* errors.
|
|
|
|
*/
|
|
|
|
static void blk_complete_request(struct request *req)
|
|
|
|
{
|
|
|
|
const bool is_flush = (req->rq_flags & RQF_FLUSH_SEQ) != 0;
|
|
|
|
int total_bytes = blk_rq_bytes(req);
|
|
|
|
struct bio *bio = req->bio;
|
|
|
|
|
|
|
|
trace_block_rq_complete(req, BLK_STS_OK, total_bytes);
|
|
|
|
|
|
|
|
if (!bio)
|
|
|
|
return;
|
|
|
|
|
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ)
|
|
|
|
req->q->integrity.profile->complete_fn(req, total_bytes);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
blk_account_io_completion(req, total_bytes);
|
|
|
|
|
|
|
|
do {
|
|
|
|
struct bio *next = bio->bi_next;
|
|
|
|
|
|
|
|
/* Completion has already been traced */
|
|
|
|
bio_clear_flag(bio, BIO_TRACE_COMPLETION);
|
2022-02-11 17:34:25 +08:00
|
|
|
|
|
|
|
if (req_op(req) == REQ_OP_ZONE_APPEND)
|
|
|
|
bio->bi_iter.bi_sector = req->__sector;
|
|
|
|
|
2021-12-02 06:01:51 +08:00
|
|
|
if (!is_flush)
|
|
|
|
bio_endio(bio);
|
|
|
|
bio = next;
|
|
|
|
} while (bio);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reset counters so that the request stacking driver
|
|
|
|
* can find how many bytes remain in the request
|
|
|
|
* later.
|
|
|
|
*/
|
2022-09-21 22:24:16 +08:00
|
|
|
if (!req->end_io) {
|
|
|
|
req->bio = NULL;
|
|
|
|
req->__data_len = 0;
|
|
|
|
}
|
2021-12-02 06:01:51 +08:00
|
|
|
}
|
|
|
|
|
2021-10-14 23:17:01 +08:00
|
|
|
/**
|
|
|
|
* blk_update_request - Complete multiple bytes without completing the request
|
|
|
|
* @req: the request being processed
|
|
|
|
* @error: block status code
|
|
|
|
* @nr_bytes: number of bytes to complete for @req
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Ends I/O on a number of bytes attached to @req, but doesn't complete
|
|
|
|
* the request structure even if @req doesn't have leftover.
|
|
|
|
* If @req has leftover, sets it up for the next range of segments.
|
|
|
|
*
|
|
|
|
* Passing the result of blk_rq_bytes() as @nr_bytes guarantees
|
|
|
|
* %false return from this function.
|
|
|
|
*
|
|
|
|
* Note:
|
|
|
|
* The RQF_SPECIAL_PAYLOAD flag is ignored on purpose in this function
|
|
|
|
* except in the consistency check at the end of this function.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* %false - this request doesn't have any more data
|
|
|
|
* %true - this request has more data
|
|
|
|
**/
|
|
|
|
bool blk_update_request(struct request *req, blk_status_t error,
|
|
|
|
unsigned int nr_bytes)
|
|
|
|
{
|
|
|
|
int total_bytes;
|
|
|
|
|
2021-10-18 16:45:18 +08:00
|
|
|
trace_block_rq_complete(req, error, nr_bytes);
|
2021-10-14 23:17:01 +08:00
|
|
|
|
|
|
|
if (!req->bio)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
|
|
|
|
error == BLK_STS_OK)
|
|
|
|
req->q->integrity.profile->complete_fn(req, nr_bytes);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
if (unlikely(error && !blk_rq_is_passthrough(req) &&
|
2022-03-24 00:38:15 +08:00
|
|
|
!(req->rq_flags & RQF_QUIET)) &&
|
|
|
|
!test_bit(GD_DEAD, &req->q->disk->state)) {
|
2021-10-14 23:17:01 +08:00
|
|
|
blk_print_req_error(req, error);
|
2022-02-11 06:52:22 +08:00
|
|
|
trace_block_rq_error(req, error, nr_bytes);
|
|
|
|
}
|
2021-10-14 23:17:01 +08:00
|
|
|
|
|
|
|
blk_account_io_completion(req, nr_bytes);
|
|
|
|
|
|
|
|
total_bytes = 0;
|
|
|
|
while (req->bio) {
|
|
|
|
struct bio *bio = req->bio;
|
|
|
|
unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
|
|
|
|
|
|
|
|
if (bio_bytes == bio->bi_iter.bi_size)
|
|
|
|
req->bio = bio->bi_next;
|
|
|
|
|
|
|
|
/* Completion has already been traced */
|
|
|
|
bio_clear_flag(bio, BIO_TRACE_COMPLETION);
|
|
|
|
req_bio_endio(req, bio, bio_bytes, error);
|
|
|
|
|
|
|
|
total_bytes += bio_bytes;
|
|
|
|
nr_bytes -= bio_bytes;
|
|
|
|
|
|
|
|
if (!nr_bytes)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* completely done
|
|
|
|
*/
|
|
|
|
if (!req->bio) {
|
|
|
|
/*
|
|
|
|
* Reset counters so that the request stacking driver
|
|
|
|
* can find how many bytes remain in the request
|
|
|
|
* later.
|
|
|
|
*/
|
|
|
|
req->__data_len = 0;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
req->__data_len -= total_bytes;
|
|
|
|
|
|
|
|
/* update sector only for requests with clear definition of sector */
|
|
|
|
if (!blk_rq_is_passthrough(req))
|
|
|
|
req->__sector += total_bytes >> 9;
|
|
|
|
|
|
|
|
/* mixed attributes always follow the first bio */
|
|
|
|
if (req->rq_flags & RQF_MIXED_MERGE) {
|
|
|
|
req->cmd_flags &= ~REQ_FAILFAST_MASK;
|
|
|
|
req->cmd_flags |= req->bio->bi_opf & REQ_FAILFAST_MASK;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!(req->rq_flags & RQF_SPECIAL_PAYLOAD)) {
|
|
|
|
/*
|
|
|
|
* If total number of sectors is less than the first segment
|
|
|
|
* size, something has gone terribly wrong.
|
|
|
|
*/
|
|
|
|
if (blk_rq_bytes(req) < blk_rq_cur_bytes(req)) {
|
|
|
|
blk_dump_rq_flags(req, "request botched");
|
|
|
|
req->__data_len = blk_rq_cur_bytes(req);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* recalculate the number of segments */
|
|
|
|
req->nr_phys_segments = blk_recalc_rq_segments(req);
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_update_request);
|
|
|
|
|
2021-11-17 14:14:01 +08:00
|
|
|
static void __blk_account_io_done(struct request *req, u64 now)
|
|
|
|
{
|
|
|
|
const int sgrp = op_stat_group(req_op(req));
|
|
|
|
|
|
|
|
part_stat_lock();
|
|
|
|
update_io_ticks(req->part, jiffies, true);
|
|
|
|
part_stat_inc(req->part, ios[sgrp]);
|
|
|
|
part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
|
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_account_io_done(struct request *req, u64 now)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Account IO completion. flush_rq isn't accounted as a
|
|
|
|
* normal IO on queueing nor completion. Accounting the
|
|
|
|
* containing request is enough.
|
|
|
|
*/
|
|
|
|
if (blk_do_io_stat(req) && req->part &&
|
|
|
|
!(req->rq_flags & RQF_FLUSH_SEQ))
|
|
|
|
__blk_account_io_done(req, now);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __blk_account_io_start(struct request *rq)
|
|
|
|
{
|
2022-03-08 13:51:47 +08:00
|
|
|
/*
|
|
|
|
* All non-passthrough requests are created from a bio with one
|
|
|
|
* exception: when a flush command that is part of a flush sequence
|
|
|
|
* generated by the state machine in blk-flush.c is cloned onto the
|
|
|
|
* lower device by dm-multipath we can get here without a bio.
|
|
|
|
*/
|
|
|
|
if (rq->bio)
|
2021-11-17 14:14:01 +08:00
|
|
|
rq->part = rq->bio->bi_bdev;
|
2022-03-08 13:51:47 +08:00
|
|
|
else
|
2021-11-26 20:18:00 +08:00
|
|
|
rq->part = rq->q->disk->part0;
|
2021-11-17 14:14:01 +08:00
|
|
|
|
|
|
|
part_stat_lock();
|
|
|
|
update_io_ticks(rq->part, jiffies, false);
|
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void blk_account_io_start(struct request *req)
|
|
|
|
{
|
|
|
|
if (blk_do_io_stat(req))
|
|
|
|
__blk_account_io_start(req);
|
|
|
|
}
|
|
|
|
|
2021-10-08 19:50:46 +08:00
|
|
|
static inline void __blk_mq_end_request_acct(struct request *rq, u64 now)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2018-05-09 17:08:52 +08:00
|
|
|
if (rq->rq_flags & RQF_STATS) {
|
|
|
|
blk_mq_poll_stats_start(rq->q);
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
|
|
|
blk_stat_add(rq, now);
|
2018-05-09 17:08:52 +08:00
|
|
|
}
|
|
|
|
|
2020-07-04 15:28:21 +08:00
|
|
|
blk_mq_sched_completed_request(rq, now);
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
|
|
|
blk_account_io_done(rq, now);
|
2021-10-08 19:50:46 +08:00
|
|
|
}
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
|
|
|
|
2021-10-08 19:50:46 +08:00
|
|
|
inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
|
|
|
|
{
|
|
|
|
if (blk_mq_need_time_stamp(rq))
|
|
|
|
__blk_mq_end_request_acct(rq, ktime_get_ns());
|
2013-12-06 01:50:39 +08:00
|
|
|
|
2014-04-16 15:44:53 +08:00
|
|
|
if (rq->end_io) {
|
2018-07-03 23:32:35 +08:00
|
|
|
rq_qos_done(rq->q, rq);
|
2022-09-22 05:19:54 +08:00
|
|
|
if (rq->end_io(rq, error) == RQ_END_IO_FREE)
|
|
|
|
blk_mq_free_request(rq);
|
2014-04-16 15:44:53 +08:00
|
|
|
} else {
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
blk_mq_free_request(rq);
|
2014-04-16 15:44:53 +08:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-09-14 07:40:10 +08:00
|
|
|
EXPORT_SYMBOL(__blk_mq_end_request);
|
2014-04-16 15:44:52 +08:00
|
|
|
|
2017-06-03 15:38:04 +08:00
|
|
|
void blk_mq_end_request(struct request *rq, blk_status_t error)
|
2014-04-16 15:44:52 +08:00
|
|
|
{
|
|
|
|
if (blk_update_request(rq, error, blk_rq_bytes(rq)))
|
|
|
|
BUG();
|
2014-09-14 07:40:10 +08:00
|
|
|
__blk_mq_end_request(rq, error);
|
2014-04-16 15:44:52 +08:00
|
|
|
}
|
2014-09-14 07:40:10 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_end_request);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-10-08 19:50:46 +08:00
|
|
|
#define TAG_COMP_BATCH 32
|
|
|
|
|
|
|
|
static inline void blk_mq_flush_tag_batch(struct blk_mq_hw_ctx *hctx,
|
|
|
|
int *tag_array, int nr_tags)
|
|
|
|
{
|
|
|
|
struct request_queue *q = hctx->queue;
|
|
|
|
|
2021-11-02 23:36:19 +08:00
|
|
|
/*
|
|
|
|
* All requests should have been marked as RQF_MQ_INFLIGHT, so
|
|
|
|
* update hctx->nr_active in batch
|
|
|
|
*/
|
|
|
|
if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
|
|
|
|
__blk_mq_sub_active_requests(hctx, nr_tags);
|
|
|
|
|
2021-10-08 19:50:46 +08:00
|
|
|
blk_mq_put_tags(hctx->tags, tag_array, nr_tags);
|
|
|
|
percpu_ref_put_many(&q->q_usage_counter, nr_tags);
|
|
|
|
}
|
|
|
|
|
|
|
|
void blk_mq_end_request_batch(struct io_comp_batch *iob)
|
|
|
|
{
|
|
|
|
int tags[TAG_COMP_BATCH], nr_tags = 0;
|
2021-10-29 02:08:34 +08:00
|
|
|
struct blk_mq_hw_ctx *cur_hctx = NULL;
|
2021-10-08 19:50:46 +08:00
|
|
|
struct request *rq;
|
|
|
|
u64 now = 0;
|
|
|
|
|
|
|
|
if (iob->need_ts)
|
|
|
|
now = ktime_get_ns();
|
|
|
|
|
|
|
|
while ((rq = rq_list_pop(&iob->req_list)) != NULL) {
|
|
|
|
prefetch(rq->bio);
|
|
|
|
prefetch(rq->rq_next);
|
|
|
|
|
2021-12-02 06:01:51 +08:00
|
|
|
blk_complete_request(rq);
|
2021-10-08 19:50:46 +08:00
|
|
|
if (iob->need_ts)
|
|
|
|
__blk_mq_end_request_acct(rq, now);
|
|
|
|
|
2021-11-27 00:53:23 +08:00
|
|
|
rq_qos_done(rq->q, rq);
|
|
|
|
|
2022-09-21 22:24:16 +08:00
|
|
|
/*
|
|
|
|
* If end_io handler returns NONE, then it still has
|
|
|
|
* ownership of the request.
|
|
|
|
*/
|
|
|
|
if (rq->end_io && rq->end_io(rq, 0) == RQ_END_IO_NONE)
|
|
|
|
continue;
|
|
|
|
|
2021-10-08 19:50:46 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
2021-10-15 04:39:59 +08:00
|
|
|
if (!req_ref_put_and_test(rq))
|
2021-10-08 19:50:46 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
blk_crypto_free_request(rq);
|
|
|
|
blk_pm_mark_last_busy(rq);
|
|
|
|
|
2021-10-29 02:08:34 +08:00
|
|
|
if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
|
|
|
|
if (cur_hctx)
|
|
|
|
blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
|
2021-10-08 19:50:46 +08:00
|
|
|
nr_tags = 0;
|
2021-10-29 02:08:34 +08:00
|
|
|
cur_hctx = rq->mq_hctx;
|
2021-10-08 19:50:46 +08:00
|
|
|
}
|
|
|
|
tags[nr_tags++] = rq->tag;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nr_tags)
|
2021-10-29 02:08:34 +08:00
|
|
|
blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
|
2021-10-08 19:50:46 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);
|
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
static void blk_complete_reqs(struct llist_head *list)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-01-24 04:10:27 +08:00
|
|
|
struct llist_node *entry = llist_reverse_order(llist_del_all(list));
|
|
|
|
struct request *rq, *next;
|
2020-06-11 14:44:41 +08:00
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
llist_for_each_entry_safe(rq, next, entry, ipi_list)
|
2020-06-11 14:44:41 +08:00
|
|
|
rq->q->mq_ops->complete(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
static __latent_entropy void blk_done_softirq(struct softirq_action *h)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-01-24 04:10:27 +08:00
|
|
|
blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
|
2020-06-11 14:44:42 +08:00
|
|
|
}
|
|
|
|
|
2020-06-11 14:44:41 +08:00
|
|
|
static int blk_softirq_cpu_dead(unsigned int cpu)
|
|
|
|
{
|
2021-01-24 04:10:27 +08:00
|
|
|
blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
|
2020-06-11 14:44:41 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-06-11 14:44:50 +08:00
|
|
|
static void __blk_mq_complete_request_remote(void *data)
|
2020-06-11 14:44:41 +08:00
|
|
|
{
|
2021-01-24 04:10:27 +08:00
|
|
|
__raise_softirq_irqoff(BLOCK_SOFTIRQ);
|
2020-06-11 14:44:41 +08:00
|
|
|
}
|
|
|
|
|
2020-06-11 14:44:49 +08:00
|
|
|
static inline bool blk_mq_complete_need_ipi(struct request *rq)
|
|
|
|
{
|
|
|
|
int cpu = raw_smp_processor_id();
|
|
|
|
|
|
|
|
if (!IS_ENABLED(CONFIG_SMP) ||
|
|
|
|
!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
|
|
|
|
return false;
|
2020-12-05 03:13:54 +08:00
|
|
|
/*
|
|
|
|
* With force threaded interrupts enabled, raising softirq from an SMP
|
|
|
|
* function call will always result in waking the ksoftirqd thread.
|
|
|
|
* This is probably worse than completing the request on a different
|
|
|
|
* cache domain.
|
|
|
|
*/
|
2021-06-03 02:03:38 +08:00
|
|
|
if (force_irqthreads())
|
2020-12-05 03:13:54 +08:00
|
|
|
return false;
|
2020-06-11 14:44:49 +08:00
|
|
|
|
|
|
|
/* same CPU or cache domain? Complete locally */
|
|
|
|
if (cpu == rq->mq_ctx->cpu ||
|
|
|
|
(!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
|
|
|
|
cpus_share_cache(cpu, rq->mq_ctx->cpu)))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* don't try to IPI to an offline CPU */
|
|
|
|
return cpu_online(rq->mq_ctx->cpu);
|
|
|
|
}
|
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
static void blk_mq_complete_send_ipi(struct request *rq)
|
|
|
|
{
|
|
|
|
struct llist_head *list;
|
|
|
|
unsigned int cpu;
|
|
|
|
|
|
|
|
cpu = rq->mq_ctx->cpu;
|
|
|
|
list = &per_cpu(blk_cpu_done, cpu);
|
|
|
|
if (llist_add(&rq->ipi_list, list)) {
|
|
|
|
INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
|
|
|
|
smp_call_function_single_async(cpu, &rq->csd);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_raise_softirq(struct request *rq)
|
|
|
|
{
|
|
|
|
struct llist_head *list;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
list = this_cpu_ptr(&blk_cpu_done);
|
|
|
|
if (llist_add(&rq->ipi_list, list))
|
|
|
|
raise_softirq(BLOCK_SOFTIRQ);
|
|
|
|
preempt_enable();
|
|
|
|
}
|
|
|
|
|
2020-06-11 14:44:50 +08:00
|
|
|
bool blk_mq_complete_request_remote(struct request *rq)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2018-11-27 00:54:30 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
|
2018-09-28 16:42:20 +08:00
|
|
|
|
2018-11-19 07:15:35 +08:00
|
|
|
/*
|
2022-09-21 11:32:03 +08:00
|
|
|
* For request which hctx has only one ctx mapping,
|
|
|
|
* or a polled request, always complete locally,
|
|
|
|
* it's pointless to redirect the completion.
|
2018-11-19 07:15:35 +08:00
|
|
|
*/
|
2022-09-21 11:32:03 +08:00
|
|
|
if (rq->mq_hctx->nr_ctx == 1 ||
|
|
|
|
rq->cmd_flags & REQ_POLLED)
|
2020-06-11 14:44:50 +08:00
|
|
|
return false;
|
2014-04-25 17:32:53 +08:00
|
|
|
|
2020-06-11 14:44:49 +08:00
|
|
|
if (blk_mq_complete_need_ipi(rq)) {
|
2021-01-24 04:10:27 +08:00
|
|
|
blk_mq_complete_send_ipi(rq);
|
|
|
|
return true;
|
2014-01-09 01:33:37 +08:00
|
|
|
}
|
2020-06-11 14:44:50 +08:00
|
|
|
|
2021-01-24 04:10:27 +08:00
|
|
|
if (rq->q->nr_hw_queues == 1) {
|
|
|
|
blk_mq_raise_softirq(rq);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
2020-06-11 14:44:50 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_mq_complete_request - end I/O on a request
|
|
|
|
* @rq: the request being processed
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Complete a request by scheduling the ->complete_rq operation.
|
|
|
|
**/
|
|
|
|
void blk_mq_complete_request(struct request *rq)
|
|
|
|
{
|
|
|
|
if (!blk_mq_complete_request_remote(rq))
|
|
|
|
rq->q->mq_ops->complete(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2020-06-11 14:44:47 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_complete_request);
|
2014-02-10 19:24:38 +08:00
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_start_request - Start processing a request
|
|
|
|
* @rq: Pointer to request to be started
|
|
|
|
*
|
|
|
|
* Function used by device drivers to notify the block layer that a request
|
|
|
|
* is going to be processed now, so blk layer can do proper initializations
|
|
|
|
* such as starting the timeout timer.
|
|
|
|
*/
|
2014-09-14 07:40:09 +08:00
|
|
|
void blk_mq_start_request(struct request *rq)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2020-12-04 00:21:39 +08:00
|
|
|
trace_block_rq_issue(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-11-08 12:32:37 +08:00
|
|
|
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
|
2022-04-28 03:49:12 +08:00
|
|
|
rq->io_start_time_ns = ktime_get_ns();
|
2019-05-21 15:59:03 +08:00
|
|
|
rq->stats_sectors = blk_rq_sectors(rq);
|
2016-11-08 12:32:37 +08:00
|
|
|
rq->rq_flags |= RQF_STATS;
|
2018-07-03 23:32:35 +08:00
|
|
|
rq_qos_issue(q, rq);
|
2016-11-08 12:32:37 +08:00
|
|
|
}
|
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
|
2014-09-17 00:37:37 +08:00
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
blk_add_timer(rq);
|
2018-05-29 21:52:28 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
|
2014-02-12 00:27:14 +08:00
|
|
|
|
2019-09-16 23:44:29 +08:00
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE)
|
|
|
|
q->integrity.profile->prepare_fn(rq);
|
|
|
|
#endif
|
2021-10-12 19:12:24 +08:00
|
|
|
if (rq->bio && rq->bio->bi_opf & REQ_POLLED)
|
|
|
|
WRITE_ONCE(rq->bio->bi_cookie, blk_rq_to_qc(rq));
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-09-14 07:40:09 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_start_request);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2022-05-12 22:00:10 +08:00
|
|
|
/*
|
|
|
|
* Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
|
|
|
|
* queues. This is important for md arrays to benefit from merging
|
|
|
|
* requests.
|
|
|
|
*/
|
|
|
|
static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
|
|
|
|
{
|
|
|
|
if (plug->multiple_queues)
|
|
|
|
return BLK_MAX_REQUEST_COUNT * 2;
|
|
|
|
return BLK_MAX_REQUEST_COUNT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
|
|
|
|
{
|
|
|
|
struct request *last = rq_list_peek(&plug->mq_list);
|
|
|
|
|
|
|
|
if (!plug->rq_count) {
|
|
|
|
trace_block_plug(rq->q);
|
|
|
|
} else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
|
|
|
|
(!blk_queue_nomerges(rq->q) &&
|
|
|
|
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
|
|
|
|
blk_mq_flush_plug_list(plug, false);
|
2022-11-01 08:54:13 +08:00
|
|
|
last = NULL;
|
2022-05-12 22:00:10 +08:00
|
|
|
trace_block_plug(rq->q);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!plug->multiple_queues && last && last->q != rq->q)
|
|
|
|
plug->multiple_queues = true;
|
|
|
|
if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
|
|
|
|
plug->has_elevator = true;
|
|
|
|
rq->rq_next = NULL;
|
|
|
|
rq_list_add(&plug->mq_list, rq);
|
|
|
|
plug->rq_count++;
|
|
|
|
}
|
|
|
|
|
2021-11-17 14:13:56 +08:00
|
|
|
/**
|
|
|
|
* blk_execute_rq_nowait - insert a request to I/O scheduler for execution
|
|
|
|
* @rq: request to insert
|
|
|
|
* @at_head: insert request at head or tail of queue
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Insert a fully prepared request at the back of the I/O scheduler queue
|
|
|
|
* for execution. Don't wait for completion.
|
|
|
|
*
|
|
|
|
* Note:
|
|
|
|
* This function will invoke @done directly if the queue is dead.
|
|
|
|
*/
|
2022-05-24 20:15:30 +08:00
|
|
|
void blk_execute_rq_nowait(struct request *rq, bool at_head)
|
2021-11-17 14:13:56 +08:00
|
|
|
{
|
2022-05-24 20:15:28 +08:00
|
|
|
WARN_ON(irqs_disabled());
|
|
|
|
WARN_ON(!blk_rq_is_passthrough(rq));
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:28 +08:00
|
|
|
blk_account_io_start(rq);
|
2022-09-29 22:41:41 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* As plugging can be enabled for passthrough requests on a zoned
|
|
|
|
* device, directly accessing the plug instead of using blk_mq_plug()
|
|
|
|
* should not have any consequences.
|
|
|
|
*/
|
2022-05-24 20:15:28 +08:00
|
|
|
if (current->plug)
|
|
|
|
blk_add_rq_to_plug(current->plug, rq);
|
|
|
|
else
|
|
|
|
blk_mq_sched_insert_request(rq, at_head, true, false);
|
2021-11-17 14:13:56 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
|
|
|
|
|
2022-05-24 20:15:29 +08:00
|
|
|
struct blk_rq_wait {
|
|
|
|
struct completion done;
|
|
|
|
blk_status_t ret;
|
|
|
|
};
|
|
|
|
|
2022-09-22 05:19:54 +08:00
|
|
|
static enum rq_end_io_ret blk_end_sync_rq(struct request *rq, blk_status_t ret)
|
2022-05-24 20:15:29 +08:00
|
|
|
{
|
|
|
|
struct blk_rq_wait *wait = rq->end_io_data;
|
|
|
|
|
|
|
|
wait->ret = ret;
|
|
|
|
complete(&wait->done);
|
2022-09-22 05:19:54 +08:00
|
|
|
return RQ_END_IO_NONE;
|
2022-05-24 20:15:29 +08:00
|
|
|
}
|
|
|
|
|
2022-08-24 00:14:42 +08:00
|
|
|
bool blk_rq_is_poll(struct request *rq)
|
2021-11-17 14:13:56 +08:00
|
|
|
{
|
|
|
|
if (!rq->mq_hctx)
|
|
|
|
return false;
|
|
|
|
if (rq->mq_hctx->type != HCTX_TYPE_POLL)
|
|
|
|
return false;
|
|
|
|
if (WARN_ON_ONCE(!rq->bio))
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
2022-08-24 00:14:42 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_is_poll);
|
2021-11-17 14:13:56 +08:00
|
|
|
|
|
|
|
static void blk_rq_poll_completion(struct request *rq, struct completion *wait)
|
|
|
|
{
|
|
|
|
do {
|
|
|
|
bio_poll(rq->bio, NULL, 0);
|
|
|
|
cond_resched();
|
|
|
|
} while (!completion_done(wait));
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_execute_rq - insert a request into queue for execution
|
|
|
|
* @rq: request to insert
|
|
|
|
* @at_head: insert request at head or tail of queue
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Insert a fully prepared request at the back of the I/O scheduler queue
|
|
|
|
* for execution and wait for completion.
|
|
|
|
* Return: The blk_status_t result provided to blk_mq_end_request().
|
|
|
|
*/
|
2021-11-26 20:18:01 +08:00
|
|
|
blk_status_t blk_execute_rq(struct request *rq, bool at_head)
|
2021-11-17 14:13:56 +08:00
|
|
|
{
|
2022-05-24 20:15:29 +08:00
|
|
|
struct blk_rq_wait wait = {
|
|
|
|
.done = COMPLETION_INITIALIZER_ONSTACK(wait.done),
|
|
|
|
};
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:28 +08:00
|
|
|
WARN_ON(irqs_disabled());
|
|
|
|
WARN_ON(!blk_rq_is_passthrough(rq));
|
2021-11-17 14:13:56 +08:00
|
|
|
|
|
|
|
rq->end_io_data = &wait;
|
2022-05-24 20:15:28 +08:00
|
|
|
rq->end_io = blk_end_sync_rq;
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:28 +08:00
|
|
|
blk_account_io_start(rq);
|
|
|
|
blk_mq_sched_insert_request(rq, at_head, true, false);
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:28 +08:00
|
|
|
if (blk_rq_is_poll(rq)) {
|
2022-05-24 20:15:29 +08:00
|
|
|
blk_rq_poll_completion(rq, &wait.done);
|
2022-05-24 20:15:28 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Prevent hang_check timer from firing at us during very long
|
|
|
|
* I/O
|
|
|
|
*/
|
|
|
|
unsigned long hang_check = sysctl_hung_task_timeout_secs;
|
|
|
|
|
|
|
|
if (hang_check)
|
2022-05-24 20:15:29 +08:00
|
|
|
while (!wait_for_completion_io_timeout(&wait.done,
|
2022-05-24 20:15:28 +08:00
|
|
|
hang_check * (HZ/2)))
|
|
|
|
;
|
|
|
|
else
|
2022-05-24 20:15:29 +08:00
|
|
|
wait_for_completion_io(&wait.done);
|
2022-05-24 20:15:28 +08:00
|
|
|
}
|
2021-11-17 14:13:56 +08:00
|
|
|
|
2022-05-24 20:15:29 +08:00
|
|
|
return wait.ret;
|
2021-11-17 14:13:56 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_execute_rq);
|
|
|
|
|
2014-04-16 15:44:57 +08:00
|
|
|
static void __blk_mq_requeue_request(struct request *rq)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2017-11-02 23:24:38 +08:00
|
|
|
blk_mq_put_driver_tag(rq);
|
|
|
|
|
2020-12-04 00:21:39 +08:00
|
|
|
trace_block_rq_requeue(rq);
|
2018-07-03 23:32:35 +08:00
|
|
|
rq_qos_requeue(q, rq);
|
2014-02-12 00:27:14 +08:00
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
if (blk_mq_request_started(rq)) {
|
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
2018-06-14 19:58:45 +08:00
|
|
|
rq->rq_flags &= ~RQF_TIMED_OUT;
|
2014-09-14 07:40:09 +08:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2016-10-29 08:21:41 +08:00
|
|
|
void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
|
2014-04-16 15:44:57 +08:00
|
|
|
{
|
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
|
2018-02-23 23:36:56 +08:00
|
|
|
/* this request will be re-inserted to io scheduler queue */
|
|
|
|
blk_mq_sched_requeue_request(rq);
|
|
|
|
|
2016-10-29 08:21:41 +08:00
|
|
|
blk_mq_add_to_requeue_list(rq, true, kick_requeue_list);
|
2014-04-16 15:44:57 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_requeue_request);
|
|
|
|
|
2014-05-28 22:08:02 +08:00
|
|
|
static void blk_mq_requeue_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct request_queue *q =
|
2016-09-15 01:28:30 +08:00
|
|
|
container_of(work, struct request_queue, requeue_work.work);
|
2014-05-28 22:08:02 +08:00
|
|
|
LIST_HEAD(rq_list);
|
|
|
|
struct request *rq, *next;
|
|
|
|
|
2017-07-27 22:03:57 +08:00
|
|
|
spin_lock_irq(&q->requeue_lock);
|
2014-05-28 22:08:02 +08:00
|
|
|
list_splice_init(&q->requeue_list, &rq_list);
|
2017-07-27 22:03:57 +08:00
|
|
|
spin_unlock_irq(&q->requeue_lock);
|
2014-05-28 22:08:02 +08:00
|
|
|
|
|
|
|
list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
|
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-12 09:56:25 +08:00
|
|
|
if (!(rq->rq_flags & (RQF_SOFTBARRIER | RQF_DONTPREP)))
|
2014-05-28 22:08:02 +08:00
|
|
|
continue;
|
|
|
|
|
2016-10-20 21:12:13 +08:00
|
|
|
rq->rq_flags &= ~RQF_SOFTBARRIER;
|
2014-05-28 22:08:02 +08:00
|
|
|
list_del_init(&rq->queuelist);
|
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-12 09:56:25 +08:00
|
|
|
/*
|
|
|
|
* If RQF_DONTPREP, rq has contained some driver specific
|
|
|
|
* data, so insert it to hctx dispatch list to avoid any
|
|
|
|
* merge.
|
|
|
|
*/
|
|
|
|
if (rq->rq_flags & RQF_DONTPREP)
|
2020-02-25 09:04:32 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, false, false);
|
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-12 09:56:25 +08:00
|
|
|
else
|
|
|
|
blk_mq_sched_insert_request(rq, true, false, false);
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
while (!list_empty(&rq_list)) {
|
|
|
|
rq = list_entry(rq_list.next, struct request, queuelist);
|
|
|
|
list_del_init(&rq->queuelist);
|
2018-01-18 00:25:58 +08:00
|
|
|
blk_mq_sched_insert_request(rq, false, false, false);
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
|
|
|
|
2016-10-29 08:20:32 +08:00
|
|
|
blk_mq_run_hw_queues(q, false);
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
|
|
|
|
2016-10-29 08:21:41 +08:00
|
|
|
void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
|
|
|
|
bool kick_requeue_list)
|
2014-05-28 22:08:02 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We abuse this flag that is otherwise used by the I/O scheduler to
|
2017-11-11 13:05:12 +08:00
|
|
|
* request head insertion from the workqueue.
|
2014-05-28 22:08:02 +08:00
|
|
|
*/
|
2016-10-20 21:12:13 +08:00
|
|
|
BUG_ON(rq->rq_flags & RQF_SOFTBARRIER);
|
2014-05-28 22:08:02 +08:00
|
|
|
|
|
|
|
spin_lock_irqsave(&q->requeue_lock, flags);
|
|
|
|
if (at_head) {
|
2016-10-20 21:12:13 +08:00
|
|
|
rq->rq_flags |= RQF_SOFTBARRIER;
|
2014-05-28 22:08:02 +08:00
|
|
|
list_add(&rq->queuelist, &q->requeue_list);
|
|
|
|
} else {
|
|
|
|
list_add_tail(&rq->queuelist, &q->requeue_list);
|
|
|
|
}
|
|
|
|
spin_unlock_irqrestore(&q->requeue_lock, flags);
|
2016-10-29 08:21:41 +08:00
|
|
|
|
|
|
|
if (kick_requeue_list)
|
|
|
|
blk_mq_kick_requeue_list(q);
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void blk_mq_kick_requeue_list(struct request_queue *q)
|
|
|
|
{
|
2018-01-20 00:58:55 +08:00
|
|
|
kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
|
2014-05-28 22:08:02 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_kick_requeue_list);
|
|
|
|
|
2016-09-15 01:28:30 +08:00
|
|
|
void blk_mq_delay_kick_requeue_list(struct request_queue *q,
|
|
|
|
unsigned long msecs)
|
|
|
|
{
|
2017-08-10 02:28:06 +08:00
|
|
|
kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work,
|
|
|
|
msecs_to_jiffies(msecs));
|
2016-09-15 01:28:30 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
|
|
|
|
|
2022-07-06 20:03:53 +08:00
|
|
|
static bool blk_mq_rq_inflight(struct request *rq, void *priv)
|
2018-11-09 00:03:51 +08:00
|
|
|
{
|
|
|
|
/*
|
2021-12-06 20:49:48 +08:00
|
|
|
* If we find a request that isn't idle we know the queue is busy
|
|
|
|
* as it's checked in the iter.
|
|
|
|
* Return false to stop the iteration.
|
2018-11-09 00:03:51 +08:00
|
|
|
*/
|
2021-12-06 20:49:48 +08:00
|
|
|
if (blk_mq_request_started(rq)) {
|
2018-11-09 00:03:51 +08:00
|
|
|
bool *busy = priv;
|
|
|
|
|
|
|
|
*busy = true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2018-12-18 12:11:17 +08:00
|
|
|
bool blk_mq_queue_inflight(struct request_queue *q)
|
2018-11-09 00:03:51 +08:00
|
|
|
{
|
|
|
|
bool busy = false;
|
|
|
|
|
2018-12-18 12:11:17 +08:00
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
|
2018-11-09 00:03:51 +08:00
|
|
|
return busy;
|
|
|
|
}
|
2018-12-18 12:11:17 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);
|
2018-11-09 00:03:51 +08:00
|
|
|
|
2022-07-06 20:03:51 +08:00
|
|
|
static void blk_mq_rq_timed_out(struct request *req)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2018-06-14 19:58:45 +08:00
|
|
|
req->rq_flags |= RQF_TIMED_OUT;
|
2018-05-29 21:52:39 +08:00
|
|
|
if (req->q->mq_ops->timeout) {
|
|
|
|
enum blk_eh_timer_return ret;
|
|
|
|
|
2022-07-06 20:03:51 +08:00
|
|
|
ret = req->q->mq_ops->timeout(req);
|
2018-05-29 21:52:39 +08:00
|
|
|
if (ret == BLK_EH_DONE)
|
|
|
|
return;
|
|
|
|
WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
|
2014-09-14 07:40:12 +08:00
|
|
|
}
|
2018-05-29 21:52:39 +08:00
|
|
|
|
|
|
|
blk_add_timer(req);
|
2014-04-24 22:51:47 +08:00
|
|
|
}
|
2015-01-08 09:55:46 +08:00
|
|
|
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
struct blk_expired_data {
|
|
|
|
bool has_timedout_rq;
|
|
|
|
unsigned long next;
|
|
|
|
unsigned long timeout_start;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool blk_mq_req_expired(struct request *rq, struct blk_expired_data *expired)
|
2014-09-14 07:40:11 +08:00
|
|
|
{
|
2018-05-29 21:52:28 +08:00
|
|
|
unsigned long deadline;
|
2014-04-24 22:51:47 +08:00
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
|
|
|
|
return false;
|
2018-06-14 19:58:45 +08:00
|
|
|
if (rq->rq_flags & RQF_TIMED_OUT)
|
|
|
|
return false;
|
2017-09-06 16:00:22 +08:00
|
|
|
|
2018-11-15 00:02:05 +08:00
|
|
|
deadline = READ_ONCE(rq->deadline);
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
if (time_after_eq(expired->timeout_start, deadline))
|
2018-05-29 21:52:28 +08:00
|
|
|
return true;
|
2017-09-06 16:00:22 +08:00
|
|
|
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
if (expired->next == 0)
|
|
|
|
expired->next = deadline;
|
|
|
|
else if (time_after(expired->next, deadline))
|
|
|
|
expired->next = deadline;
|
2018-05-29 21:52:28 +08:00
|
|
|
return false;
|
2014-04-24 22:51:47 +08:00
|
|
|
}
|
|
|
|
|
2021-05-11 23:22:34 +08:00
|
|
|
void blk_mq_put_rq_ref(struct request *rq)
|
|
|
|
{
|
2022-09-22 05:19:54 +08:00
|
|
|
if (is_flush_rq(rq)) {
|
|
|
|
if (rq->end_io(rq, 0) == RQ_END_IO_FREE)
|
|
|
|
blk_mq_free_request(rq);
|
|
|
|
} else if (req_ref_put_and_test(rq)) {
|
2021-05-11 23:22:34 +08:00
|
|
|
__blk_mq_free_request(rq);
|
2022-09-22 05:19:54 +08:00
|
|
|
}
|
2021-05-11 23:22:34 +08:00
|
|
|
}
|
|
|
|
|
2022-07-06 20:03:53 +08:00
|
|
|
static bool blk_mq_check_expired(struct request *rq, void *priv)
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
{
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
struct blk_expired_data *expired = priv;
|
2018-05-29 21:52:28 +08:00
|
|
|
|
|
|
|
/*
|
2021-08-11 23:52:02 +08:00
|
|
|
* blk_mq_queue_tag_busy_iter() has locked the request, so it cannot
|
|
|
|
* be reallocated underneath the timeout handler's processing, then
|
|
|
|
* the expire check is reliable. If the request is not expired, then
|
|
|
|
* it was completed and reallocated as a new request after returning
|
|
|
|
* from blk_mq_check_expired().
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
*/
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
if (blk_mq_req_expired(rq, expired)) {
|
|
|
|
expired->has_timedout_rq = true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool blk_mq_handle_expired(struct request *rq, void *priv)
|
|
|
|
{
|
|
|
|
struct blk_expired_data *expired = priv;
|
|
|
|
|
|
|
|
if (blk_mq_req_expired(rq, expired))
|
2022-07-06 20:03:51 +08:00
|
|
|
blk_mq_rq_timed_out(rq);
|
2018-11-09 01:24:07 +08:00
|
|
|
return true;
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
}
|
|
|
|
|
2015-10-30 20:57:30 +08:00
|
|
|
static void blk_mq_timeout_work(struct work_struct *work)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2015-10-30 20:57:30 +08:00
|
|
|
struct request_queue *q =
|
|
|
|
container_of(work, struct request_queue, timeout_work);
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
struct blk_expired_data expired = {
|
|
|
|
.timeout_start = jiffies,
|
|
|
|
};
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
blk-mq: Allow timeouts to run while queue is freezing
In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread. This puts
the request_queue in a hung state, forever waiting for the freeze
completion.
This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.
[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion. This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive. This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.
Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout. In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.
I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process. I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis. I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-01 22:23:39 +08:00
|
|
|
/* A deadlock might occur if a request is stuck requiring a
|
|
|
|
* timeout at the same time a queue freeze is waiting
|
|
|
|
* completion, since the timeout code would not be able to
|
|
|
|
* acquire the queue reference here.
|
|
|
|
*
|
|
|
|
* That's why we don't use blk_queue_enter here; instead, we use
|
|
|
|
* percpu_ref_tryget directly, because we need to be able to
|
|
|
|
* obtain a reference even in the short window between the queue
|
|
|
|
* starting to freeze, by dropping the first reference in
|
2017-03-27 20:06:57 +08:00
|
|
|
* blk_freeze_queue_start, and the moment the last request is
|
blk-mq: Allow timeouts to run while queue is freezing
In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread. This puts
the request_queue in a hung state, forever waiting for the freeze
completion.
This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.
[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion. This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive. This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.
Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout. In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.
I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process. I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis. I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-01 22:23:39 +08:00
|
|
|
* consumed, marked by the instant q_usage_counter reaches
|
|
|
|
* zero.
|
|
|
|
*/
|
|
|
|
if (!percpu_ref_tryget(&q->q_usage_counter))
|
2015-10-30 20:57:30 +08:00
|
|
|
return;
|
|
|
|
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
/* check if there is any timed-out request */
|
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
|
|
|
|
if (expired.has_timedout_rq) {
|
|
|
|
/*
|
|
|
|
* Before walking tags, we must ensure any submit started
|
|
|
|
* before the current time has finished. Since the submit
|
|
|
|
* uses srcu or rcu, wait for a synchronization point to
|
|
|
|
* ensure all running submits have finished
|
|
|
|
*/
|
2022-11-01 23:00:48 +08:00
|
|
|
blk_mq_wait_quiesce_done(q->tag_set);
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
|
|
|
|
expired.next = 0;
|
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_handle_expired, &expired);
|
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
blk-mq: avoid double ->queue_rq() because of early timeout
David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in VM use case because of long vmexit latency or preempt
latency of vCPU pthread or long page fault in vCPU pthread, then block
IO req could be timed out before queuing the request to hardware but after
calling blk_mq_start_request() during ->queue_rq(), then timeout handler
may handle it by requeue, then double ->queue_rq() is caused, and kernel
panic.
So far, it is driver's responsibility to cover the race between timeout
and completion, so it seems supposed to be solved in driver in theory,
given driver has enough knowledge.
But it is really one common problem, lots of driver could have similar
issue, and could be hard to fix all affected drivers, even it isn't easy
for driver to handle the race. So David suggests this patch by draining
in-progress ->queue_rq() for solving this issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 13:19:57 +08:00
|
|
|
if (expired.next != 0) {
|
|
|
|
mod_timer(&q->timeout, expired.next);
|
2014-05-14 05:10:52 +08:00
|
|
|
} else {
|
2018-01-11 00:33:33 +08:00
|
|
|
/*
|
|
|
|
* Request timeouts are handled as a forward rolling timer. If
|
|
|
|
* we end up here it means that no requests are pending and
|
|
|
|
* also that no request has been pending for a while. Mark
|
|
|
|
* each hctx as idle.
|
|
|
|
*/
|
2015-04-21 10:00:19 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
|
|
|
/* the hctx may be unmapped, so check it here */
|
|
|
|
if (blk_mq_hw_queue_mapped(hctx))
|
|
|
|
blk_mq_tag_idle(hctx);
|
|
|
|
}
|
2014-05-14 05:10:52 +08:00
|
|
|
}
|
2015-10-30 20:57:30 +08:00
|
|
|
blk_queue_exit(q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2016-09-17 22:38:44 +08:00
|
|
|
struct flush_busy_ctx_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct list_head *list;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
|
|
|
|
{
|
|
|
|
struct flush_busy_ctx_data *flush_data = data;
|
|
|
|
struct blk_mq_hw_ctx *hctx = flush_data->hctx;
|
|
|
|
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type = hctx->type;
|
2016-09-17 22:38:44 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
list_splice_tail_init(&ctx->rq_lists[type], flush_data->list);
|
2018-02-28 08:56:42 +08:00
|
|
|
sbitmap_clear_bit(sb, bitnr);
|
2016-09-17 22:38:44 +08:00
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2014-05-19 23:23:55 +08:00
|
|
|
/*
|
|
|
|
* Process software queues that have been marked busy, splicing them
|
|
|
|
* to the for-dispatch
|
|
|
|
*/
|
2016-12-15 05:34:47 +08:00
|
|
|
void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
|
2014-05-19 23:23:55 +08:00
|
|
|
{
|
2016-09-17 22:38:44 +08:00
|
|
|
struct flush_busy_ctx_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
.list = list,
|
|
|
|
};
|
2014-05-19 23:23:55 +08:00
|
|
|
|
2016-09-17 22:38:44 +08:00
|
|
|
sbitmap_for_each_set(&hctx->ctx_map, flush_busy_ctx, &data);
|
2014-05-19 23:23:55 +08:00
|
|
|
}
|
2016-12-15 05:34:47 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
|
2014-05-19 23:23:55 +08:00
|
|
|
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 17:22:30 +08:00
|
|
|
struct dispatch_rq_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct request *rq;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct dispatch_rq_data *dispatch_data = data;
|
|
|
|
struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
|
|
|
|
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type = hctx->type;
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 17:22:30 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
if (!list_empty(&ctx->rq_lists[type])) {
|
|
|
|
dispatch_data->rq = list_entry_rq(ctx->rq_lists[type].next);
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 17:22:30 +08:00
|
|
|
list_del_init(&dispatch_data->rq->queuelist);
|
2018-12-17 23:44:05 +08:00
|
|
|
if (list_empty(&ctx->rq_lists[type]))
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 17:22:30 +08:00
|
|
|
sbitmap_clear_bit(sb, bitnr);
|
|
|
|
}
|
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
|
|
|
|
return !dispatch_data->rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *start)
|
|
|
|
{
|
2018-10-30 03:13:29 +08:00
|
|
|
unsigned off = start ? start->index_hw[hctx->type] : 0;
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 17:22:30 +08:00
|
|
|
struct dispatch_rq_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
.rq = NULL,
|
|
|
|
};
|
|
|
|
|
|
|
|
__sbitmap_for_each_set(&hctx->ctx_map, off,
|
|
|
|
dispatch_rq_from_ctx, &data);
|
|
|
|
|
|
|
|
return data.rq;
|
|
|
|
}
|
|
|
|
|
2021-10-13 22:28:14 +08:00
|
|
|
static bool __blk_mq_alloc_driver_tag(struct request *rq)
|
2020-06-30 22:03:55 +08:00
|
|
|
{
|
2021-10-05 18:23:38 +08:00
|
|
|
struct sbitmap_queue *bt = &rq->mq_hctx->tags->bitmap_tags;
|
2020-06-30 22:03:55 +08:00
|
|
|
unsigned int tag_offset = rq->mq_hctx->tags->nr_reserved_tags;
|
|
|
|
int tag;
|
|
|
|
|
2020-07-06 22:41:11 +08:00
|
|
|
blk_mq_tag_busy(rq->mq_hctx);
|
|
|
|
|
2020-06-30 22:03:55 +08:00
|
|
|
if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
|
2021-10-05 18:23:38 +08:00
|
|
|
bt = &rq->mq_hctx->tags->breserved_tags;
|
2020-06-30 22:03:55 +08:00
|
|
|
tag_offset = 0;
|
2020-09-11 18:41:14 +08:00
|
|
|
} else {
|
|
|
|
if (!hctx_may_queue(rq->mq_hctx, bt))
|
|
|
|
return false;
|
2020-06-30 22:03:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
tag = __sbitmap_queue_get(bt);
|
|
|
|
if (tag == BLK_MQ_NO_TAG)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
rq->tag = tag + tag_offset;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2021-10-13 22:28:14 +08:00
|
|
|
bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq)
|
2020-06-30 22:03:55 +08:00
|
|
|
{
|
2021-10-13 22:28:14 +08:00
|
|
|
if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_alloc_driver_tag(rq))
|
2020-07-06 22:41:11 +08:00
|
|
|
return false;
|
|
|
|
|
2020-08-19 23:20:19 +08:00
|
|
|
if ((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) &&
|
2020-07-06 22:41:11 +08:00
|
|
|
!(rq->rq_flags & RQF_MQ_INFLIGHT)) {
|
|
|
|
rq->rq_flags |= RQF_MQ_INFLIGHT;
|
2020-08-19 23:20:26 +08:00
|
|
|
__blk_mq_inc_active_requests(hctx);
|
2020-07-06 22:41:11 +08:00
|
|
|
}
|
|
|
|
hctx->tags->rqs[rq->tag] = rq;
|
|
|
|
return true;
|
2020-06-30 22:03:55 +08:00
|
|
|
}
|
|
|
|
|
2017-11-09 23:32:43 +08:00
|
|
|
static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
|
|
|
|
int flags, void *key)
|
2017-02-23 02:58:29 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
|
|
|
|
hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
|
|
|
|
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_lock(&hctx->dispatch_wait_lock);
|
2019-03-26 02:34:10 +08:00
|
|
|
if (!list_empty(&wait->entry)) {
|
|
|
|
struct sbitmap_queue *sbq;
|
|
|
|
|
|
|
|
list_del_init(&wait->entry);
|
2021-10-05 18:23:38 +08:00
|
|
|
sbq = &hctx->tags->bitmap_tags;
|
2019-03-26 02:34:10 +08:00
|
|
|
atomic_dec(&sbq->ws_active);
|
|
|
|
}
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
|
2017-02-23 02:58:29 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2017-11-10 07:10:13 +08:00
|
|
|
/*
|
|
|
|
* Mark us waiting for a tag. For shared tags, this involves hooking us into
|
2018-01-10 02:09:15 +08:00
|
|
|
* the tag wakeups. For non-shared tags, we can simply mark us needing a
|
|
|
|
* restart. For both cases, take care to check the condition again after
|
2017-11-10 07:10:13 +08:00
|
|
|
* marking us as waiting.
|
|
|
|
*/
|
2018-06-25 19:31:46 +08:00
|
|
|
static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
|
2017-11-10 07:10:13 +08:00
|
|
|
struct request *rq)
|
2017-02-23 02:58:29 +08:00
|
|
|
{
|
2023-01-18 17:37:15 +08:00
|
|
|
struct sbitmap_queue *sbq;
|
2018-06-25 19:31:47 +08:00
|
|
|
struct wait_queue_head *wq;
|
2017-11-10 07:10:13 +08:00
|
|
|
wait_queue_entry_t *wait;
|
|
|
|
bool ret;
|
2017-02-23 02:58:29 +08:00
|
|
|
|
2023-01-18 17:37:16 +08:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) &&
|
|
|
|
!(blk_mq_is_shared_tags(hctx->flags))) {
|
2019-03-15 11:05:10 +08:00
|
|
|
blk_mq_sched_mark_restart_hctx(hctx);
|
2017-11-10 07:10:13 +08:00
|
|
|
|
2018-01-11 05:41:21 +08:00
|
|
|
/*
|
|
|
|
* It's possible that a tag was freed in the window between the
|
|
|
|
* allocation failure and adding the hardware queue to the wait
|
|
|
|
* queue.
|
|
|
|
*
|
|
|
|
* Don't clear RESTART here, someone else could have set it.
|
|
|
|
* At most this will cost an extra queue run.
|
|
|
|
*/
|
2018-06-25 19:31:45 +08:00
|
|
|
return blk_mq_get_driver_tag(rq);
|
2017-11-09 23:32:43 +08:00
|
|
|
}
|
|
|
|
|
2018-06-25 19:31:46 +08:00
|
|
|
wait = &hctx->dispatch_wait;
|
2018-01-11 05:41:21 +08:00
|
|
|
if (!list_empty_careful(&wait->entry))
|
|
|
|
return false;
|
|
|
|
|
2023-01-18 17:37:15 +08:00
|
|
|
if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag))
|
|
|
|
sbq = &hctx->tags->breserved_tags;
|
|
|
|
else
|
|
|
|
sbq = &hctx->tags->bitmap_tags;
|
2019-03-26 02:34:10 +08:00
|
|
|
wq = &bt_wait_ptr(sbq, hctx)->wait;
|
2018-06-25 19:31:47 +08:00
|
|
|
|
|
|
|
spin_lock_irq(&wq->lock);
|
|
|
|
spin_lock(&hctx->dispatch_wait_lock);
|
2018-01-11 05:41:21 +08:00
|
|
|
if (!list_empty(&wait->entry)) {
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-11 05:41:21 +08:00
|
|
|
return false;
|
2017-11-09 23:32:43 +08:00
|
|
|
}
|
|
|
|
|
2019-03-26 02:34:10 +08:00
|
|
|
atomic_inc(&sbq->ws_active);
|
2018-06-25 19:31:47 +08:00
|
|
|
wait->flags &= ~WQ_FLAG_EXCLUSIVE;
|
|
|
|
__add_wait_queue(wq, wait);
|
2018-01-11 05:41:21 +08:00
|
|
|
|
2017-02-23 02:58:29 +08:00
|
|
|
/*
|
2017-11-09 23:32:43 +08:00
|
|
|
* It's possible that a tag was freed in the window between the
|
|
|
|
* allocation failure and adding the hardware queue to the wait
|
|
|
|
* queue.
|
2017-02-23 02:58:29 +08:00
|
|
|
*/
|
2018-06-25 19:31:45 +08:00
|
|
|
ret = blk_mq_get_driver_tag(rq);
|
2018-01-11 05:41:21 +08:00
|
|
|
if (!ret) {
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-11 05:41:21 +08:00
|
|
|
return false;
|
2017-11-09 23:32:43 +08:00
|
|
|
}
|
2018-01-11 05:41:21 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We got a tag, remove ourselves from the wait queue to ensure
|
|
|
|
* someone else gets the wakeup.
|
|
|
|
*/
|
|
|
|
list_del_init(&wait->entry);
|
2019-03-26 02:34:10 +08:00
|
|
|
atomic_dec(&sbq->ws_active);
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-11 05:41:21 +08:00
|
|
|
|
|
|
|
return true;
|
2017-02-23 02:58:29 +08:00
|
|
|
}
|
|
|
|
|
2018-07-03 23:03:16 +08:00
|
|
|
#define BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT 8
|
|
|
|
#define BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR 4
|
|
|
|
/*
|
|
|
|
* Update dispatch busy with the Exponential Weighted Moving Average(EWMA):
|
|
|
|
* - EWMA is one simple way to compute running average value
|
|
|
|
* - weight(7/8 and 1/8) is applied so that it can decrease exponentially
|
|
|
|
* - take 4 as factor for avoiding to get too small(0) result, and this
|
|
|
|
* factor doesn't matter because EWMA decreases exponentially
|
|
|
|
*/
|
|
|
|
static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
|
|
|
|
{
|
|
|
|
unsigned int ewma;
|
|
|
|
|
|
|
|
ewma = hctx->dispatch_busy;
|
|
|
|
|
|
|
|
if (!ewma && !busy)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
|
|
|
|
if (busy)
|
|
|
|
ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
|
|
|
|
ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;
|
|
|
|
|
|
|
|
hctx->dispatch_busy = ewma;
|
|
|
|
}
|
|
|
|
|
2018-01-31 11:04:57 +08:00
|
|
|
#define BLK_MQ_RESOURCE_DELAY 3 /* ms units */
|
|
|
|
|
2020-03-24 23:24:44 +08:00
|
|
|
static void blk_mq_handle_dev_resource(struct request *rq,
|
|
|
|
struct list_head *list)
|
|
|
|
{
|
|
|
|
list_add(&rq->queuelist, list);
|
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
}
|
|
|
|
|
2020-05-12 16:55:47 +08:00
|
|
|
static void blk_mq_handle_zone_resource(struct request *rq,
|
|
|
|
struct list_head *zone_list)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If we end up here it is because we cannot dispatch a request to a
|
|
|
|
* specific zone due to LLD level zone-write locking or other zone
|
|
|
|
* related resource not being available. In this case, set the request
|
|
|
|
* aside in zone_list for retrying it later.
|
|
|
|
*/
|
|
|
|
list_add(&rq->queuelist, zone_list);
|
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
}
|
|
|
|
|
2020-06-30 18:24:58 +08:00
|
|
|
enum prep_dispatch {
|
|
|
|
PREP_DISPATCH_OK,
|
|
|
|
PREP_DISPATCH_NO_TAG,
|
|
|
|
PREP_DISPATCH_NO_BUDGET,
|
|
|
|
};
|
|
|
|
|
|
|
|
static enum prep_dispatch blk_mq_prep_dispatch_rq(struct request *rq,
|
|
|
|
bool need_budget)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2021-01-22 10:33:12 +08:00
|
|
|
int budget_token = -1;
|
2020-06-30 18:24:58 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
if (need_budget) {
|
|
|
|
budget_token = blk_mq_get_dispatch_budget(rq->q);
|
|
|
|
if (budget_token < 0) {
|
|
|
|
blk_mq_put_driver_tag(rq);
|
|
|
|
return PREP_DISPATCH_NO_BUDGET;
|
|
|
|
}
|
|
|
|
blk_mq_set_rq_budget_token(rq, budget_token);
|
2020-06-30 18:24:58 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!blk_mq_get_driver_tag(rq)) {
|
|
|
|
/*
|
|
|
|
* The initial allocation attempt failed, so we need to
|
|
|
|
* rerun the hardware queue when a tag is freed. The
|
|
|
|
* waitqueue takes care of that. If the queue is run
|
|
|
|
* before we add this entry back on the dispatch list,
|
|
|
|
* we'll re-run it below.
|
|
|
|
*/
|
|
|
|
if (!blk_mq_mark_tag_wait(hctx, rq)) {
|
2020-06-30 18:25:00 +08:00
|
|
|
/*
|
|
|
|
* All budgets not got from this function will be put
|
|
|
|
* together during handling partial dispatch
|
|
|
|
*/
|
|
|
|
if (need_budget)
|
2021-01-22 10:33:12 +08:00
|
|
|
blk_mq_put_dispatch_budget(rq->q, budget_token);
|
2020-06-30 18:24:58 +08:00
|
|
|
return PREP_DISPATCH_NO_TAG;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return PREP_DISPATCH_OK;
|
|
|
|
}
|
|
|
|
|
2020-06-30 18:25:00 +08:00
|
|
|
/* release all allocated budgets before calling to blk_mq_dispatch_rq_list */
|
|
|
|
static void blk_mq_release_budgets(struct request_queue *q,
|
2021-01-22 10:33:12 +08:00
|
|
|
struct list_head *list)
|
2020-06-30 18:25:00 +08:00
|
|
|
{
|
2021-01-22 10:33:12 +08:00
|
|
|
struct request *rq;
|
2020-06-30 18:25:00 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
list_for_each_entry(rq, list, queuelist) {
|
|
|
|
int budget_token = blk_mq_get_rq_budget_token(rq);
|
2020-06-30 18:25:00 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
if (budget_token >= 0)
|
|
|
|
blk_mq_put_dispatch_budget(q, budget_token);
|
|
|
|
}
|
2020-06-30 18:25:00 +08:00
|
|
|
}
|
|
|
|
|
2023-01-18 17:37:19 +08:00
|
|
|
/*
|
|
|
|
* blk_mq_commit_rqs will notify driver using bd->last that there is no
|
|
|
|
* more requests. (See comment in struct blk_mq_ops for commit_rqs for
|
|
|
|
* details)
|
|
|
|
* Attention, we should explicitly call this in unusual cases:
|
|
|
|
* 1) did not queue everything initially scheduled to queue
|
|
|
|
* 2) the last attempt to queue a request failed
|
|
|
|
*/
|
|
|
|
static void blk_mq_commit_rqs(struct blk_mq_hw_ctx *hctx, int queued,
|
|
|
|
bool from_schedule)
|
|
|
|
{
|
|
|
|
if (hctx->queue->mq_ops->commit_rqs && queued) {
|
|
|
|
trace_block_unplug(hctx->queue, queued, !from_schedule);
|
|
|
|
hctx->queue->mq_ops->commit_rqs(hctx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29 01:54:01 +08:00
|
|
|
/*
|
|
|
|
* Returns true if we did some work AND can potentially do more.
|
|
|
|
*/
|
2020-06-30 18:24:57 +08:00
|
|
|
bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
|
2020-06-30 18:25:00 +08:00
|
|
|
unsigned int nr_budgets)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2020-06-30 18:24:58 +08:00
|
|
|
enum prep_dispatch prep;
|
2020-06-30 18:24:57 +08:00
|
|
|
struct request_queue *q = hctx->queue;
|
2023-01-18 17:37:24 +08:00
|
|
|
struct request *rq;
|
2023-01-18 17:37:23 +08:00
|
|
|
int queued;
|
2018-01-31 11:04:57 +08:00
|
|
|
blk_status_t ret = BLK_STS_OK;
|
2020-05-12 16:55:47 +08:00
|
|
|
LIST_HEAD(zone_list);
|
2021-10-27 00:51:27 +08:00
|
|
|
bool needs_resource = false;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 22:56:26 +08:00
|
|
|
if (list_empty(list))
|
|
|
|
return false;
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* Now process all the entries, sending them to the driver.
|
|
|
|
*/
|
2023-01-18 17:37:23 +08:00
|
|
|
queued = 0;
|
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 22:56:26 +08:00
|
|
|
do {
|
2014-10-30 01:14:52 +08:00
|
|
|
struct blk_mq_queue_data bd;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-12-07 23:41:17 +08:00
|
|
|
rq = list_first_entry(list, struct request, queuelist);
|
blk-mq: order getting budget and driver tag
This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:
1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.
2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.
3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B
4) both two IO pathes can't move on, and IO stall is caused.
This issue can be observed when running dbench on USB storage.
This patch fixes this issue by always getting budget before getting
driver tag.
Cc: stable@vger.kernel.org
Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-05 00:35:21 +08:00
|
|
|
|
2020-06-30 18:24:57 +08:00
|
|
|
WARN_ON_ONCE(hctx != rq->mq_hctx);
|
2020-06-30 18:25:00 +08:00
|
|
|
prep = blk_mq_prep_dispatch_rq(rq, !nr_budgets);
|
2020-06-30 18:24:58 +08:00
|
|
|
if (prep != PREP_DISPATCH_OK)
|
blk-mq: order getting budget and driver tag
This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:
1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.
2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.
3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B
4) both two IO pathes can't move on, and IO stall is caused.
This issue can be observed when running dbench on USB storage.
This patch fixes this issue by always getting budget before getting
driver tag.
Cc: stable@vger.kernel.org
Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-05 00:35:21 +08:00
|
|
|
break;
|
2017-10-14 17:22:29 +08:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
list_del_init(&rq->queuelist);
|
|
|
|
|
2014-10-30 01:14:52 +08:00
|
|
|
bd.rq = rq;
|
2023-01-18 17:37:24 +08:00
|
|
|
bd.last = list_empty(list);
|
2014-10-30 01:14:52 +08:00
|
|
|
|
2020-06-30 18:25:00 +08:00
|
|
|
/*
|
|
|
|
* once the request is queued to lld, no need to cover the
|
|
|
|
* budget any more
|
|
|
|
*/
|
|
|
|
if (nr_budgets)
|
|
|
|
nr_budgets--;
|
2014-10-30 01:14:52 +08:00
|
|
|
ret = q->mq_ops->queue_rq(hctx, &bd);
|
2020-07-01 21:58:57 +08:00
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
|
|
|
queued++;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
break;
|
2020-07-01 21:58:57 +08:00
|
|
|
case BLK_STS_RESOURCE:
|
2021-10-27 00:51:27 +08:00
|
|
|
needs_resource = true;
|
|
|
|
fallthrough;
|
2020-07-01 21:58:57 +08:00
|
|
|
case BLK_STS_DEV_RESOURCE:
|
|
|
|
blk_mq_handle_dev_resource(rq, list);
|
|
|
|
goto out;
|
|
|
|
case BLK_STS_ZONE_RESOURCE:
|
2020-05-12 16:55:47 +08:00
|
|
|
/*
|
|
|
|
* Move the request to zone_list and keep going through
|
|
|
|
* the dispatch list to find more requests the drive can
|
|
|
|
* accept.
|
|
|
|
*/
|
|
|
|
blk_mq_handle_zone_resource(rq, &zone_list);
|
2021-10-27 00:51:27 +08:00
|
|
|
needs_resource = true;
|
2020-07-01 21:58:57 +08:00
|
|
|
break;
|
|
|
|
default:
|
2020-09-30 16:02:53 +08:00
|
|
|
blk_mq_end_request(rq, ret);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 22:56:26 +08:00
|
|
|
} while (!list_empty(list));
|
2020-07-01 21:58:57 +08:00
|
|
|
out:
|
2020-05-12 16:55:47 +08:00
|
|
|
if (!list_empty(&zone_list))
|
|
|
|
list_splice_tail_init(&zone_list, list);
|
|
|
|
|
2020-09-05 19:25:56 +08:00
|
|
|
/* If we didn't flush the entire list, we could have told the driver
|
|
|
|
* there was more coming, but that turned out to be a lie.
|
|
|
|
*/
|
2023-01-18 17:37:22 +08:00
|
|
|
if (!list_empty(list) || ret != BLK_STS_OK)
|
|
|
|
blk_mq_commit_rqs(hctx, queued, false);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* Any items that need requeuing? Stuff them into hctx->dispatch,
|
|
|
|
* that is where we will continue on next queue run.
|
|
|
|
*/
|
2016-12-07 23:41:17 +08:00
|
|
|
if (!list_empty(list)) {
|
2018-01-31 11:04:57 +08:00
|
|
|
bool needs_restart;
|
2020-06-30 18:24:58 +08:00
|
|
|
/* For non-shared tags, the RESTART check will suffice */
|
|
|
|
bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
|
2023-01-18 17:37:16 +08:00
|
|
|
((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
|
|
|
|
blk_mq_is_shared_tags(hctx->flags));
|
2018-01-31 11:04:57 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
if (nr_budgets)
|
|
|
|
blk_mq_release_budgets(q, list);
|
2018-01-31 11:04:57 +08:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
spin_lock(&hctx->lock);
|
2020-02-25 09:04:32 +08:00
|
|
|
list_splice_tail_init(list, &hctx->dispatch);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
spin_unlock(&hctx->lock);
|
2016-12-07 23:41:17 +08:00
|
|
|
|
2020-08-17 18:01:15 +08:00
|
|
|
/*
|
|
|
|
* Order adding requests to hctx->dispatch and checking
|
|
|
|
* SCHED_RESTART flag. The pair of this smp_mb() is the one
|
|
|
|
* in blk_mq_sched_restart(). Avoid restart code path to
|
|
|
|
* miss the new added requests to hctx->dispatch, meantime
|
|
|
|
* SCHED_RESTART is observed here.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
|
2015-05-05 04:32:48 +08:00
|
|
|
/*
|
2017-04-08 02:16:51 +08:00
|
|
|
* If SCHED_RESTART was set by the caller of this function and
|
|
|
|
* it is no longer set that means that it was cleared by another
|
|
|
|
* thread and hence that a queue rerun is needed.
|
2015-05-05 04:32:48 +08:00
|
|
|
*
|
2017-11-09 23:32:43 +08:00
|
|
|
* If 'no_tag' is set, that means that we failed getting
|
|
|
|
* a driver tag with an I/O scheduler attached. If our dispatch
|
|
|
|
* waitqueue is no longer active, ensure that we run the queue
|
|
|
|
* AFTER adding our entries back to the list.
|
2017-01-17 21:03:22 +08:00
|
|
|
*
|
2017-04-08 02:16:51 +08:00
|
|
|
* If no I/O scheduler has been configured it is possible that
|
|
|
|
* the hardware queue got stopped and restarted before requests
|
|
|
|
* were pushed back onto the dispatch list. Rerun the queue to
|
|
|
|
* avoid starvation. Notes:
|
|
|
|
* - blk_mq_run_hw_queue() checks whether or not a queue has
|
|
|
|
* been stopped before rerunning a queue.
|
|
|
|
* - Some but not all block drivers stop a queue before
|
2017-06-03 15:38:05 +08:00
|
|
|
* returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
|
2017-04-08 02:16:51 +08:00
|
|
|
* and dm-rq.
|
2018-01-31 11:04:57 +08:00
|
|
|
*
|
|
|
|
* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
|
|
|
|
* bit is set, run queue after a delay to avoid IO stalls
|
2020-04-21 00:24:51 +08:00
|
|
|
* that could otherwise occur if the queue is idle. We'll do
|
2021-10-27 00:51:27 +08:00
|
|
|
* similar if we couldn't get budget or couldn't lock a zone
|
|
|
|
* and SCHED_RESTART is set.
|
2017-01-17 21:03:22 +08:00
|
|
|
*/
|
2018-01-31 11:04:57 +08:00
|
|
|
needs_restart = blk_mq_sched_needs_restart(hctx);
|
2021-10-27 00:51:27 +08:00
|
|
|
if (prep == PREP_DISPATCH_NO_BUDGET)
|
|
|
|
needs_resource = true;
|
2018-01-31 11:04:57 +08:00
|
|
|
if (!needs_restart ||
|
2017-11-09 23:32:43 +08:00
|
|
|
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
|
2017-01-17 21:03:22 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
2022-09-05 18:19:50 +08:00
|
|
|
else if (needs_resource)
|
2018-01-31 11:04:57 +08:00
|
|
|
blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
|
blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29 01:54:01 +08:00
|
|
|
|
2018-07-03 23:03:16 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, true);
|
blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29 01:54:01 +08:00
|
|
|
return false;
|
2023-01-18 17:37:23 +08:00
|
|
|
}
|
2016-12-07 23:41:17 +08:00
|
|
|
|
2023-01-18 17:37:23 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
|
|
|
return true;
|
2016-12-07 23:41:17 +08:00
|
|
|
}
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* __blk_mq_run_hw_queue - Run a hardware queue.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
|
|
|
*
|
|
|
|
* Send pending requests to the hardware.
|
|
|
|
*/
|
2016-11-03 00:09:51 +08:00
|
|
|
static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
2017-08-01 23:28:24 +08:00
|
|
|
/*
|
|
|
|
* We can't run the queue inline with ints disabled. Ensure that
|
|
|
|
* we catch bad users of this early.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(in_interrupt());
|
|
|
|
|
2021-12-03 21:15:33 +08:00
|
|
|
blk_mq_run_dispatch_ops(hctx->queue,
|
|
|
|
blk_mq_sched_dispatch_requests(hctx));
|
2016-11-03 00:09:51 +08:00
|
|
|
}
|
|
|
|
|
2018-04-08 17:48:10 +08:00
|
|
|
static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
|
|
|
|
|
|
|
|
if (cpu >= nr_cpu_ids)
|
|
|
|
cpu = cpumask_first(hctx->cpumask);
|
|
|
|
return cpu;
|
|
|
|
}
|
|
|
|
|
2014-05-08 00:26:44 +08:00
|
|
|
/*
|
|
|
|
* It'd be great if the workqueue API had a way to pass
|
|
|
|
* in a mask and had some smarts for more clever placement.
|
|
|
|
* For now we just round-robin here, switching for every
|
|
|
|
* BLK_MQ_CPU_WORK_BATCH queued items.
|
|
|
|
*/
|
|
|
|
static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18 00:41:51 +08:00
|
|
|
bool tried = false;
|
2018-04-08 17:48:09 +08:00
|
|
|
int next_cpu = hctx->next_cpu;
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18 00:41:51 +08:00
|
|
|
|
2014-11-24 16:27:23 +08:00
|
|
|
if (hctx->queue->nr_hw_queues == 1)
|
|
|
|
return WORK_CPU_UNBOUND;
|
2014-05-08 00:26:44 +08:00
|
|
|
|
|
|
|
if (--hctx->next_cpu_batch <= 0) {
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18 00:41:51 +08:00
|
|
|
select_cpu:
|
2018-04-08 17:48:09 +08:00
|
|
|
next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
|
2018-01-12 10:53:06 +08:00
|
|
|
cpu_online_mask);
|
2014-05-08 00:26:44 +08:00
|
|
|
if (next_cpu >= nr_cpu_ids)
|
2018-04-08 17:48:10 +08:00
|
|
|
next_cpu = blk_mq_first_mapped_cpu(hctx);
|
2014-05-08 00:26:44 +08:00
|
|
|
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
|
|
|
|
}
|
|
|
|
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18 00:41:51 +08:00
|
|
|
/*
|
|
|
|
* Do unbound schedule if we can't find a online CPU for this hctx,
|
|
|
|
* and it should only happen in the path of handling CPU DEAD.
|
|
|
|
*/
|
2018-04-08 17:48:09 +08:00
|
|
|
if (!cpu_online(next_cpu)) {
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18 00:41:51 +08:00
|
|
|
if (!tried) {
|
|
|
|
tried = true;
|
|
|
|
goto select_cpu;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure to re-select CPU next time once after CPUs
|
|
|
|
* in hctx->cpumask become online again.
|
|
|
|
*/
|
2018-04-08 17:48:09 +08:00
|
|
|
hctx->next_cpu = next_cpu;
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18 00:41:51 +08:00
|
|
|
hctx->next_cpu_batch = 1;
|
|
|
|
return WORK_CPU_UNBOUND;
|
|
|
|
}
|
2018-04-08 17:48:09 +08:00
|
|
|
|
|
|
|
hctx->next_cpu = next_cpu;
|
|
|
|
return next_cpu;
|
2014-05-08 00:26:44 +08:00
|
|
|
}
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* __blk_mq_delay_run_hw_queue - Run (or schedule to run) a hardware queue.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
|
|
|
* @async: If we want to run the queue asynchronously.
|
2020-12-04 23:20:55 +08:00
|
|
|
* @msecs: Milliseconds of delay to wait before running the queue.
|
2020-01-07 02:08:18 +08:00
|
|
|
*
|
|
|
|
* If !@async, try to run the queue now. Else, run the queue asynchronously and
|
|
|
|
* with a delay of @msecs.
|
|
|
|
*/
|
2017-04-08 02:16:52 +08:00
|
|
|
static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
|
|
|
|
unsigned long msecs)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2017-06-21 02:15:49 +08:00
|
|
|
if (unlikely(blk_mq_hctx_stopped(hctx)))
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
return;
|
|
|
|
|
2016-09-22 00:12:13 +08:00
|
|
|
if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) {
|
2022-06-22 16:25:54 +08:00
|
|
|
if (cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
|
2014-11-08 06:03:59 +08:00
|
|
|
__blk_mq_run_hw_queue(hctx);
|
|
|
|
return;
|
|
|
|
}
|
2014-04-10 00:18:23 +08:00
|
|
|
}
|
2014-11-08 06:03:59 +08:00
|
|
|
|
2018-01-20 00:58:55 +08:00
|
|
|
kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
|
|
|
|
msecs_to_jiffies(msecs));
|
2017-04-08 02:16:52 +08:00
|
|
|
}
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_delay_run_hw_queue - Run a hardware queue asynchronously.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
2020-12-04 23:20:55 +08:00
|
|
|
* @msecs: Milliseconds of delay to wait before running the queue.
|
2020-01-07 02:08:18 +08:00
|
|
|
*
|
|
|
|
* Run a hardware queue asynchronously with a delay of @msecs.
|
|
|
|
*/
|
2017-04-08 02:16:52 +08:00
|
|
|
void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
|
|
|
|
{
|
|
|
|
__blk_mq_delay_run_hw_queue(hctx, true, msecs);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_run_hw_queue - Start to run a hardware queue.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
|
|
|
* @async: If we want to run the queue asynchronously.
|
|
|
|
*
|
|
|
|
* Check if the request queue is not in a quiesced state and if there are
|
|
|
|
* pending requests to be sent. If this is true, run the queue to send requests
|
|
|
|
* to hardware.
|
|
|
|
*/
|
2019-10-30 00:59:30 +08:00
|
|
|
void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
|
2017-04-08 02:16:52 +08:00
|
|
|
{
|
2018-01-06 16:27:38 +08:00
|
|
|
bool need_run;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When queue is quiesced, we may be switching io scheduler, or
|
|
|
|
* updating nr_hw_queues, or other things, and we can't run queue
|
|
|
|
* any more, even __blk_mq_hctx_has_pending() can't be called safely.
|
|
|
|
*
|
|
|
|
* And queue will be rerun in blk_mq_unquiesce_queue() if it is
|
|
|
|
* quiesced.
|
|
|
|
*/
|
2021-12-06 19:12:13 +08:00
|
|
|
__blk_mq_run_dispatch_ops(hctx->queue, false,
|
2021-12-03 21:15:31 +08:00
|
|
|
need_run = !blk_queue_quiesced(hctx->queue) &&
|
|
|
|
blk_mq_hctx_has_pending(hctx));
|
2018-01-06 16:27:38 +08:00
|
|
|
|
2019-10-30 00:59:30 +08:00
|
|
|
if (need_run)
|
2017-11-11 00:13:21 +08:00
|
|
|
__blk_mq_delay_run_hw_queue(hctx, async, 0);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2017-04-14 16:00:00 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_run_hw_queue);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-01-12 00:47:17 +08:00
|
|
|
/*
|
|
|
|
* Return prefered queue to dispatch from (if any) for non-mq aware IO
|
|
|
|
* scheduler.
|
|
|
|
*/
|
|
|
|
static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
|
|
|
|
{
|
2022-05-22 20:23:50 +08:00
|
|
|
struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
|
2021-01-12 00:47:17 +08:00
|
|
|
/*
|
|
|
|
* If the IO scheduler does not respect hardware queues when
|
|
|
|
* dispatching, we just don't bother with multiple HW queues and
|
|
|
|
* dispatch from hctx for the current CPU since running multiple queues
|
|
|
|
* just causes lock contention inside the scheduler and pointless cache
|
|
|
|
* bouncing.
|
|
|
|
*/
|
2022-06-16 06:55:49 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = ctx->hctxs[HCTX_TYPE_DEFAULT];
|
2022-05-22 20:23:50 +08:00
|
|
|
|
2021-01-12 00:47:17 +08:00
|
|
|
if (!blk_mq_hctx_stopped(hctx))
|
|
|
|
return hctx;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
2020-10-24 00:32:54 +08:00
|
|
|
* blk_mq_run_hw_queues - Run all hardware queues in a request queue.
|
2020-01-07 02:08:18 +08:00
|
|
|
* @q: Pointer to the request queue to run.
|
|
|
|
* @async: If we want to run the queue asynchronously.
|
|
|
|
*/
|
2015-03-12 11:56:38 +08:00
|
|
|
void blk_mq_run_hw_queues(struct request_queue *q, bool async)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-01-12 00:47:17 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx, *sq_hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-01-12 00:47:17 +08:00
|
|
|
sq_hctx = NULL;
|
2022-06-16 09:44:00 +08:00
|
|
|
if (blk_queue_sq_sched(q))
|
2021-01-12 00:47:17 +08:00
|
|
|
sq_hctx = blk_mq_get_sq_hctx(q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2017-11-11 00:13:21 +08:00
|
|
|
if (blk_mq_hctx_stopped(hctx))
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
continue;
|
2021-01-12 00:47:17 +08:00
|
|
|
/*
|
|
|
|
* Dispatch from this hctx either if there's no hctx preferred
|
|
|
|
* by IO scheduler or if it has requests that bypass the
|
|
|
|
* scheduler.
|
|
|
|
*/
|
|
|
|
if (!sq_hctx || sq_hctx == hctx ||
|
|
|
|
!list_empty_careful(&hctx->dispatch))
|
|
|
|
blk_mq_run_hw_queue(hctx, async);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
}
|
2015-03-12 11:56:38 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_run_hw_queues);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2020-04-21 00:24:52 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_delay_run_hw_queues - Run all hardware queues asynchronously.
|
|
|
|
* @q: Pointer to the request queue to run.
|
2020-12-04 23:20:55 +08:00
|
|
|
* @msecs: Milliseconds of delay to wait before running the queues.
|
2020-04-21 00:24:52 +08:00
|
|
|
*/
|
|
|
|
void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
|
|
|
|
{
|
2021-01-12 00:47:17 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx, *sq_hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2020-04-21 00:24:52 +08:00
|
|
|
|
2021-01-12 00:47:17 +08:00
|
|
|
sq_hctx = NULL;
|
2022-06-16 09:44:00 +08:00
|
|
|
if (blk_queue_sq_sched(q))
|
2021-01-12 00:47:17 +08:00
|
|
|
sq_hctx = blk_mq_get_sq_hctx(q);
|
2020-04-21 00:24:52 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
|
|
|
if (blk_mq_hctx_stopped(hctx))
|
|
|
|
continue;
|
2022-02-01 04:33:37 +08:00
|
|
|
/*
|
|
|
|
* If there is already a run_work pending, leave the
|
|
|
|
* pending delay untouched. Otherwise, a hctx can stall
|
|
|
|
* if another hctx is re-delaying the other's work
|
|
|
|
* before the work executes.
|
|
|
|
*/
|
|
|
|
if (delayed_work_pending(&hctx->run_work))
|
|
|
|
continue;
|
2021-01-12 00:47:17 +08:00
|
|
|
/*
|
|
|
|
* Dispatch from this hctx either if there's no hctx preferred
|
|
|
|
* by IO scheduler or if it has requests that bypass the
|
|
|
|
* scheduler.
|
|
|
|
*/
|
|
|
|
if (!sq_hctx || sq_hctx == hctx ||
|
|
|
|
!list_empty_careful(&hctx->dispatch))
|
|
|
|
blk_mq_delay_run_hw_queue(hctx, msecs);
|
2020-04-21 00:24:52 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_delay_run_hw_queues);
|
|
|
|
|
2017-06-06 23:22:09 +08:00
|
|
|
/*
|
|
|
|
* This function is often used for pausing .queue_rq() by driver when
|
|
|
|
* there isn't enough resource or some conditions aren't satisfied, and
|
2017-08-18 07:23:00 +08:00
|
|
|
* BLK_STS_RESOURCE is usually returned.
|
2017-06-06 23:22:09 +08:00
|
|
|
*
|
|
|
|
* We do not guarantee that dispatch can be drained or blocked
|
|
|
|
* after blk_mq_stop_hw_queue() returns. Please use
|
|
|
|
* blk_mq_quiesce_queue() for that requirement.
|
|
|
|
*/
|
2017-05-04 01:08:14 +08:00
|
|
|
void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
2017-06-06 23:22:10 +08:00
|
|
|
cancel_delayed_work(&hctx->run_work);
|
2013-10-25 21:45:58 +08:00
|
|
|
|
2017-06-06 23:22:10 +08:00
|
|
|
set_bit(BLK_MQ_S_STOPPED, &hctx->state);
|
2017-05-04 01:08:14 +08:00
|
|
|
}
|
2017-06-06 23:22:10 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_stop_hw_queue);
|
2017-05-04 01:08:14 +08:00
|
|
|
|
2017-06-06 23:22:09 +08:00
|
|
|
/*
|
|
|
|
* This function is often used for pausing .queue_rq() by driver when
|
|
|
|
* there isn't enough resource or some conditions aren't satisfied, and
|
2017-08-18 07:23:00 +08:00
|
|
|
* BLK_STS_RESOURCE is usually returned.
|
2017-06-06 23:22:09 +08:00
|
|
|
*
|
|
|
|
* We do not guarantee that dispatch can be drained or blocked
|
|
|
|
* after blk_mq_stop_hw_queues() returns. Please use
|
|
|
|
* blk_mq_quiesce_queue() for that requirement.
|
|
|
|
*/
|
2017-05-04 01:08:14 +08:00
|
|
|
void blk_mq_stop_hw_queues(struct request_queue *q)
|
|
|
|
{
|
2017-06-06 23:22:10 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2017-06-06 23:22:10 +08:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
blk_mq_stop_hw_queue(hctx);
|
2013-10-25 21:45:58 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_stop_hw_queues);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
|
2014-04-10 00:18:23 +08:00
|
|
|
|
2014-06-25 22:22:34 +08:00
|
|
|
blk_mq_run_hw_queue(hctx, false);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_start_hw_queue);
|
|
|
|
|
2014-04-16 15:44:56 +08:00
|
|
|
void blk_mq_start_hw_queues(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2014-04-16 15:44:56 +08:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
blk_mq_start_hw_queue(hctx);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_start_hw_queues);
|
|
|
|
|
2016-12-09 04:19:30 +08:00
|
|
|
void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
|
|
|
|
{
|
|
|
|
if (!blk_mq_hctx_stopped(hctx))
|
|
|
|
return;
|
|
|
|
|
|
|
|
clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
|
|
|
|
blk_mq_run_hw_queue(hctx, async);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);
|
|
|
|
|
2014-04-16 15:44:54 +08:00
|
|
|
void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-12-09 04:19:30 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
blk_mq_start_stopped_hw_queue(hctx, async);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues);
|
|
|
|
|
2014-04-17 00:48:08 +08:00
|
|
|
static void blk_mq_run_work_fn(struct work_struct *work)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
|
2017-04-10 23:54:54 +08:00
|
|
|
hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-04-10 23:54:56 +08:00
|
|
|
/*
|
2018-04-08 17:48:11 +08:00
|
|
|
* If we are stopped, don't run the queue.
|
2017-04-10 23:54:56 +08:00
|
|
|
*/
|
2020-10-09 11:26:30 +08:00
|
|
|
if (blk_mq_hctx_stopped(hctx))
|
2018-06-04 17:03:55 +08:00
|
|
|
return;
|
2017-04-08 02:16:52 +08:00
|
|
|
|
|
|
|
__blk_mq_run_hw_queue(hctx);
|
|
|
|
}
|
|
|
|
|
2015-10-20 23:13:57 +08:00
|
|
|
static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct request *rq,
|
|
|
|
bool at_head)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2016-08-25 05:34:35 +08:00
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type = hctx->type;
|
2016-08-25 05:34:35 +08:00
|
|
|
|
2017-06-21 02:15:47 +08:00
|
|
|
lockdep_assert_held(&ctx->lock);
|
|
|
|
|
2020-12-04 00:21:39 +08:00
|
|
|
trace_block_rq_insert(rq);
|
2013-11-20 09:59:10 +08:00
|
|
|
|
2014-02-08 02:22:36 +08:00
|
|
|
if (at_head)
|
2018-12-17 23:44:05 +08:00
|
|
|
list_add(&rq->queuelist, &ctx->rq_lists[type]);
|
2014-02-08 02:22:36 +08:00
|
|
|
else
|
2018-12-17 23:44:05 +08:00
|
|
|
list_add_tail(&rq->queuelist, &ctx->rq_lists[type]);
|
2015-10-20 23:13:57 +08:00
|
|
|
}
|
2014-05-09 23:36:49 +08:00
|
|
|
|
2016-12-15 05:34:47 +08:00
|
|
|
void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
|
|
|
|
bool at_head)
|
2015-10-20 23:13:57 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
|
|
|
|
2017-06-21 02:15:47 +08:00
|
|
|
lockdep_assert_held(&ctx->lock);
|
|
|
|
|
2016-08-25 05:34:35 +08:00
|
|
|
__blk_mq_insert_req_list(hctx, rq, at_head);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
blk_mq_hctx_mark_pending(hctx, ctx);
|
|
|
|
}
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_request_bypass_insert - Insert a request at dispatch list.
|
|
|
|
* @rq: Pointer to request to be inserted.
|
2020-08-17 07:39:34 +08:00
|
|
|
* @at_head: true if the request should be inserted at the head of the list.
|
2020-01-07 02:08:18 +08:00
|
|
|
* @run_queue: If we should run the hardware queue after inserting the request.
|
|
|
|
*
|
2017-09-12 06:43:57 +08:00
|
|
|
* Should only be used carefully, when the caller knows we want to
|
|
|
|
* bypass a potential IO scheduler on the target device.
|
|
|
|
*/
|
2020-02-25 09:04:32 +08:00
|
|
|
void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
|
|
|
|
bool run_queue)
|
2017-09-12 06:43:57 +08:00
|
|
|
{
|
2018-10-30 05:06:13 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2017-09-12 06:43:57 +08:00
|
|
|
|
|
|
|
spin_lock(&hctx->lock);
|
2020-02-25 09:04:32 +08:00
|
|
|
if (at_head)
|
|
|
|
list_add(&rq->queuelist, &hctx->dispatch);
|
|
|
|
else
|
|
|
|
list_add_tail(&rq->queuelist, &hctx->dispatch);
|
2017-09-12 06:43:57 +08:00
|
|
|
spin_unlock(&hctx->lock);
|
|
|
|
|
2017-11-02 23:24:34 +08:00
|
|
|
if (run_queue)
|
|
|
|
blk_mq_run_hw_queue(hctx, false);
|
2017-09-12 06:43:57 +08:00
|
|
|
}
|
|
|
|
|
2017-01-17 21:03:22 +08:00
|
|
|
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
|
|
|
|
struct list_head *list)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
{
|
2018-07-02 17:35:58 +08:00
|
|
|
struct request *rq;
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type = hctx->type;
|
2018-07-02 17:35:58 +08:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* preemption doesn't flush plug list, so it's possible ctx->cpu is
|
|
|
|
* offline now
|
|
|
|
*/
|
2018-07-02 17:35:58 +08:00
|
|
|
list_for_each_entry(rq, list, queuelist) {
|
2016-08-25 05:34:35 +08:00
|
|
|
BUG_ON(rq->mq_ctx != ctx);
|
2020-12-04 00:21:39 +08:00
|
|
|
trace_block_rq_insert(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2018-07-02 17:35:58 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
list_splice_tail_init(list, &ctx->rq_lists[type]);
|
2015-10-20 23:13:57 +08:00
|
|
|
blk_mq_hctx_mark_pending(hctx, ctx);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
}
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
|
|
|
|
unsigned int nr_segs)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2020-09-16 11:53:14 +08:00
|
|
|
int err;
|
|
|
|
|
2019-06-06 18:29:00 +08:00
|
|
|
if (bio->bi_opf & REQ_RAHEAD)
|
|
|
|
rq->cmd_flags |= REQ_FAILFAST_MASK;
|
|
|
|
|
|
|
|
rq->__sector = bio->bi_iter.bi_sector;
|
2019-06-06 18:29:01 +08:00
|
|
|
blk_rq_bio_prep(rq, bio, nr_segs);
|
2020-09-16 11:53:14 +08:00
|
|
|
|
|
|
|
/* This can't fail, since GFP_NOIO includes __GFP_DIRECT_RECLAIM. */
|
|
|
|
err = blk_crypto_rq_bio_prep(rq, bio, GFP_NOIO);
|
|
|
|
WARN_ON_ONCE(err);
|
2014-05-30 01:00:11 +08:00
|
|
|
|
2020-05-27 13:24:16 +08:00
|
|
|
blk_account_io_start(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2018-01-18 00:25:56 +08:00
|
|
|
static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
|
2021-10-12 19:12:24 +08:00
|
|
|
struct request *rq, bool last)
|
2015-05-09 01:51:32 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
struct blk_mq_queue_data bd = {
|
|
|
|
.rq = rq,
|
2018-11-25 01:15:46 +08:00
|
|
|
.last = last,
|
2015-05-09 01:51:32 +08:00
|
|
|
};
|
2017-06-13 01:22:46 +08:00
|
|
|
blk_status_t ret;
|
2018-01-18 00:25:56 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For OK queue, we are done. For error, caller may kill it.
|
|
|
|
* Any other error (busy), just add it to our list as we
|
|
|
|
* previously would have done.
|
|
|
|
*/
|
|
|
|
ret = q->mq_ops->queue_rq(hctx, &bd);
|
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
2018-07-10 09:03:31 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
2018-01-18 00:25:56 +08:00
|
|
|
break;
|
|
|
|
case BLK_STS_RESOURCE:
|
2018-01-31 11:04:57 +08:00
|
|
|
case BLK_STS_DEV_RESOURCE:
|
2018-07-10 09:03:31 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, true);
|
2018-01-18 00:25:56 +08:00
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
break;
|
|
|
|
default:
|
2018-07-10 09:03:31 +08:00
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
2018-01-18 00:25:56 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-04-05 01:08:43 +08:00
|
|
|
static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
|
2018-01-18 00:25:56 +08:00
|
|
|
struct request *rq,
|
2019-04-05 01:08:43 +08:00
|
|
|
bool bypass_insert, bool last)
|
2018-01-18 00:25:56 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
2017-06-06 23:22:00 +08:00
|
|
|
bool run_queue = true;
|
2021-01-22 10:33:12 +08:00
|
|
|
int budget_token;
|
2017-06-06 23:22:00 +08:00
|
|
|
|
2018-01-18 12:06:59 +08:00
|
|
|
/*
|
2019-04-05 01:08:43 +08:00
|
|
|
* RCU or SRCU read lock is needed before checking quiesced flag.
|
2018-01-18 12:06:59 +08:00
|
|
|
*
|
2019-04-05 01:08:43 +08:00
|
|
|
* When queue is stopped or quiesced, ignore 'bypass_insert' from
|
|
|
|
* blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
|
|
|
|
* and avoid driver to try to dispatch again.
|
2018-01-18 12:06:59 +08:00
|
|
|
*/
|
2019-04-05 01:08:43 +08:00
|
|
|
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
|
2017-06-06 23:22:00 +08:00
|
|
|
run_queue = false;
|
2019-04-05 01:08:43 +08:00
|
|
|
bypass_insert = false;
|
|
|
|
goto insert;
|
2017-06-06 23:22:00 +08:00
|
|
|
}
|
2015-05-09 01:51:32 +08:00
|
|
|
|
2021-10-15 23:44:38 +08:00
|
|
|
if ((rq->rq_flags & RQF_ELV) && !bypass_insert)
|
2019-04-05 01:08:43 +08:00
|
|
|
goto insert;
|
2016-10-29 08:20:02 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
budget_token = blk_mq_get_dispatch_budget(q);
|
|
|
|
if (budget_token < 0)
|
2019-04-05 01:08:43 +08:00
|
|
|
goto insert;
|
2017-01-17 21:03:22 +08:00
|
|
|
|
2021-01-22 10:33:12 +08:00
|
|
|
blk_mq_set_rq_budget_token(rq, budget_token);
|
|
|
|
|
2018-06-25 19:31:45 +08:00
|
|
|
if (!blk_mq_get_driver_tag(rq)) {
|
2021-01-22 10:33:12 +08:00
|
|
|
blk_mq_put_dispatch_budget(q, budget_token);
|
2019-04-05 01:08:43 +08:00
|
|
|
goto insert;
|
2017-11-05 02:21:12 +08:00
|
|
|
}
|
2017-10-14 17:22:29 +08:00
|
|
|
|
2021-10-12 19:12:24 +08:00
|
|
|
return __blk_mq_issue_directly(hctx, rq, last);
|
2019-04-05 01:08:43 +08:00
|
|
|
insert:
|
|
|
|
if (bypass_insert)
|
|
|
|
return BLK_STS_RESOURCE;
|
|
|
|
|
2020-08-18 17:07:28 +08:00
|
|
|
blk_mq_sched_insert_request(rq, false, run_queue, false);
|
|
|
|
|
2019-04-05 01:08:43 +08:00
|
|
|
return BLK_STS_OK;
|
|
|
|
}
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_try_issue_directly - Try to send a request directly to device driver.
|
|
|
|
* @hctx: Pointer of the associated hardware queue.
|
|
|
|
* @rq: Pointer to request to be sent.
|
|
|
|
*
|
|
|
|
* If the device has enough resources to accept a new request now, send the
|
|
|
|
* request directly to device driver. Else, insert at hctx->dispatch queue, so
|
|
|
|
* we can try send it another time in the future. Requests inserted at this
|
|
|
|
* queue have higher priority.
|
|
|
|
*/
|
2019-04-05 01:08:43 +08:00
|
|
|
static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
|
2021-10-12 19:12:24 +08:00
|
|
|
struct request *rq)
|
2019-04-05 01:08:43 +08:00
|
|
|
{
|
2021-12-03 21:15:31 +08:00
|
|
|
blk_status_t ret =
|
|
|
|
__blk_mq_try_issue_directly(hctx, rq, false, true);
|
2019-04-05 01:08:43 +08:00
|
|
|
|
|
|
|
if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
|
2020-02-25 09:04:32 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, false, true);
|
2019-04-05 01:08:43 +08:00
|
|
|
else if (ret != BLK_STS_OK)
|
|
|
|
blk_mq_end_request(rq, ret);
|
|
|
|
}
|
|
|
|
|
2021-11-17 14:13:58 +08:00
|
|
|
static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
|
2019-04-05 01:08:43 +08:00
|
|
|
{
|
2021-12-03 21:15:34 +08:00
|
|
|
return __blk_mq_try_issue_directly(rq->mq_hctx, rq, true, last);
|
2017-03-23 03:01:51 +08:00
|
|
|
}
|
|
|
|
|
2023-01-18 17:37:18 +08:00
|
|
|
static void blk_mq_plug_issue_direct(struct blk_plug *plug)
|
2021-11-17 14:13:57 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = NULL;
|
|
|
|
struct request *rq;
|
|
|
|
int queued = 0;
|
2023-01-18 17:37:20 +08:00
|
|
|
blk_status_t ret = BLK_STS_OK;
|
2021-11-17 14:13:57 +08:00
|
|
|
|
|
|
|
while ((rq = rq_list_pop(&plug->mq_list))) {
|
|
|
|
bool last = rq_list_empty(plug->mq_list);
|
|
|
|
|
|
|
|
if (hctx != rq->mq_hctx) {
|
2023-01-18 17:37:19 +08:00
|
|
|
if (hctx) {
|
|
|
|
blk_mq_commit_rqs(hctx, queued, false);
|
|
|
|
queued = 0;
|
|
|
|
}
|
2021-11-17 14:13:57 +08:00
|
|
|
hctx = rq->mq_hctx;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = blk_mq_request_issue_directly(rq, last);
|
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
|
|
|
queued++;
|
|
|
|
break;
|
|
|
|
case BLK_STS_RESOURCE:
|
|
|
|
case BLK_STS_DEV_RESOURCE:
|
2022-08-03 10:33:55 +08:00
|
|
|
blk_mq_request_bypass_insert(rq, false, true);
|
2023-01-18 17:37:20 +08:00
|
|
|
goto out;
|
2021-11-17 14:13:57 +08:00
|
|
|
default:
|
|
|
|
blk_mq_end_request(rq, ret);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-01-18 17:37:20 +08:00
|
|
|
out:
|
|
|
|
if (ret != BLK_STS_OK)
|
2023-01-18 17:37:19 +08:00
|
|
|
blk_mq_commit_rqs(hctx, queued, false);
|
2021-11-17 14:13:57 +08:00
|
|
|
}
|
|
|
|
|
2021-12-21 04:59:19 +08:00
|
|
|
static void __blk_mq_flush_plug_list(struct request_queue *q,
|
|
|
|
struct blk_plug *plug)
|
|
|
|
{
|
|
|
|
if (blk_queue_quiesced(q))
|
|
|
|
return;
|
|
|
|
q->mq_ops->queue_rqs(&plug->mq_list);
|
|
|
|
}
|
|
|
|
|
2022-03-12 01:24:17 +08:00
|
|
|
static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *this_hctx = NULL;
|
|
|
|
struct blk_mq_ctx *this_ctx = NULL;
|
|
|
|
struct request *requeue_list = NULL;
|
|
|
|
unsigned int depth = 0;
|
|
|
|
LIST_HEAD(list);
|
|
|
|
|
|
|
|
do {
|
|
|
|
struct request *rq = rq_list_pop(&plug->mq_list);
|
|
|
|
|
|
|
|
if (!this_hctx) {
|
|
|
|
this_hctx = rq->mq_hctx;
|
|
|
|
this_ctx = rq->mq_ctx;
|
|
|
|
} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
|
|
|
|
rq_list_add(&requeue_list, rq);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
list_add_tail(&rq->queuelist, &list);
|
|
|
|
depth++;
|
|
|
|
} while (!rq_list_empty(plug->mq_list));
|
|
|
|
|
|
|
|
plug->mq_list = requeue_list;
|
|
|
|
trace_block_unplug(this_hctx->queue, depth, !from_sched);
|
|
|
|
blk_mq_sched_insert_requests(this_hctx, this_ctx, &list, from_sched);
|
|
|
|
}
|
|
|
|
|
2021-11-17 14:13:57 +08:00
|
|
|
void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
|
|
|
|
{
|
2021-12-03 21:48:53 +08:00
|
|
|
struct request *rq;
|
2021-11-17 14:13:57 +08:00
|
|
|
|
|
|
|
if (rq_list_empty(plug->mq_list))
|
|
|
|
return;
|
|
|
|
plug->rq_count = 0;
|
|
|
|
|
|
|
|
if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
|
2021-12-03 21:48:53 +08:00
|
|
|
struct request_queue *q;
|
|
|
|
|
|
|
|
rq = rq_list_peek(&plug->mq_list);
|
|
|
|
q = rq->q;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Peek first request and see if we have a ->queue_rqs() hook.
|
|
|
|
* If we do, we can dispatch the whole plug list in one go. We
|
|
|
|
* already know at this point that all requests belong to the
|
|
|
|
* same queue, caller must ensure that's the case.
|
|
|
|
*
|
|
|
|
* Since we pass off the full list to the driver at this point,
|
|
|
|
* we do not increment the active request count for the queue.
|
|
|
|
* Bypass shared tags for now because of that.
|
|
|
|
*/
|
|
|
|
if (q->mq_ops->queue_rqs &&
|
|
|
|
!(rq->mq_hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
|
|
|
|
blk_mq_run_dispatch_ops(q,
|
2021-12-21 04:59:19 +08:00
|
|
|
__blk_mq_flush_plug_list(q, plug));
|
2021-12-03 21:48:53 +08:00
|
|
|
if (rq_list_empty(plug->mq_list))
|
|
|
|
return;
|
|
|
|
}
|
2021-12-06 11:33:50 +08:00
|
|
|
|
|
|
|
blk_mq_run_dispatch_ops(q,
|
2023-01-18 17:37:18 +08:00
|
|
|
blk_mq_plug_issue_direct(plug));
|
2021-11-17 14:13:57 +08:00
|
|
|
if (rq_list_empty(plug->mq_list))
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
do {
|
2022-03-12 01:24:17 +08:00
|
|
|
blk_mq_dispatch_plug_list(plug, from_schedule);
|
2021-11-17 14:13:57 +08:00
|
|
|
} while (!rq_list_empty(plug->mq_list));
|
|
|
|
}
|
|
|
|
|
2018-07-10 09:03:31 +08:00
|
|
|
void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct list_head *list)
|
|
|
|
{
|
2020-04-07 02:13:48 +08:00
|
|
|
int queued = 0;
|
2023-01-18 17:37:21 +08:00
|
|
|
blk_status_t ret = BLK_STS_OK;
|
2020-04-07 02:13:48 +08:00
|
|
|
|
2018-07-10 09:03:31 +08:00
|
|
|
while (!list_empty(list)) {
|
|
|
|
struct request *rq = list_first_entry(list, struct request,
|
|
|
|
queuelist);
|
|
|
|
|
|
|
|
list_del_init(&rq->queuelist);
|
2019-04-05 01:08:43 +08:00
|
|
|
ret = blk_mq_request_issue_directly(rq, list_empty(list));
|
2023-01-18 17:37:25 +08:00
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
2020-04-07 02:13:48 +08:00
|
|
|
queued++;
|
2023-01-18 17:37:25 +08:00
|
|
|
break;
|
|
|
|
case BLK_STS_RESOURCE:
|
|
|
|
case BLK_STS_DEV_RESOURCE:
|
|
|
|
blk_mq_request_bypass_insert(rq, false,
|
|
|
|
list_empty(list));
|
|
|
|
goto out;
|
|
|
|
default:
|
|
|
|
blk_mq_end_request(rq, ret);
|
|
|
|
break;
|
|
|
|
}
|
2018-07-10 09:03:31 +08:00
|
|
|
}
|
2018-11-28 08:02:25 +08:00
|
|
|
|
2023-01-18 17:37:25 +08:00
|
|
|
out:
|
2023-01-18 17:37:21 +08:00
|
|
|
if (ret != BLK_STS_OK)
|
|
|
|
blk_mq_commit_rqs(hctx, queued, false);
|
2018-07-10 09:03:31 +08:00
|
|
|
}
|
|
|
|
|
2021-11-11 16:51:34 +08:00
|
|
|
static bool blk_mq_attempt_bio_merge(struct request_queue *q,
|
2021-11-24 00:04:41 +08:00
|
|
|
struct bio *bio, unsigned int nr_segs)
|
2021-11-03 19:47:09 +08:00
|
|
|
{
|
|
|
|
if (!blk_queue_nomerges(q) && bio_mergeable(bio)) {
|
2021-11-24 00:04:41 +08:00
|
|
|
if (blk_attempt_plug_merge(q, bio, nr_segs))
|
2021-11-03 19:47:09 +08:00
|
|
|
return true;
|
|
|
|
if (blk_mq_sched_bio_merge(q, bio, nr_segs))
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2021-11-03 19:52:45 +08:00
|
|
|
static struct request *blk_mq_get_new_requests(struct request_queue *q,
|
|
|
|
struct blk_plug *plug,
|
2022-03-08 16:09:15 +08:00
|
|
|
struct bio *bio,
|
|
|
|
unsigned int nsegs)
|
2021-11-03 19:52:45 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.nr_tags = 1,
|
2022-01-04 21:42:23 +08:00
|
|
|
.cmd_flags = bio->bi_opf,
|
2021-11-03 19:52:45 +08:00
|
|
|
};
|
|
|
|
struct request *rq;
|
|
|
|
|
2021-11-24 14:28:56 +08:00
|
|
|
if (unlikely(bio_queue_enter(bio)))
|
2021-11-12 20:47:15 +08:00
|
|
|
return NULL;
|
2021-11-03 19:47:09 +08:00
|
|
|
|
2022-03-08 16:09:15 +08:00
|
|
|
if (blk_mq_attempt_bio_merge(q, bio, nsegs))
|
|
|
|
goto queue_exit;
|
|
|
|
|
|
|
|
rq_qos_throttle(q, bio);
|
|
|
|
|
2021-11-03 19:52:45 +08:00
|
|
|
if (plug) {
|
|
|
|
data.nr_tags = plug->nr_ios;
|
|
|
|
plug->nr_ios = 1;
|
|
|
|
data.cached_rq = &plug->cached_rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
rq = __blk_mq_alloc_requests(&data);
|
2021-12-03 03:42:58 +08:00
|
|
|
if (rq)
|
|
|
|
return rq;
|
2021-11-03 19:52:45 +08:00
|
|
|
rq_qos_cleanup(q, bio);
|
|
|
|
if (bio->bi_opf & REQ_NOWAIT)
|
|
|
|
bio_wouldblock_error(bio);
|
2022-03-08 16:09:15 +08:00
|
|
|
queue_exit:
|
2021-11-24 14:28:56 +08:00
|
|
|
blk_queue_exit(q);
|
2021-11-03 19:52:45 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2021-11-24 14:28:56 +08:00
|
|
|
static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
|
2022-03-08 16:09:15 +08:00
|
|
|
struct blk_plug *plug, struct bio **bio, unsigned int nsegs)
|
2021-11-03 19:52:45 +08:00
|
|
|
{
|
2021-11-12 20:47:15 +08:00
|
|
|
struct request *rq;
|
2023-01-17 19:42:15 +08:00
|
|
|
enum hctx_type type, hctx_type;
|
2021-11-12 20:47:15 +08:00
|
|
|
|
2021-11-24 14:28:56 +08:00
|
|
|
if (!plug)
|
|
|
|
return NULL;
|
2021-11-03 19:52:45 +08:00
|
|
|
|
2022-03-08 16:09:15 +08:00
|
|
|
if (blk_mq_attempt_bio_merge(q, *bio, nsegs)) {
|
|
|
|
*bio = NULL;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2023-02-09 11:19:30 +08:00
|
|
|
rq = rq_list_peek(&plug->cached_rq);
|
|
|
|
if (!rq || rq->q != q)
|
|
|
|
return NULL;
|
|
|
|
|
2023-01-17 19:42:15 +08:00
|
|
|
type = blk_mq_get_hctx_type((*bio)->bi_opf);
|
|
|
|
hctx_type = rq->mq_hctx->type;
|
|
|
|
if (type != hctx_type &&
|
|
|
|
!(type == HCTX_TYPE_READ && hctx_type == HCTX_TYPE_DEFAULT))
|
2021-11-24 14:28:56 +08:00
|
|
|
return NULL;
|
2022-03-08 16:09:15 +08:00
|
|
|
if (op_is_flush(rq->cmd_flags) != op_is_flush((*bio)->bi_opf))
|
2021-11-24 14:28:56 +08:00
|
|
|
return NULL;
|
|
|
|
|
2022-06-22 00:03:57 +08:00
|
|
|
/*
|
|
|
|
* If any qos ->throttle() end up blocking, we will have flushed the
|
|
|
|
* plug and hence killed the cached_rq list as well. Pop this entry
|
|
|
|
* before we throttle.
|
|
|
|
*/
|
2021-11-24 14:28:56 +08:00
|
|
|
plug->cached_rq = rq_list_next(rq);
|
2022-06-22 00:03:57 +08:00
|
|
|
rq_qos_throttle(q, *bio);
|
|
|
|
|
|
|
|
rq->cmd_flags = (*bio)->bi_opf;
|
2021-11-24 14:28:56 +08:00
|
|
|
INIT_LIST_HEAD(&rq->queuelist);
|
|
|
|
return rq;
|
2021-11-03 19:52:45 +08:00
|
|
|
}
|
|
|
|
|
2022-06-23 15:48:32 +08:00
|
|
|
static void bio_set_ioprio(struct bio *bio)
|
|
|
|
{
|
2022-06-23 15:48:34 +08:00
|
|
|
/* Nobody set ioprio so far? Initialize it based on task's nice value */
|
|
|
|
if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
|
|
|
|
bio->bi_ioprio = get_current_ioprio();
|
2022-06-23 15:48:32 +08:00
|
|
|
blkcg_set_ioprio(bio);
|
|
|
|
}
|
|
|
|
|
2020-01-07 02:08:18 +08:00
|
|
|
/**
|
2020-07-01 16:59:43 +08:00
|
|
|
* blk_mq_submit_bio - Create and send a request to block device.
|
2020-01-07 02:08:18 +08:00
|
|
|
* @bio: Bio pointer.
|
|
|
|
*
|
|
|
|
* Builds up a request structure from @q and @bio and send to the device. The
|
|
|
|
* request may not be queued directly to hardware if:
|
|
|
|
* * This request can be merged with another one
|
|
|
|
* * We want to place request at plug queue for possible future merging
|
|
|
|
* * There is an IO scheduler active at this queue
|
|
|
|
*
|
|
|
|
* It will not queue the request if there is an error with the bio, or at the
|
|
|
|
* request creation.
|
|
|
|
*/
|
2021-10-12 19:12:24 +08:00
|
|
|
void blk_mq_submit_bio(struct bio *bio)
|
2014-05-23 00:40:51 +08:00
|
|
|
{
|
2021-10-14 22:03:30 +08:00
|
|
|
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
|
2022-07-06 15:03:38 +08:00
|
|
|
struct blk_plug *plug = blk_mq_plug(bio);
|
2016-10-28 22:48:16 +08:00
|
|
|
const int is_sync = op_is_sync(bio->bi_opf);
|
2014-05-23 00:40:51 +08:00
|
|
|
struct request *rq;
|
2021-10-14 02:43:41 +08:00
|
|
|
unsigned int nr_segs = 1;
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
|
|
|
blk_status_t ret;
|
2014-05-23 00:40:51 +08:00
|
|
|
|
2022-07-28 00:22:56 +08:00
|
|
|
bio = blk_queue_bounce(bio, q);
|
2023-01-04 23:51:19 +08:00
|
|
|
if (bio_may_exceed_limits(bio, &q->limits)) {
|
2022-07-28 00:23:00 +08:00
|
|
|
bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
|
2023-01-04 23:51:19 +08:00
|
|
|
if (!bio)
|
|
|
|
return;
|
|
|
|
}
|
blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op
When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns
"Input/output error". Looks block layer split the bio after calling
bio_integrity_prep(bio). This patch fixes the issue.
Below is how we debug this issue:
(1)format nvme to 4K block # size with type 2 DIF
(2)dd with block size bigger than 1024k.
oflag=direct
dd: error writing '/dev/nvme0n1': Input/output error
We added some debug code in nvme device driver. It showed us the first
op and the second op have the same bi and pi address. This is not
correct.
1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828
2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605 ==> This op fails and subsequent 5 retires..
Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828
With the fix, It showed us both of the first op and the second op have
correct bi and pi address.
1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
0x00002828
2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605
Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
0x00003028
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-10 21:54:11 +08:00
|
|
|
|
2017-06-30 02:31:11 +08:00
|
|
|
if (!bio_integrity_prep(bio))
|
2021-11-03 19:47:09 +08:00
|
|
|
return;
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 03:38:14 +08:00
|
|
|
|
2022-06-23 15:48:33 +08:00
|
|
|
bio_set_ioprio(bio);
|
|
|
|
|
2022-03-08 16:09:15 +08:00
|
|
|
rq = blk_mq_get_cached_request(q, plug, &bio, nr_segs);
|
2021-11-24 14:28:56 +08:00
|
|
|
if (!rq) {
|
2022-03-08 16:09:15 +08:00
|
|
|
if (!bio)
|
|
|
|
return;
|
|
|
|
rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
|
2021-11-24 14:28:56 +08:00
|
|
|
if (unlikely(!rq))
|
|
|
|
return;
|
|
|
|
}
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 03:38:14 +08:00
|
|
|
|
2020-12-04 00:21:36 +08:00
|
|
|
trace_block_getrq(bio);
|
2018-10-23 22:30:50 +08:00
|
|
|
|
2018-07-03 23:14:59 +08:00
|
|
|
rq_qos_track(q, rq, bio);
|
2014-05-23 00:40:51 +08:00
|
|
|
|
2019-07-01 23:47:30 +08:00
|
|
|
blk_mq_bio_to_request(rq, bio, nr_segs);
|
|
|
|
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
|
|
|
ret = blk_crypto_init_request(rq);
|
|
|
|
if (ret != BLK_STS_OK) {
|
|
|
|
bio->bi_status = ret;
|
|
|
|
bio_endio(bio);
|
|
|
|
blk_mq_free_request(rq);
|
2021-10-12 19:12:24 +08:00
|
|
|
return;
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
|
|
|
}
|
|
|
|
|
2021-11-18 23:30:41 +08:00
|
|
|
if (op_is_flush(bio->bi_opf)) {
|
|
|
|
blk_insert_flush(rq);
|
2021-10-19 20:25:53 +08:00
|
|
|
return;
|
2021-11-18 23:30:41 +08:00
|
|
|
}
|
2021-10-19 20:25:53 +08:00
|
|
|
|
2021-11-24 00:04:42 +08:00
|
|
|
if (plug)
|
2018-11-28 08:13:56 +08:00
|
|
|
blk_add_rq_to_plug(plug, rq);
|
2021-11-24 00:04:42 +08:00
|
|
|
else if ((rq->rq_flags & RQF_ELV) ||
|
|
|
|
(rq->mq_hctx->dispatch_busy &&
|
|
|
|
(q->nr_hw_queues == 1 || !is_sync)))
|
2019-09-27 15:24:30 +08:00
|
|
|
blk_mq_sched_insert_request(rq, false, true, true);
|
2021-11-24 00:04:42 +08:00
|
|
|
else
|
2021-12-03 21:15:33 +08:00
|
|
|
blk_mq_run_dispatch_ops(rq->q,
|
2021-12-03 21:15:31 +08:00
|
|
|
blk_mq_try_issue_directly(rq->mq_hctx, rq));
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2022-02-15 18:05:36 +08:00
|
|
|
#ifdef CONFIG_BLK_MQ_STACKING
|
2021-11-17 14:13:58 +08:00
|
|
|
/**
|
2022-02-15 18:05:37 +08:00
|
|
|
* blk_insert_cloned_request - Helper for stacking drivers to submit a request
|
|
|
|
* @rq: the request being queued
|
2021-11-17 14:13:58 +08:00
|
|
|
*/
|
2022-02-15 18:05:38 +08:00
|
|
|
blk_status_t blk_insert_cloned_request(struct request *rq)
|
2021-11-17 14:13:58 +08:00
|
|
|
{
|
2022-02-15 18:05:38 +08:00
|
|
|
struct request_queue *q = rq->q;
|
2021-11-17 14:13:58 +08:00
|
|
|
unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
|
2023-03-01 08:06:55 +08:00
|
|
|
unsigned int max_segments = blk_rq_get_max_segments(rq);
|
2022-02-15 18:05:37 +08:00
|
|
|
blk_status_t ret;
|
2021-11-17 14:13:58 +08:00
|
|
|
|
|
|
|
if (blk_rq_sectors(rq) > max_sectors) {
|
|
|
|
/*
|
|
|
|
* SCSI device does not have a good way to return if
|
|
|
|
* Write Same/Zero is actually supported. If a device rejects
|
|
|
|
* a non-read/write command (discard, write same,etc.) the
|
|
|
|
* low-level device driver will set the relevant queue limit to
|
|
|
|
* 0 to prevent blk-lib from issuing more of the offending
|
|
|
|
* operations. Commands queued prior to the queue limit being
|
|
|
|
* reset need to be completed with BLK_STS_NOTSUPP to avoid I/O
|
|
|
|
* errors being propagated to upper layers.
|
|
|
|
*/
|
|
|
|
if (max_sectors == 0)
|
|
|
|
return BLK_STS_NOTSUPP;
|
|
|
|
|
|
|
|
printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
|
|
|
|
__func__, blk_rq_sectors(rq), max_sectors);
|
|
|
|
return BLK_STS_IOERR;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The queue settings related to segment counting may differ from the
|
|
|
|
* original queue.
|
|
|
|
*/
|
|
|
|
rq->nr_phys_segments = blk_recalc_rq_segments(rq);
|
2023-03-01 08:06:55 +08:00
|
|
|
if (rq->nr_phys_segments > max_segments) {
|
|
|
|
printk(KERN_ERR "%s: over max segments limit. (%u > %u)\n",
|
|
|
|
__func__, rq->nr_phys_segments, max_segments);
|
2021-11-17 14:13:58 +08:00
|
|
|
return BLK_STS_IOERR;
|
|
|
|
}
|
|
|
|
|
2022-02-15 18:05:38 +08:00
|
|
|
if (q->disk && should_fail_request(q->disk->part0, blk_rq_bytes(rq)))
|
2021-11-17 14:13:58 +08:00
|
|
|
return BLK_STS_IOERR;
|
|
|
|
|
|
|
|
if (blk_crypto_insert_cloned_request(rq))
|
|
|
|
return BLK_STS_IOERR;
|
|
|
|
|
|
|
|
blk_account_io_start(rq);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since we have a scheduler attached on the top device,
|
|
|
|
* bypass a potential scheduler on the bottom device for
|
|
|
|
* insert.
|
|
|
|
*/
|
2022-02-15 18:05:38 +08:00
|
|
|
blk_mq_run_dispatch_ops(q,
|
2021-12-03 21:15:34 +08:00
|
|
|
ret = blk_mq_request_issue_directly(rq, true));
|
2022-01-26 09:21:32 +08:00
|
|
|
if (ret)
|
|
|
|
blk_account_io_done(rq, ktime_get_ns());
|
2021-12-03 21:15:34 +08:00
|
|
|
return ret;
|
2021-11-17 14:13:58 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_rq_unprep_clone - Helper function to free all bios in a cloned request
|
|
|
|
* @rq: the clone request to be cleaned up
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Free all bios in @rq for a cloned request.
|
|
|
|
*/
|
|
|
|
void blk_rq_unprep_clone(struct request *rq)
|
|
|
|
{
|
|
|
|
struct bio *bio;
|
|
|
|
|
|
|
|
while ((bio = rq->bio) != NULL) {
|
|
|
|
rq->bio = bio->bi_next;
|
|
|
|
|
|
|
|
bio_put(bio);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_rq_prep_clone - Helper function to setup clone request
|
|
|
|
* @rq: the request to be setup
|
|
|
|
* @rq_src: original request to be cloned
|
|
|
|
* @bs: bio_set that bios for clone are allocated from
|
|
|
|
* @gfp_mask: memory allocation mask for bio
|
|
|
|
* @bio_ctr: setup function to be called for each clone bio.
|
|
|
|
* Returns %0 for success, non %0 for failure.
|
|
|
|
* @data: private data to be passed to @bio_ctr
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
|
|
|
|
* Also, pages which the original bios are pointing to are not copied
|
|
|
|
* and the cloned bios just point same pages.
|
|
|
|
* So cloned bios must be completed before original bios, which means
|
|
|
|
* the caller must complete @rq before @rq_src.
|
|
|
|
*/
|
|
|
|
int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
|
|
|
|
struct bio_set *bs, gfp_t gfp_mask,
|
|
|
|
int (*bio_ctr)(struct bio *, struct bio *, void *),
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct bio *bio, *bio_src;
|
|
|
|
|
|
|
|
if (!bs)
|
|
|
|
bs = &fs_bio_set;
|
|
|
|
|
|
|
|
__rq_for_each_bio(bio_src, rq_src) {
|
2022-02-03 00:01:09 +08:00
|
|
|
bio = bio_alloc_clone(rq->q->disk->part0, bio_src, gfp_mask,
|
|
|
|
bs);
|
2021-11-17 14:13:58 +08:00
|
|
|
if (!bio)
|
|
|
|
goto free_and_out;
|
|
|
|
|
|
|
|
if (bio_ctr && bio_ctr(bio, bio_src, data))
|
|
|
|
goto free_and_out;
|
|
|
|
|
|
|
|
if (rq->bio) {
|
|
|
|
rq->biotail->bi_next = bio;
|
|
|
|
rq->biotail = bio;
|
|
|
|
} else {
|
|
|
|
rq->bio = rq->biotail = bio;
|
|
|
|
}
|
|
|
|
bio = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Copy attributes of the original request to the clone request. */
|
|
|
|
rq->__sector = blk_rq_pos(rq_src);
|
|
|
|
rq->__data_len = blk_rq_bytes(rq_src);
|
|
|
|
if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
|
|
|
|
rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
|
|
|
|
rq->special_vec = rq_src->special_vec;
|
|
|
|
}
|
|
|
|
rq->nr_phys_segments = rq_src->nr_phys_segments;
|
|
|
|
rq->ioprio = rq_src->ioprio;
|
|
|
|
|
|
|
|
if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0)
|
|
|
|
goto free_and_out;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
free_and_out:
|
|
|
|
if (bio)
|
|
|
|
bio_put(bio);
|
|
|
|
blk_rq_unprep_clone(rq);
|
|
|
|
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
|
2022-02-15 18:05:36 +08:00
|
|
|
#endif /* CONFIG_BLK_MQ_STACKING */
|
2021-11-17 14:13:58 +08:00
|
|
|
|
2021-11-17 14:14:00 +08:00
|
|
|
/*
|
|
|
|
* Steal bios from a request and add them to a bio list.
|
|
|
|
* The request must not have been partially completed before.
|
|
|
|
*/
|
|
|
|
void blk_steal_bios(struct bio_list *list, struct request *rq)
|
|
|
|
{
|
|
|
|
if (rq->bio) {
|
|
|
|
if (list->tail)
|
|
|
|
list->tail->bi_next = rq->bio;
|
|
|
|
else
|
|
|
|
list->head = rq->bio;
|
|
|
|
list->tail = rq->biotail;
|
|
|
|
|
|
|
|
rq->bio = NULL;
|
|
|
|
rq->biotail = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
rq->__data_len = 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_steal_bios);
|
|
|
|
|
2021-05-11 23:22:35 +08:00
|
|
|
static size_t order_to_size(unsigned int order)
|
|
|
|
{
|
|
|
|
return (size_t)PAGE_SIZE << order;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* called before freeing request pool in @tags */
|
2021-10-05 18:23:32 +08:00
|
|
|
static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
|
|
|
|
struct blk_mq_tags *tags)
|
2021-05-11 23:22:35 +08:00
|
|
|
{
|
|
|
|
struct page *page;
|
|
|
|
unsigned long flags;
|
|
|
|
|
2022-10-11 22:22:53 +08:00
|
|
|
/*
|
|
|
|
* There is no need to clear mapping if driver tags is not initialized
|
|
|
|
* or the mapping belongs to the driver tags.
|
|
|
|
*/
|
|
|
|
if (!drv_tags || drv_tags == tags)
|
2021-10-05 18:23:33 +08:00
|
|
|
return;
|
|
|
|
|
2021-05-11 23:22:35 +08:00
|
|
|
list_for_each_entry(page, &tags->page_list, lru) {
|
|
|
|
unsigned long start = (unsigned long)page_address(page);
|
|
|
|
unsigned long end = start + order_to_size(page->private);
|
|
|
|
int i;
|
|
|
|
|
2021-10-05 18:23:32 +08:00
|
|
|
for (i = 0; i < drv_tags->nr_tags; i++) {
|
2021-05-11 23:22:35 +08:00
|
|
|
struct request *rq = drv_tags->rqs[i];
|
|
|
|
unsigned long rq_addr = (unsigned long)rq;
|
|
|
|
|
|
|
|
if (rq_addr >= start && rq_addr < end) {
|
2021-10-15 04:39:59 +08:00
|
|
|
WARN_ON_ONCE(req_ref_read(rq) != 0);
|
2021-05-11 23:22:35 +08:00
|
|
|
cmpxchg(&drv_tags->rqs[i], rq, NULL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait until all pending iteration is done.
|
|
|
|
*
|
|
|
|
* Request reference is cleared and it is guaranteed to be observed
|
|
|
|
* after the ->lock is released.
|
|
|
|
*/
|
|
|
|
spin_lock_irqsave(&drv_tags->lock, flags);
|
|
|
|
spin_unlock_irqrestore(&drv_tags->lock, flags);
|
|
|
|
}
|
|
|
|
|
2017-01-12 05:29:56 +08:00
|
|
|
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
|
|
|
|
unsigned int hctx_idx)
|
2014-03-15 00:43:15 +08:00
|
|
|
{
|
2021-10-05 18:23:32 +08:00
|
|
|
struct blk_mq_tags *drv_tags;
|
2014-04-16 03:59:10 +08:00
|
|
|
struct page *page;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2022-03-08 13:51:48 +08:00
|
|
|
if (list_empty(&tags->page_list))
|
|
|
|
return;
|
|
|
|
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags))
|
|
|
|
drv_tags = set->shared_tags;
|
2021-10-05 18:23:37 +08:00
|
|
|
else
|
|
|
|
drv_tags = set->tags[hctx_idx];
|
2021-10-05 18:23:32 +08:00
|
|
|
|
2021-10-05 18:23:26 +08:00
|
|
|
if (tags->static_rqs && set->ops->exit_request) {
|
2014-04-16 03:59:10 +08:00
|
|
|
int i;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
for (i = 0; i < tags->nr_tags; i++) {
|
2017-01-14 05:39:30 +08:00
|
|
|
struct request *rq = tags->static_rqs[i];
|
|
|
|
|
|
|
|
if (!rq)
|
2014-04-16 03:59:10 +08:00
|
|
|
continue;
|
2017-05-02 00:19:08 +08:00
|
|
|
set->ops->exit_request(set, rq, hctx_idx);
|
2017-01-14 05:39:30 +08:00
|
|
|
tags->static_rqs[i] = NULL;
|
2014-04-16 03:59:10 +08:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:32 +08:00
|
|
|
blk_mq_clear_rq_mapping(drv_tags, tags);
|
2021-05-11 23:22:35 +08:00
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
while (!list_empty(&tags->page_list)) {
|
|
|
|
page = list_first_entry(&tags->page_list, struct page, lru);
|
2014-01-09 11:17:46 +08:00
|
|
|
list_del_init(&page->lru);
|
2015-09-15 01:16:02 +08:00
|
|
|
/*
|
|
|
|
* Remove kmemleak object previously allocated in
|
2019-05-03 03:48:11 +08:00
|
|
|
* blk_mq_alloc_rqs().
|
2015-09-15 01:16:02 +08:00
|
|
|
*/
|
|
|
|
kmemleak_free(page_address(page));
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
__free_pages(page, page->private);
|
|
|
|
}
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
void blk_mq_free_rq_map(struct blk_mq_tags *tags)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
2014-04-16 04:14:00 +08:00
|
|
|
kfree(tags->rqs);
|
2017-01-12 05:29:56 +08:00
|
|
|
tags->rqs = NULL;
|
2017-01-14 05:39:30 +08:00
|
|
|
kfree(tags->static_rqs);
|
|
|
|
tags->static_rqs = NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
blk_mq_free_tags(tags);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2022-03-08 15:32:14 +08:00
|
|
|
static enum hctx_type hctx_idx_to_type(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < set->nr_maps; i++) {
|
|
|
|
unsigned int start = set->map[i].queue_offset;
|
|
|
|
unsigned int end = start + set->map[i].nr_queues;
|
|
|
|
|
|
|
|
if (hctx_idx >= start && hctx_idx < end)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (i >= set->nr_maps)
|
|
|
|
i = HCTX_TYPE_DEFAULT;
|
|
|
|
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int blk_mq_get_hctx_node(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx)
|
|
|
|
{
|
|
|
|
enum hctx_type type = hctx_idx_to_type(set, hctx_idx);
|
|
|
|
|
|
|
|
return blk_mq_hw_queue_to_node(&set->map[type], hctx_idx);
|
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
static struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx,
|
|
|
|
unsigned int nr_tags,
|
2021-10-05 18:23:37 +08:00
|
|
|
unsigned int reserved_tags)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2022-03-08 15:32:14 +08:00
|
|
|
int node = blk_mq_get_hctx_node(set, hctx_idx);
|
2014-04-16 04:14:00 +08:00
|
|
|
struct blk_mq_tags *tags;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-02-02 01:53:14 +08:00
|
|
|
if (node == NUMA_NO_NODE)
|
|
|
|
node = set->numa_node;
|
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
|
|
|
|
BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
|
2014-04-16 04:14:00 +08:00
|
|
|
if (!tags)
|
|
|
|
return NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
treewide: kzalloc_node() -> kcalloc_node()
The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:
kzalloc_node(a * b, gfp, node)
with:
kcalloc_node(a * b, gfp, node)
as well as handling cases of:
kzalloc_node(a * b * c, gfp, node)
with:
kzalloc_node(array3_size(a, b, c), gfp, node)
as it's slightly less ugly than:
kcalloc_node(array_size(a, b), c, gfp, node)
This does, however, attempt to ignore constant size factors like:
kzalloc_node(4 * 1024, gfp, node)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc_node
+ kcalloc_node
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc_node(sizeof(THING) * C2, ...)
|
kzalloc_node(sizeof(TYPE) * C2, ...)
|
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 05:04:20 +08:00
|
|
|
tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
|
2016-12-06 23:31:44 +08:00
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
|
2017-02-02 01:53:14 +08:00
|
|
|
node);
|
2022-11-02 10:52:29 +08:00
|
|
|
if (!tags->rqs)
|
|
|
|
goto err_free_tags;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
treewide: kzalloc_node() -> kcalloc_node()
The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:
kzalloc_node(a * b, gfp, node)
with:
kcalloc_node(a * b, gfp, node)
as well as handling cases of:
kzalloc_node(a * b * c, gfp, node)
with:
kzalloc_node(array3_size(a, b, c), gfp, node)
as it's slightly less ugly than:
kcalloc_node(array_size(a, b), c, gfp, node)
This does, however, attempt to ignore constant size factors like:
kzalloc_node(4 * 1024, gfp, node)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc_node
+ kcalloc_node
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc_node(sizeof(THING) * C2, ...)
|
kzalloc_node(sizeof(TYPE) * C2, ...)
|
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 05:04:20 +08:00
|
|
|
tags->static_rqs = kcalloc_node(nr_tags, sizeof(struct request *),
|
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
|
|
|
|
node);
|
2022-11-02 10:52:29 +08:00
|
|
|
if (!tags->static_rqs)
|
|
|
|
goto err_free_rqs;
|
2017-01-14 05:39:30 +08:00
|
|
|
|
2017-01-12 05:29:56 +08:00
|
|
|
return tags;
|
2022-11-02 10:52:29 +08:00
|
|
|
|
|
|
|
err_free_rqs:
|
|
|
|
kfree(tags->rqs);
|
|
|
|
err_free_tags:
|
|
|
|
blk_mq_free_tags(tags);
|
|
|
|
return NULL;
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
|
|
|
|
unsigned int hctx_idx, int node)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (set->ops->init_request) {
|
|
|
|
ret = set->ops->init_request(set, rq, hctx_idx, node);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-05-29 21:52:28 +08:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
static int blk_mq_alloc_rqs(struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_tags *tags,
|
|
|
|
unsigned int hctx_idx, unsigned int depth)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
|
|
|
unsigned int i, j, entries_per_page, max_order = 4;
|
2022-03-08 15:32:14 +08:00
|
|
|
int node = blk_mq_get_hctx_node(set, hctx_idx);
|
2017-01-12 05:29:56 +08:00
|
|
|
size_t rq_size, left;
|
2017-02-02 01:53:14 +08:00
|
|
|
|
|
|
|
if (node == NUMA_NO_NODE)
|
|
|
|
node = set->numa_node;
|
2017-01-12 05:29:56 +08:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&tags->page_list);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
|
|
|
* rq_size is the size of the request plus driver payload, rounded
|
|
|
|
* to the cacheline size
|
|
|
|
*/
|
2014-04-16 04:14:00 +08:00
|
|
|
rq_size = round_up(sizeof(struct request) + set->cmd_size,
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
cache_line_size());
|
2017-01-12 05:29:56 +08:00
|
|
|
left = rq_size * depth;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-01-12 05:29:56 +08:00
|
|
|
for (i = 0; i < depth; ) {
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
int this_order = max_order;
|
|
|
|
struct page *page;
|
|
|
|
int to_do;
|
|
|
|
void *p;
|
|
|
|
|
2016-05-16 23:54:47 +08:00
|
|
|
while (this_order && left < order_to_size(this_order - 1))
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
this_order--;
|
|
|
|
|
|
|
|
do {
|
2017-02-02 01:53:14 +08:00
|
|
|
page = alloc_pages_node(node,
|
2016-12-06 23:31:44 +08:00
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
|
2014-09-10 23:02:03 +08:00
|
|
|
this_order);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
if (page)
|
|
|
|
break;
|
|
|
|
if (!this_order--)
|
|
|
|
break;
|
|
|
|
if (order_to_size(this_order) < rq_size)
|
|
|
|
break;
|
|
|
|
} while (1);
|
|
|
|
|
|
|
|
if (!page)
|
2014-04-16 04:14:00 +08:00
|
|
|
goto fail;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
page->private = this_order;
|
2014-04-16 04:14:00 +08:00
|
|
|
list_add_tail(&page->lru, &tags->page_list);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
p = page_address(page);
|
2015-09-15 01:16:02 +08:00
|
|
|
/*
|
|
|
|
* Allow kmemleak to scan these pages as they contain pointers
|
|
|
|
* to additional allocations like via ops->init_request().
|
|
|
|
*/
|
2016-12-06 23:31:44 +08:00
|
|
|
kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
entries_per_page = order_to_size(this_order) / rq_size;
|
2017-01-12 05:29:56 +08:00
|
|
|
to_do = min(entries_per_page, depth - i);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
left -= to_do * rq_size;
|
|
|
|
for (j = 0; j < to_do; j++) {
|
2017-01-14 05:39:30 +08:00
|
|
|
struct request *rq = p;
|
|
|
|
|
|
|
|
tags->static_rqs[i] = rq;
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
if (blk_mq_init_request(set, rq, hctx_idx, node)) {
|
|
|
|
tags->static_rqs[i] = NULL;
|
|
|
|
goto fail;
|
2014-04-16 03:59:10 +08:00
|
|
|
}
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
p += rq_size;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
|
2017-01-12 05:29:56 +08:00
|
|
|
return 0;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
fail:
|
2017-01-12 05:29:56 +08:00
|
|
|
blk_mq_free_rqs(set, tags, hctx_idx);
|
|
|
|
return -ENOMEM;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2020-05-29 21:53:15 +08:00
|
|
|
struct rq_iter_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
bool has_rq;
|
|
|
|
};
|
|
|
|
|
2022-07-06 20:03:53 +08:00
|
|
|
static bool blk_mq_has_request(struct request *rq, void *data)
|
2020-05-29 21:53:15 +08:00
|
|
|
{
|
|
|
|
struct rq_iter_data *iter_data = data;
|
|
|
|
|
|
|
|
if (rq->mq_hctx != iter_data->hctx)
|
|
|
|
return true;
|
|
|
|
iter_data->has_rq = true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
struct blk_mq_tags *tags = hctx->sched_tags ?
|
|
|
|
hctx->sched_tags : hctx->tags;
|
|
|
|
struct rq_iter_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
};
|
|
|
|
|
|
|
|
blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
|
|
|
|
return data.has_rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
|
|
|
|
struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
2021-08-15 05:17:05 +08:00
|
|
|
if (cpumask_first_and(hctx->cpumask, cpu_online_mask) != cpu)
|
2020-05-29 21:53:15 +08:00
|
|
|
return false;
|
|
|
|
if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
|
|
|
|
struct blk_mq_hw_ctx, cpuhp_online);
|
|
|
|
|
|
|
|
if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
|
|
|
|
!blk_mq_last_cpu_in_hctx(cpu, hctx))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prevent new request from being allocated on the current hctx.
|
|
|
|
*
|
|
|
|
* The smp_mb__after_atomic() Pairs with the implied barrier in
|
|
|
|
* test_and_set_bit_lock in sbitmap_get(). Ensures the inactive flag is
|
|
|
|
* seen once we return from the tag allocator.
|
|
|
|
*/
|
|
|
|
set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
|
|
|
|
smp_mb__after_atomic();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to grab a reference to the queue and wait for any outstanding
|
|
|
|
* requests. If we could not grab a reference the queue has been
|
|
|
|
* frozen and there are no requests.
|
|
|
|
*/
|
|
|
|
if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
|
|
|
|
while (blk_mq_hctx_has_requests(hctx))
|
|
|
|
msleep(5);
|
|
|
|
percpu_ref_put(&hctx->queue->q_usage_counter);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
|
|
|
|
struct blk_mq_hw_ctx, cpuhp_online);
|
|
|
|
|
|
|
|
if (cpumask_test_cpu(cpu, hctx->cpumask))
|
|
|
|
clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-08-25 05:34:35 +08:00
|
|
|
/*
|
|
|
|
* 'cpu' is going away. splice any existing rq_list entries from this
|
|
|
|
* software queue to the hw queue dispatch list, and ensure that it
|
|
|
|
* gets run.
|
|
|
|
*/
|
2016-09-22 22:05:17 +08:00
|
|
|
static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
|
2014-05-22 04:01:15 +08:00
|
|
|
{
|
2016-09-22 22:05:17 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2014-05-22 04:01:15 +08:00
|
|
|
struct blk_mq_ctx *ctx;
|
|
|
|
LIST_HEAD(tmp);
|
2018-12-17 23:44:05 +08:00
|
|
|
enum hctx_type type;
|
2014-05-22 04:01:15 +08:00
|
|
|
|
2016-09-22 22:05:17 +08:00
|
|
|
hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
|
2020-05-29 21:53:15 +08:00
|
|
|
if (!cpumask_test_cpu(cpu, hctx->cpumask))
|
|
|
|
return 0;
|
|
|
|
|
2016-08-25 05:34:35 +08:00
|
|
|
ctx = __blk_mq_get_ctx(hctx->queue, cpu);
|
2018-12-17 23:44:05 +08:00
|
|
|
type = hctx->type;
|
2014-05-22 04:01:15 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
if (!list_empty(&ctx->rq_lists[type])) {
|
|
|
|
list_splice_init(&ctx->rq_lists[type], &tmp);
|
2014-05-22 04:01:15 +08:00
|
|
|
blk_mq_hctx_clear_pending(hctx, ctx);
|
|
|
|
}
|
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
|
|
|
|
if (list_empty(&tmp))
|
2016-09-22 22:05:17 +08:00
|
|
|
return 0;
|
2014-05-22 04:01:15 +08:00
|
|
|
|
2016-08-25 05:34:35 +08:00
|
|
|
spin_lock(&hctx->lock);
|
|
|
|
list_splice_tail_init(&tmp, &hctx->dispatch);
|
|
|
|
spin_unlock(&hctx->lock);
|
2014-05-22 04:01:15 +08:00
|
|
|
|
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
2016-09-22 22:05:17 +08:00
|
|
|
return 0;
|
2014-05-22 04:01:15 +08:00
|
|
|
}
|
|
|
|
|
2016-09-22 22:05:17 +08:00
|
|
|
static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
|
2014-05-22 04:01:15 +08:00
|
|
|
{
|
2020-05-29 21:53:15 +08:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_STACKING))
|
|
|
|
cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
|
|
|
|
&hctx->cpuhp_online);
|
2016-09-22 22:05:17 +08:00
|
|
|
cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
|
|
|
|
&hctx->cpuhp_dead);
|
2014-05-22 04:01:15 +08:00
|
|
|
}
|
|
|
|
|
2021-05-11 23:22:36 +08:00
|
|
|
/*
|
|
|
|
* Before freeing hw queue, clearing the flush request reference in
|
|
|
|
* tags->rqs[] for avoiding potential UAF.
|
|
|
|
*/
|
|
|
|
static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
|
|
|
|
unsigned int queue_depth, struct request *flush_rq)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
/* The hw queue may not be mapped yet */
|
|
|
|
if (!tags)
|
|
|
|
return;
|
|
|
|
|
2021-10-15 04:39:59 +08:00
|
|
|
WARN_ON_ONCE(req_ref_read(flush_rq) != 0);
|
2021-05-11 23:22:36 +08:00
|
|
|
|
|
|
|
for (i = 0; i < queue_depth; i++)
|
|
|
|
cmpxchg(&tags->rqs[i], flush_rq, NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait until all pending iteration is done.
|
|
|
|
*
|
|
|
|
* Request reference is cleared and it is guaranteed to be observed
|
|
|
|
* after the ->lock is released.
|
|
|
|
*/
|
|
|
|
spin_lock_irqsave(&tags->lock, flags);
|
|
|
|
spin_unlock_irqrestore(&tags->lock, flags);
|
|
|
|
}
|
|
|
|
|
2015-06-04 22:25:04 +08:00
|
|
|
/* hctx->ctxs will be freed in queue's release handler */
|
2014-09-25 23:23:38 +08:00
|
|
|
static void blk_mq_exit_hctx(struct request_queue *q,
|
|
|
|
struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
|
|
|
|
{
|
2021-05-11 23:22:36 +08:00
|
|
|
struct request *flush_rq = hctx->fq->flush_rq;
|
|
|
|
|
2018-01-09 21:28:29 +08:00
|
|
|
if (blk_mq_hw_queue_mapped(hctx))
|
|
|
|
blk_mq_tag_idle(hctx);
|
2014-09-25 23:23:38 +08:00
|
|
|
|
2022-06-16 09:44:01 +08:00
|
|
|
if (blk_queue_init_done(q))
|
|
|
|
blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
|
|
|
|
set->queue_depth, flush_rq);
|
2014-09-25 23:23:47 +08:00
|
|
|
if (set->ops->exit_request)
|
2021-05-11 23:22:36 +08:00
|
|
|
set->ops->exit_request(set, flush_rq, hctx_idx);
|
2014-09-25 23:23:47 +08:00
|
|
|
|
2014-09-25 23:23:38 +08:00
|
|
|
if (set->ops->exit_hctx)
|
|
|
|
set->ops->exit_hctx(hctx, hctx_idx);
|
|
|
|
|
2016-09-22 22:05:17 +08:00
|
|
|
blk_mq_remove_cpuhp(hctx);
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
xa_erase(&q->hctx_table, hctx_idx);
|
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
spin_lock(&q->unused_hctx_lock);
|
|
|
|
list_add(&hctx->hctx_list, &q->unused_hctx_list);
|
|
|
|
spin_unlock(&q->unused_hctx_lock);
|
2014-09-25 23:23:38 +08:00
|
|
|
}
|
|
|
|
|
2014-05-27 23:35:13 +08:00
|
|
|
static void blk_mq_exit_hw_queues(struct request_queue *q,
|
|
|
|
struct blk_mq_tag_set *set, int nr_queue)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2014-05-27 23:35:13 +08:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
|
|
|
if (i == nr_queue)
|
|
|
|
break;
|
2014-09-25 23:23:38 +08:00
|
|
|
blk_mq_exit_hctx(q, set, hctx, i);
|
2014-05-27 23:35:13 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-09-25 23:23:38 +08:00
|
|
|
static int blk_mq_init_hctx(struct request_queue *q,
|
|
|
|
struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2019-04-30 09:52:26 +08:00
|
|
|
hctx->queue_num = hctx_idx;
|
|
|
|
|
2020-05-29 21:53:15 +08:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_STACKING))
|
|
|
|
cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
|
|
|
|
&hctx->cpuhp_online);
|
2019-04-30 09:52:26 +08:00
|
|
|
cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
|
|
|
|
|
|
|
|
hctx->tags = set->tags[hctx_idx];
|
|
|
|
|
|
|
|
if (set->ops->init_hctx &&
|
|
|
|
set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
|
|
|
|
goto unregister_cpu_notifier;
|
2014-09-25 23:23:38 +08:00
|
|
|
|
2019-04-30 09:52:26 +08:00
|
|
|
if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
|
|
|
|
hctx->numa_node))
|
|
|
|
goto exit_hctx;
|
2022-03-08 15:32:19 +08:00
|
|
|
|
|
|
|
if (xa_insert(&q->hctx_table, hctx_idx, hctx, GFP_KERNEL))
|
|
|
|
goto exit_flush_rq;
|
|
|
|
|
2019-04-30 09:52:26 +08:00
|
|
|
return 0;
|
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
exit_flush_rq:
|
|
|
|
if (set->ops->exit_request)
|
|
|
|
set->ops->exit_request(set, hctx->fq->flush_rq, hctx_idx);
|
2019-04-30 09:52:26 +08:00
|
|
|
exit_hctx:
|
|
|
|
if (set->ops->exit_hctx)
|
|
|
|
set->ops->exit_hctx(hctx, hctx_idx);
|
|
|
|
unregister_cpu_notifier:
|
|
|
|
blk_mq_remove_cpuhp(hctx);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct blk_mq_hw_ctx *
|
|
|
|
blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
|
|
|
|
int node)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
|
|
|
|
|
2021-12-03 21:15:32 +08:00
|
|
|
hctx = kzalloc_node(sizeof(struct blk_mq_hw_ctx), gfp, node);
|
2019-04-30 09:52:26 +08:00
|
|
|
if (!hctx)
|
|
|
|
goto fail_alloc_hctx;
|
|
|
|
|
|
|
|
if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
|
|
|
|
goto free_hctx;
|
|
|
|
|
|
|
|
atomic_set(&hctx->nr_active, 0);
|
2014-09-25 23:23:38 +08:00
|
|
|
if (node == NUMA_NO_NODE)
|
2019-04-30 09:52:26 +08:00
|
|
|
node = set->numa_node;
|
|
|
|
hctx->numa_node = node;
|
2014-09-25 23:23:38 +08:00
|
|
|
|
2017-04-10 23:54:54 +08:00
|
|
|
INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
|
2014-09-25 23:23:38 +08:00
|
|
|
spin_lock_init(&hctx->lock);
|
|
|
|
INIT_LIST_HEAD(&hctx->dispatch);
|
|
|
|
hctx->queue = q;
|
2020-08-19 23:20:19 +08:00
|
|
|
hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;
|
2014-09-25 23:23:38 +08:00
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
INIT_LIST_HEAD(&hctx->hctx_list);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
/*
|
2014-09-25 23:23:38 +08:00
|
|
|
* Allocate space for all possible cpus to avoid allocation at
|
|
|
|
* runtime
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
*/
|
2017-11-16 09:32:33 +08:00
|
|
|
hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
|
2019-04-30 09:52:26 +08:00
|
|
|
gfp, node);
|
2014-09-25 23:23:38 +08:00
|
|
|
if (!hctx->ctxs)
|
2019-04-30 09:52:26 +08:00
|
|
|
goto free_cpumask;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2018-10-12 18:07:26 +08:00
|
|
|
if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
|
2021-01-22 10:33:08 +08:00
|
|
|
gfp, node, false, false))
|
2014-09-25 23:23:38 +08:00
|
|
|
goto free_ctxs;
|
|
|
|
hctx->nr_ctx = 0;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2018-06-25 19:31:47 +08:00
|
|
|
spin_lock_init(&hctx->dispatch_wait_lock);
|
2017-11-09 23:32:43 +08:00
|
|
|
init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
|
|
|
|
INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
|
|
|
|
|
2020-03-10 05:41:37 +08:00
|
|
|
hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
|
2014-09-25 23:23:47 +08:00
|
|
|
if (!hctx->fq)
|
2019-04-30 09:52:26 +08:00
|
|
|
goto free_bitmap;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2019-04-30 09:52:26 +08:00
|
|
|
blk_mq_hctx_kobj_init(hctx);
|
2016-11-03 00:09:51 +08:00
|
|
|
|
2019-04-30 09:52:26 +08:00
|
|
|
return hctx;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2014-09-25 23:23:38 +08:00
|
|
|
free_bitmap:
|
2016-09-17 22:38:44 +08:00
|
|
|
sbitmap_free(&hctx->ctx_map);
|
2014-09-25 23:23:38 +08:00
|
|
|
free_ctxs:
|
|
|
|
kfree(hctx->ctxs);
|
2019-04-30 09:52:26 +08:00
|
|
|
free_cpumask:
|
|
|
|
free_cpumask_var(hctx->cpumask);
|
|
|
|
free_hctx:
|
|
|
|
kfree(hctx);
|
|
|
|
fail_alloc_hctx:
|
|
|
|
return NULL;
|
2014-09-25 23:23:38 +08:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
static void blk_mq_init_cpu_queues(struct request_queue *q,
|
|
|
|
unsigned int nr_hw_queues)
|
|
|
|
{
|
2018-10-31 00:36:06 +08:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
unsigned int i, j;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
for_each_possible_cpu(i) {
|
|
|
|
struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i);
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2018-12-17 23:44:05 +08:00
|
|
|
int k;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
__ctx->cpu = i;
|
|
|
|
spin_lock_init(&__ctx->lock);
|
2018-12-17 23:44:05 +08:00
|
|
|
for (k = HCTX_TYPE_DEFAULT; k < HCTX_MAX_TYPES; k++)
|
|
|
|
INIT_LIST_HEAD(&__ctx->rq_lists[k]);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
__ctx->queue = q;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set local node, IFF we have more than one hw queue. If
|
|
|
|
* not, we remain on the home node of the device
|
|
|
|
*/
|
2018-10-31 00:36:06 +08:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
|
|
|
hctx = blk_mq_map_queue_type(q, j, i);
|
|
|
|
if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
|
2020-10-19 16:20:47 +08:00
|
|
|
hctx->numa_node = cpu_to_node(i);
|
2018-10-31 00:36:06 +08:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx,
|
|
|
|
unsigned int depth)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
2021-10-05 18:23:35 +08:00
|
|
|
struct blk_mq_tags *tags;
|
|
|
|
int ret;
|
2017-01-12 05:29:56 +08:00
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
tags = blk_mq_alloc_rq_map(set, hctx_idx, depth, set->reserved_tags);
|
2021-10-05 18:23:35 +08:00
|
|
|
if (!tags)
|
|
|
|
return NULL;
|
2017-01-12 05:29:56 +08:00
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
ret = blk_mq_alloc_rqs(set, tags, hctx_idx, depth);
|
|
|
|
if (ret) {
|
2021-10-05 18:23:37 +08:00
|
|
|
blk_mq_free_rq_map(tags);
|
2021-10-05 18:23:35 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
2017-01-12 05:29:56 +08:00
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
return tags;
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
static bool __blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
|
|
|
|
int hctx_idx)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
|
|
|
set->tags[hctx_idx] = set->shared_tags;
|
2020-08-19 23:20:22 +08:00
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
return true;
|
2017-01-17 21:03:22 +08:00
|
|
|
}
|
2021-10-05 18:23:37 +08:00
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs(set, hctx_idx,
|
|
|
|
set->queue_depth);
|
|
|
|
|
|
|
|
return set->tags[hctx_idx];
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:36 +08:00
|
|
|
void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_tags *tags,
|
|
|
|
unsigned int hctx_idx)
|
2017-01-12 05:29:56 +08:00
|
|
|
{
|
2021-10-05 18:23:36 +08:00
|
|
|
if (tags) {
|
|
|
|
blk_mq_free_rqs(set, tags, hctx_idx);
|
2021-10-05 18:23:37 +08:00
|
|
|
blk_mq_free_rq_map(tags);
|
2017-01-17 21:03:22 +08:00
|
|
|
}
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
2021-10-05 18:23:37 +08:00
|
|
|
static void __blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx)
|
|
|
|
{
|
2021-10-05 18:23:39 +08:00
|
|
|
if (!blk_mq_is_shared_tags(set->flags))
|
2021-10-05 18:23:37 +08:00
|
|
|
blk_mq_free_map_and_rqs(set, set->tags[hctx_idx], hctx_idx);
|
|
|
|
|
|
|
|
set->tags[hctx_idx] = NULL;
|
2017-01-12 05:29:56 +08:00
|
|
|
}
|
|
|
|
|
2017-06-26 18:20:57 +08:00
|
|
|
static void blk_mq_map_swqueue(struct request_queue *q)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned int j, hctx_idx;
|
|
|
|
unsigned long i;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct blk_mq_ctx *ctx;
|
2015-04-21 10:00:20 +08:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2014-04-10 00:18:23 +08:00
|
|
|
cpumask_clear(hctx->cpumask);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
hctx->nr_ctx = 0;
|
2018-05-18 22:32:30 +08:00
|
|
|
hctx->dispatch_from = NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-06-26 18:20:57 +08:00
|
|
|
* Map software to hardware queues.
|
2018-04-25 04:01:44 +08:00
|
|
|
*
|
|
|
|
* If the cpu isn't present, the cpu is mapped to first hctx.
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
*/
|
2018-01-12 10:53:06 +08:00
|
|
|
for_each_possible_cpu(i) {
|
2018-04-25 04:01:44 +08:00
|
|
|
|
2016-03-19 18:30:33 +08:00
|
|
|
ctx = per_cpu_ptr(q->queue_ctx, i);
|
2018-10-31 00:36:06 +08:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
2019-01-24 18:25:33 +08:00
|
|
|
if (!set->map[j].nr_queues) {
|
|
|
|
ctx->hctxs[j] = blk_mq_map_queue_type(q,
|
|
|
|
HCTX_TYPE_DEFAULT, i);
|
2018-12-18 01:28:56 +08:00
|
|
|
continue;
|
2019-01-24 18:25:33 +08:00
|
|
|
}
|
block: alloc map and request for new hardware queue
Alloc new map and request for new hardware queue when increse
hardware queue count. Before this patch, it will show a
warning for each new hardware queue, but it's not enough, these
hctx have no maps and reqeust, when a bio was mapped to these
hardware queue, it will trigger kernel panic when get request
from these hctx.
Test environment:
* A NVMe disk supports 128 io queues
* 96 cpus in system
A corner case can always trigger this panic, there are 96
io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel
log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
queues to 96, then nvme will alloc others(32) queues for read, but
blk_mq_update_nr_hw_queues does not alloc map and request for these new
added io queues. So when process read nvme disk, it will trigger kernel
panic when get request from these hardware context.
Reproduce script:
nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
echo $nr > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller
dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
[ 8040.805626] ------------[ cut here ]------------
[ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
[ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
[ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
[ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
[ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
[ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
[ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
[ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
[ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
[ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
[ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8040.805647] PKRU: 55555554
[ 8040.805647] Call Trace:
[ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
[ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
[ 8040.805651] process_one_work+0x1a7/0x370
[ 8040.805652] worker_thread+0x1c9/0x380
[ 8040.805653] ? max_active_store+0x80/0x80
[ 8040.805655] kthread+0x112/0x130
[ 8040.805656] ? __kthread_parkme+0x70/0x70
[ 8040.805657] ret_from_fork+0x35/0x40
[ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
[ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 8229.365165] #PF: supervisor read access in kernel mode
[ 8229.365178] #PF: error_code(0x0000) - not-present page
[ 8229.365191] PGD 0 P4D 0
[ 8229.365201] Oops: 0000 [#1] SMP PTI
[ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
[ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
[ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
[ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
[ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
[ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
[ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
[ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
[ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
[ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
[ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8229.365476] PKRU: 55555554
[ 8229.365484] Call Trace:
[ 8229.365498] ? finish_wait+0x80/0x80
[ 8229.365512] blk_mq_get_request+0xcb/0x3f0
[ 8229.365525] blk_mq_make_request+0x143/0x5d0
[ 8229.365538] generic_make_request+0xcf/0x310
[ 8229.365553] ? scan_shadow_nodes+0x30/0x30
[ 8229.365564] submit_bio+0x3c/0x150
[ 8229.365576] mpage_readpages+0x163/0x1a0
[ 8229.365588] ? blkdev_direct_IO+0x490/0x490
[ 8229.365601] read_pages+0x6b/0x190
[ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
[ 8229.365626] ondemand_readahead+0x182/0x2f0
[ 8229.365639] generic_file_buffered_read+0x590/0xab0
[ 8229.365655] new_sync_read+0x12a/0x1c0
[ 8229.365666] vfs_read+0x8a/0x140
[ 8229.365676] ksys_read+0x59/0xd0
[ 8229.365688] do_syscall_64+0x55/0x1d0
[ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07 21:04:08 +08:00
|
|
|
hctx_idx = set->map[j].mq_map[i];
|
|
|
|
/* unmapped hw queue can be remapped after CPU topo changed */
|
|
|
|
if (!set->tags[hctx_idx] &&
|
2021-10-05 18:23:35 +08:00
|
|
|
!__blk_mq_alloc_map_and_rqs(set, hctx_idx)) {
|
block: alloc map and request for new hardware queue
Alloc new map and request for new hardware queue when increse
hardware queue count. Before this patch, it will show a
warning for each new hardware queue, but it's not enough, these
hctx have no maps and reqeust, when a bio was mapped to these
hardware queue, it will trigger kernel panic when get request
from these hctx.
Test environment:
* A NVMe disk supports 128 io queues
* 96 cpus in system
A corner case can always trigger this panic, there are 96
io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel
log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
queues to 96, then nvme will alloc others(32) queues for read, but
blk_mq_update_nr_hw_queues does not alloc map and request for these new
added io queues. So when process read nvme disk, it will trigger kernel
panic when get request from these hardware context.
Reproduce script:
nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
echo $nr > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller
dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
[ 8040.805626] ------------[ cut here ]------------
[ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
[ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
[ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
[ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
[ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
[ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
[ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
[ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
[ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
[ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
[ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8040.805647] PKRU: 55555554
[ 8040.805647] Call Trace:
[ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
[ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
[ 8040.805651] process_one_work+0x1a7/0x370
[ 8040.805652] worker_thread+0x1c9/0x380
[ 8040.805653] ? max_active_store+0x80/0x80
[ 8040.805655] kthread+0x112/0x130
[ 8040.805656] ? __kthread_parkme+0x70/0x70
[ 8040.805657] ret_from_fork+0x35/0x40
[ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
[ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 8229.365165] #PF: supervisor read access in kernel mode
[ 8229.365178] #PF: error_code(0x0000) - not-present page
[ 8229.365191] PGD 0 P4D 0
[ 8229.365201] Oops: 0000 [#1] SMP PTI
[ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
[ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
[ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
[ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
[ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
[ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
[ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
[ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
[ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
[ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
[ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8229.365476] PKRU: 55555554
[ 8229.365484] Call Trace:
[ 8229.365498] ? finish_wait+0x80/0x80
[ 8229.365512] blk_mq_get_request+0xcb/0x3f0
[ 8229.365525] blk_mq_make_request+0x143/0x5d0
[ 8229.365538] generic_make_request+0xcf/0x310
[ 8229.365553] ? scan_shadow_nodes+0x30/0x30
[ 8229.365564] submit_bio+0x3c/0x150
[ 8229.365576] mpage_readpages+0x163/0x1a0
[ 8229.365588] ? blkdev_direct_IO+0x490/0x490
[ 8229.365601] read_pages+0x6b/0x190
[ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
[ 8229.365626] ondemand_readahead+0x182/0x2f0
[ 8229.365639] generic_file_buffered_read+0x590/0xab0
[ 8229.365655] new_sync_read+0x12a/0x1c0
[ 8229.365666] vfs_read+0x8a/0x140
[ 8229.365676] ksys_read+0x59/0xd0
[ 8229.365688] do_syscall_64+0x55/0x1d0
[ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07 21:04:08 +08:00
|
|
|
/*
|
|
|
|
* If tags initialization fail for some hctx,
|
|
|
|
* that hctx won't be brought online. In this
|
|
|
|
* case, remap the current ctx to hctx[0] which
|
|
|
|
* is guaranteed to always have tags allocated
|
|
|
|
*/
|
|
|
|
set->map[j].mq_map[i] = 0;
|
|
|
|
}
|
2018-12-18 01:28:56 +08:00
|
|
|
|
2018-10-31 00:36:06 +08:00
|
|
|
hctx = blk_mq_map_queue_type(q, j, i);
|
2019-01-24 18:25:32 +08:00
|
|
|
ctx->hctxs[j] = hctx;
|
2018-10-31 00:36:06 +08:00
|
|
|
/*
|
|
|
|
* If the CPU is already set in the mask, then we've
|
|
|
|
* mapped this one already. This can happen if
|
|
|
|
* devices share queues across queue maps.
|
|
|
|
*/
|
|
|
|
if (cpumask_test_cpu(i, hctx->cpumask))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
cpumask_set_cpu(i, hctx->cpumask);
|
|
|
|
hctx->type = j;
|
|
|
|
ctx->index_hw[hctx->type] = hctx->nr_ctx;
|
|
|
|
hctx->ctxs[hctx->nr_ctx++] = ctx;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the nr_ctx type overflows, we have exceeded the
|
|
|
|
* amount of sw queues we can support.
|
|
|
|
*/
|
|
|
|
BUG_ON(!hctx->nr_ctx);
|
|
|
|
}
|
2019-01-24 18:25:33 +08:00
|
|
|
|
|
|
|
for (; j < HCTX_MAX_TYPES; j++)
|
|
|
|
ctx->hctxs[j] = blk_mq_map_queue_type(q,
|
|
|
|
HCTX_TYPE_DEFAULT, i);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2014-05-08 00:26:44 +08:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2018-04-25 04:01:44 +08:00
|
|
|
/*
|
|
|
|
* If no software queues are mapped to this hardware queue,
|
|
|
|
* disable it and free the request entries.
|
|
|
|
*/
|
|
|
|
if (!hctx->nr_ctx) {
|
|
|
|
/* Never unmap queue 0. We need it as a
|
|
|
|
* fallback in case of a new remap fails
|
|
|
|
* allocation
|
|
|
|
*/
|
2021-10-05 18:23:37 +08:00
|
|
|
if (i)
|
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
2018-04-25 04:01:44 +08:00
|
|
|
|
|
|
|
hctx->tags = NULL;
|
|
|
|
continue;
|
|
|
|
}
|
2014-05-22 04:01:15 +08:00
|
|
|
|
2015-04-21 10:00:20 +08:00
|
|
|
hctx->tags = set->tags[i];
|
|
|
|
WARN_ON(!hctx->tags);
|
|
|
|
|
2015-04-16 01:39:29 +08:00
|
|
|
/*
|
|
|
|
* Set the map size to the number of mapped software queues.
|
|
|
|
* This is more accurate and more efficient than looping
|
|
|
|
* over all possibly mapped software queues.
|
|
|
|
*/
|
2016-09-17 22:38:44 +08:00
|
|
|
sbitmap_resize(&hctx->ctx_map, hctx->nr_ctx);
|
2015-04-16 01:39:29 +08:00
|
|
|
|
2014-05-22 04:01:15 +08:00
|
|
|
/*
|
|
|
|
* Initialize batch roundrobin counts
|
|
|
|
*/
|
2018-04-08 17:48:10 +08:00
|
|
|
hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
|
2014-05-08 00:26:44 +08:00
|
|
|
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
|
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2017-06-21 07:56:13 +08:00
|
|
|
/*
|
|
|
|
* Caller needs to ensure that we're either frozen/quiesced, or that
|
|
|
|
* the queue isn't live yet.
|
|
|
|
*/
|
2015-11-03 23:40:06 +08:00
|
|
|
static void queue_set_hctx_shared(struct request_queue *q, bool shared)
|
2014-05-14 05:10:52 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2014-05-14 05:10:52 +08:00
|
|
|
|
2015-11-03 23:40:06 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2021-07-31 14:21:30 +08:00
|
|
|
if (shared) {
|
2020-08-19 23:20:19 +08:00
|
|
|
hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
|
2021-07-31 14:21:30 +08:00
|
|
|
} else {
|
|
|
|
blk_mq_tag_idle(hctx);
|
2020-08-19 23:20:19 +08:00
|
|
|
hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
|
2021-07-31 14:21:30 +08:00
|
|
|
}
|
2015-11-03 23:40:06 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-08-19 23:20:20 +08:00
|
|
|
static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
|
|
|
|
bool shared)
|
2015-11-03 23:40:06 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2014-05-14 05:10:52 +08:00
|
|
|
|
2017-04-08 02:16:49 +08:00
|
|
|
lockdep_assert_held(&set->tag_list_lock);
|
|
|
|
|
2014-05-14 05:10:52 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_freeze_queue(q);
|
2015-11-03 23:40:06 +08:00
|
|
|
queue_set_hctx_shared(q, shared);
|
2014-05-14 05:10:52 +08:00
|
|
|
blk_mq_unfreeze_queue(q);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_del_queue_tag_set(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
2020-07-28 21:29:51 +08:00
|
|
|
list_del(&q->tag_set_list);
|
2015-11-03 23:40:06 +08:00
|
|
|
if (list_is_singular(&set->tag_list)) {
|
|
|
|
/* just transitioned to unshared */
|
2020-08-19 23:20:19 +08:00
|
|
|
set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
|
2015-11-03 23:40:06 +08:00
|
|
|
/* update existing queue */
|
2020-08-19 23:20:20 +08:00
|
|
|
blk_mq_update_tag_set_shared(set, false);
|
2015-11-03 23:40:06 +08:00
|
|
|
}
|
2014-05-14 05:10:52 +08:00
|
|
|
mutex_unlock(&set->tag_list_lock);
|
2018-06-11 04:38:24 +08:00
|
|
|
INIT_LIST_HEAD(&q->tag_set_list);
|
2014-05-14 05:10:52 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
2015-11-03 23:40:06 +08:00
|
|
|
|
2017-11-11 13:05:12 +08:00
|
|
|
/*
|
|
|
|
* Check to see if we're transitioning to shared (from 1 to 2 queues).
|
|
|
|
*/
|
|
|
|
if (!list_empty(&set->tag_list) &&
|
2020-08-19 23:20:19 +08:00
|
|
|
!(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
|
|
|
|
set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
|
2015-11-03 23:40:06 +08:00
|
|
|
/* update existing queue */
|
2020-08-19 23:20:20 +08:00
|
|
|
blk_mq_update_tag_set_shared(set, true);
|
2015-11-03 23:40:06 +08:00
|
|
|
}
|
2020-08-19 23:20:19 +08:00
|
|
|
if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
|
2015-11-03 23:40:06 +08:00
|
|
|
queue_set_hctx_shared(q, true);
|
2020-07-28 21:29:51 +08:00
|
|
|
list_add_tail(&q->tag_set_list, &set->tag_list);
|
2015-11-03 23:40:06 +08:00
|
|
|
|
2014-05-14 05:10:52 +08:00
|
|
|
mutex_unlock(&set->tag_list_lock);
|
|
|
|
}
|
|
|
|
|
2018-11-20 09:44:35 +08:00
|
|
|
/* All allocations will be freed in release handler of q->mq_kobj */
|
|
|
|
static int blk_mq_alloc_ctxs(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_ctxs *ctxs;
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
ctxs = kzalloc(sizeof(*ctxs), GFP_KERNEL);
|
|
|
|
if (!ctxs)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ctxs->queue_ctx = alloc_percpu(struct blk_mq_ctx);
|
|
|
|
if (!ctxs->queue_ctx)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
struct blk_mq_ctx *ctx = per_cpu_ptr(ctxs->queue_ctx, cpu);
|
|
|
|
ctx->ctxs = ctxs;
|
|
|
|
}
|
|
|
|
|
|
|
|
q->mq_kobj = &ctxs->kobj;
|
|
|
|
q->queue_ctx = ctxs->queue_ctx;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
fail:
|
|
|
|
kfree(ctxs);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2015-01-29 20:17:27 +08:00
|
|
|
/*
|
|
|
|
* It is the actual release handler for mq, but we do it from
|
|
|
|
* request queue's release handler for avoiding use-after-free
|
|
|
|
* and headache because q->mq_kobj shouldn't have been introduced,
|
|
|
|
* but we can't group ctx/kctx kobj without it.
|
|
|
|
*/
|
|
|
|
void blk_mq_release(struct request_queue *q)
|
|
|
|
{
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx, *next;
|
2022-03-08 15:32:18 +08:00
|
|
|
unsigned long i;
|
2015-01-29 20:17:27 +08:00
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
|
|
|
|
|
|
|
|
/* all hctx are in .unused_hctx_list now */
|
|
|
|
list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
|
|
|
|
list_del_init(&hctx->hctx_list);
|
2017-02-22 18:14:01 +08:00
|
|
|
kobject_put(&hctx->kobj);
|
2015-06-04 22:25:04 +08:00
|
|
|
}
|
2015-01-29 20:17:27 +08:00
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
xa_destroy(&q->hctx_table);
|
2015-01-29 20:17:27 +08:00
|
|
|
|
2017-02-22 18:14:00 +08:00
|
|
|
/*
|
|
|
|
* release .mq_kobj and sw queue's kobject now because
|
|
|
|
* both share lifetime with request queue.
|
|
|
|
*/
|
|
|
|
blk_mq_sysfs_deinit(q);
|
2015-01-29 20:17:27 +08:00
|
|
|
}
|
|
|
|
|
2021-06-24 16:10:12 +08:00
|
|
|
static struct request_queue *blk_mq_init_queue_data(struct blk_mq_tag_set *set,
|
2020-03-27 16:30:08 +08:00
|
|
|
void *queuedata)
|
2015-03-13 11:56:02 +08:00
|
|
|
{
|
2021-06-02 14:53:17 +08:00
|
|
|
struct request_queue *q;
|
|
|
|
int ret;
|
2015-03-13 11:56:02 +08:00
|
|
|
|
2022-11-01 23:00:47 +08:00
|
|
|
q = blk_alloc_queue(set->numa_node);
|
2021-06-02 14:53:17 +08:00
|
|
|
if (!q)
|
2015-03-13 11:56:02 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2021-06-02 14:53:17 +08:00
|
|
|
q->queuedata = queuedata;
|
|
|
|
ret = blk_mq_init_allocated_queue(set, q);
|
|
|
|
if (ret) {
|
2022-06-19 14:05:51 +08:00
|
|
|
blk_put_queue(q);
|
2021-06-02 14:53:17 +08:00
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
2015-03-13 11:56:02 +08:00
|
|
|
return q;
|
|
|
|
}
|
2020-03-27 16:30:08 +08:00
|
|
|
|
|
|
|
struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
|
|
|
|
{
|
|
|
|
return blk_mq_init_queue_data(set, NULL);
|
|
|
|
}
|
2015-03-13 11:56:02 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_init_queue);
|
|
|
|
|
2022-06-19 14:05:51 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_destroy_queue - shutdown a request queue
|
|
|
|
* @q: request queue to shutdown
|
|
|
|
*
|
2023-01-31 05:12:33 +08:00
|
|
|
* This shuts down a request queue allocated by blk_mq_init_queue(). All future
|
|
|
|
* requests will be failed with -ENODEV. The caller is responsible for dropping
|
|
|
|
* the reference from blk_mq_init_queue() by calling blk_put_queue().
|
2022-06-19 14:05:51 +08:00
|
|
|
*
|
|
|
|
* Context: can sleep
|
|
|
|
*/
|
|
|
|
void blk_mq_destroy_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
WARN_ON_ONCE(!queue_is_mq(q));
|
|
|
|
WARN_ON_ONCE(blk_queue_registered(q));
|
|
|
|
|
|
|
|
might_sleep();
|
|
|
|
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_DYING, q);
|
|
|
|
blk_queue_start_drain(q);
|
2022-10-30 16:32:12 +08:00
|
|
|
blk_mq_freeze_queue_wait(q);
|
2022-06-19 14:05:51 +08:00
|
|
|
|
|
|
|
blk_sync_queue(q);
|
|
|
|
blk_mq_cancel_work_sync(q);
|
|
|
|
blk_mq_exit_queue(q);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_destroy_queue);
|
|
|
|
|
2021-08-16 21:19:05 +08:00
|
|
|
struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set, void *queuedata,
|
|
|
|
struct lock_class_key *lkclass)
|
2018-10-15 22:40:37 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2021-06-02 14:53:18 +08:00
|
|
|
struct gendisk *disk;
|
2018-10-15 22:40:37 +08:00
|
|
|
|
2021-06-02 14:53:18 +08:00
|
|
|
q = blk_mq_init_queue_data(set, queuedata);
|
|
|
|
if (IS_ERR(q))
|
|
|
|
return ERR_CAST(q);
|
2018-10-15 22:40:37 +08:00
|
|
|
|
2021-08-16 21:19:08 +08:00
|
|
|
disk = __alloc_disk_node(q, set->numa_node, lkclass);
|
2021-06-02 14:53:18 +08:00
|
|
|
if (!disk) {
|
2022-07-20 21:05:40 +08:00
|
|
|
blk_mq_destroy_queue(q);
|
2022-10-18 21:57:17 +08:00
|
|
|
blk_put_queue(q);
|
2021-06-02 14:53:18 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2018-10-15 22:40:37 +08:00
|
|
|
}
|
2022-06-19 14:05:51 +08:00
|
|
|
set_bit(GD_OWNS_QUEUE, &disk->state);
|
2021-06-02 14:53:18 +08:00
|
|
|
return disk;
|
2018-10-15 22:40:37 +08:00
|
|
|
}
|
2021-06-02 14:53:18 +08:00
|
|
|
EXPORT_SYMBOL(__blk_mq_alloc_disk);
|
2018-10-15 22:40:37 +08:00
|
|
|
|
2022-06-19 14:05:51 +08:00
|
|
|
struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
|
|
|
|
struct lock_class_key *lkclass)
|
|
|
|
{
|
2022-11-22 15:27:53 +08:00
|
|
|
struct gendisk *disk;
|
|
|
|
|
2022-06-19 14:05:51 +08:00
|
|
|
if (!blk_get_queue(q))
|
|
|
|
return NULL;
|
2022-11-22 15:27:53 +08:00
|
|
|
disk = __alloc_disk_node(q, NUMA_NO_NODE, lkclass);
|
|
|
|
if (!disk)
|
|
|
|
blk_put_queue(q);
|
|
|
|
return disk;
|
2022-06-19 14:05:51 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_disk_for_queue);
|
|
|
|
|
2018-10-12 18:07:27 +08:00
|
|
|
static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
|
|
|
|
struct blk_mq_tag_set *set, struct request_queue *q,
|
|
|
|
int hctx_idx, int node)
|
|
|
|
{
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = NULL, *tmp;
|
2018-10-12 18:07:27 +08:00
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
/* reuse dead hctx first */
|
|
|
|
spin_lock(&q->unused_hctx_lock);
|
|
|
|
list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
|
|
|
|
if (tmp->numa_node == node) {
|
|
|
|
hctx = tmp;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (hctx)
|
|
|
|
list_del_init(&hctx->hctx_list);
|
|
|
|
spin_unlock(&q->unused_hctx_lock);
|
|
|
|
|
|
|
|
if (!hctx)
|
|
|
|
hctx = blk_mq_alloc_hctx(q, set, node);
|
2018-10-12 18:07:27 +08:00
|
|
|
if (!hctx)
|
2019-04-30 09:52:26 +08:00
|
|
|
goto fail;
|
2018-10-12 18:07:27 +08:00
|
|
|
|
2019-04-30 09:52:26 +08:00
|
|
|
if (blk_mq_init_hctx(q, set, hctx, hctx_idx))
|
|
|
|
goto free_hctx;
|
2018-10-12 18:07:27 +08:00
|
|
|
|
|
|
|
return hctx;
|
2019-04-30 09:52:26 +08:00
|
|
|
|
|
|
|
free_hctx:
|
|
|
|
kobject_put(&hctx->kobj);
|
|
|
|
fail:
|
|
|
|
return NULL;
|
2018-10-12 18:07:27 +08:00
|
|
|
}
|
|
|
|
|
2015-12-18 08:08:14 +08:00
|
|
|
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2022-03-08 15:32:19 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
unsigned long i, j;
|
2019-10-26 00:50:09 +08:00
|
|
|
|
2018-01-06 16:27:40 +08:00
|
|
|
/* protect against switching io scheduler */
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
2014-04-16 04:14:00 +08:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++) {
|
2022-03-08 15:32:15 +08:00
|
|
|
int old_node;
|
2022-03-08 15:32:14 +08:00
|
|
|
int node = blk_mq_get_hctx_node(set, i);
|
2022-03-08 15:32:19 +08:00
|
|
|
struct blk_mq_hw_ctx *old_hctx = xa_load(&q->hctx_table, i);
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2022-03-08 15:32:15 +08:00
|
|
|
if (old_hctx) {
|
|
|
|
old_node = old_hctx->numa_node;
|
|
|
|
blk_mq_exit_hctx(q, set, old_hctx, i);
|
|
|
|
}
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
if (!blk_mq_alloc_and_init_hctx(set, q, i, node)) {
|
2022-03-08 15:32:15 +08:00
|
|
|
if (!old_hctx)
|
2018-10-12 18:07:27 +08:00
|
|
|
break;
|
2022-03-08 15:32:15 +08:00
|
|
|
pr_warn("Allocate new hctx on node %d fails, fallback to previous one on node %d\n",
|
|
|
|
node, old_node);
|
2022-03-08 15:32:19 +08:00
|
|
|
hctx = blk_mq_alloc_and_init_hctx(set, q, i, old_node);
|
|
|
|
WARN_ON_ONCE(!hctx);
|
2015-12-18 08:08:14 +08:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
/*
|
|
|
|
* Increasing nr_hw_queues fails. Free the newly allocated
|
|
|
|
* hctxs and keep the previous q->nr_hw_queues.
|
|
|
|
*/
|
|
|
|
if (i != set->nr_hw_queues) {
|
|
|
|
j = q->nr_hw_queues;
|
|
|
|
} else {
|
|
|
|
j = i;
|
|
|
|
q->nr_hw_queues = set->nr_hw_queues;
|
|
|
|
}
|
2018-10-12 18:07:27 +08:00
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
xa_for_each_start(&q->hctx_table, j, hctx, j)
|
|
|
|
blk_mq_exit_hctx(q, set, hctx, j);
|
2018-01-06 16:27:40 +08:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
2015-12-18 08:08:14 +08:00
|
|
|
}
|
|
|
|
|
2022-03-08 15:32:16 +08:00
|
|
|
static void blk_mq_update_poll_flag(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
|
|
|
|
if (set->nr_maps > HCTX_TYPE_POLL &&
|
|
|
|
set->map[HCTX_TYPE_POLL].nr_queues)
|
|
|
|
blk_queue_flag_set(QUEUE_FLAG_POLL, q);
|
|
|
|
else
|
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
|
|
|
|
}
|
|
|
|
|
2021-06-02 14:53:17 +08:00
|
|
|
int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
2015-12-18 08:08:14 +08:00
|
|
|
{
|
2016-02-12 15:27:00 +08:00
|
|
|
/* mark the queue as mq asap */
|
|
|
|
q->mq_ops = set->ops;
|
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
|
2017-04-07 20:24:03 +08:00
|
|
|
blk_mq_poll_stats_bkt,
|
|
|
|
BLK_MQ_POLL_STATS_BKTS, q);
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
if (!q->poll_cb)
|
|
|
|
goto err_exit;
|
|
|
|
|
2018-11-20 09:44:35 +08:00
|
|
|
if (blk_mq_alloc_ctxs(q))
|
2019-04-20 04:35:44 +08:00
|
|
|
goto err_poll;
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2017-02-22 18:13:59 +08:00
|
|
|
/* init q->mq_kobj and sw queues' kobjects */
|
|
|
|
blk_mq_sysfs_init(q);
|
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:27 +08:00
|
|
|
INIT_LIST_HEAD(&q->unused_hctx_list);
|
|
|
|
spin_lock_init(&q->unused_hctx_lock);
|
|
|
|
|
2022-03-08 15:32:19 +08:00
|
|
|
xa_init(&q->hctx_table);
|
|
|
|
|
2015-12-18 08:08:14 +08:00
|
|
|
blk_mq_realloc_hw_ctxs(set, q);
|
|
|
|
if (!q->nr_hw_queues)
|
|
|
|
goto err_hctxs;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2015-10-30 20:57:30 +08:00
|
|
|
INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
|
2015-07-16 19:53:22 +08:00
|
|
|
blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2018-10-17 04:23:06 +08:00
|
|
|
q->tag_set = set;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2013-11-20 00:25:07 +08:00
|
|
|
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
|
2022-03-08 15:32:16 +08:00
|
|
|
blk_mq_update_poll_flag(q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-09-15 01:28:30 +08:00
|
|
|
INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
|
2014-05-28 22:08:02 +08:00
|
|
|
INIT_LIST_HEAD(&q->requeue_list);
|
|
|
|
spin_lock_init(&q->requeue_lock);
|
|
|
|
|
2014-05-21 05:17:27 +08:00
|
|
|
q->nr_requests = set->queue_depth;
|
|
|
|
|
2016-11-15 04:03:03 +08:00
|
|
|
/*
|
|
|
|
* Default to classic polling
|
|
|
|
*/
|
2019-03-18 22:44:41 +08:00
|
|
|
q->poll_nsec = BLK_MQ_POLL_CLASSIC;
|
2016-11-15 04:03:03 +08:00
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
|
2014-05-14 05:10:52 +08:00
|
|
|
blk_mq_add_queue_tag_set(set, q);
|
2017-06-26 18:20:57 +08:00
|
|
|
blk_mq_map_swqueue(q);
|
2021-06-02 14:53:17 +08:00
|
|
|
return 0;
|
2014-02-11 00:29:00 +08:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
err_hctxs:
|
blk-mq: Fix kmemleak in blk_mq_init_allocated_queue
There is a kmemleak caused by modprobe null_blk.ko
unreferenced object 0xffff8881acb1f000 (size 1024):
comm "modprobe", pid 836, jiffies 4294971190 (age 27.068s)
hex dump (first 32 bytes):
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
ff ff ff ff ff ff ff ff 00 53 99 9e ff ff ff ff .........S......
backtrace:
[<000000004a10c249>] kmalloc_node_trace+0x22/0x60
[<00000000648f7950>] blk_mq_alloc_and_init_hctx+0x289/0x350
[<00000000af06de0e>] blk_mq_realloc_hw_ctxs+0x2fe/0x3d0
[<00000000e00c1872>] blk_mq_init_allocated_queue+0x48c/0x1440
[<00000000d16b4e68>] __blk_mq_alloc_disk+0xc8/0x1c0
[<00000000d10c98c3>] 0xffffffffc450d69d
[<00000000b9299f48>] 0xffffffffc4538392
[<0000000061c39ed6>] do_one_initcall+0xd0/0x4f0
[<00000000b389383b>] do_init_module+0x1a4/0x680
[<0000000087cf3542>] load_module+0x6249/0x7110
[<00000000beba61b8>] __do_sys_finit_module+0x140/0x200
[<00000000fdcfff51>] do_syscall_64+0x35/0x80
[<000000003c0f1f71>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
That is because q->ma_ops is set to NULL before blk_release_queue is
called.
blk_mq_init_queue_data
blk_mq_init_allocated_queue
blk_mq_realloc_hw_ctxs
for (i = 0; i < set->nr_hw_queues; i++) {
old_hctx = xa_load(&q->hctx_table, i);
if (!blk_mq_alloc_and_init_hctx(.., i, ..)) [1]
if (!old_hctx)
break;
xa_for_each_start(&q->hctx_table, j, hctx, j)
blk_mq_exit_hctx(q, set, hctx, j); [2]
if (!q->nr_hw_queues) [3]
goto err_hctxs;
err_exit:
q->mq_ops = NULL; [4]
blk_put_queue
blk_release_queue
if (queue_is_mq(q)) [5]
blk_mq_release(q);
[1]: blk_mq_alloc_and_init_hctx failed at i != 0.
[2]: The hctxs allocated by [1] are moved to q->unused_hctx_list and
will be cleaned up in blk_mq_release.
[3]: q->nr_hw_queues is 0.
[4]: Set q->mq_ops to NULL.
[5]: queue_is_mq returns false due to [4]. And blk_mq_release
will not be called. The hctxs in q->unused_hctx_list are leaked.
To fix it, call blk_release_queue in exception path.
Fixes: 2f8f1336a48b ("blk-mq: always free hctx after request queue is freed")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221031031242.94107-1-chenjun102@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-31 11:12:42 +08:00
|
|
|
blk_mq_release(q);
|
2019-04-20 04:35:44 +08:00
|
|
|
err_poll:
|
|
|
|
blk_stat_free_callback(q->poll_cb);
|
|
|
|
q->poll_cb = NULL;
|
2016-05-26 14:23:27 +08:00
|
|
|
err_exit:
|
|
|
|
q->mq_ops = NULL;
|
2021-06-02 14:53:17 +08:00
|
|
|
return -ENOMEM;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
2015-03-13 11:56:02 +08:00
|
|
|
EXPORT_SYMBOL(blk_mq_init_allocated_queue);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
blk-mq: free hw queue's resource in hctx's release handler
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.
However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
We have invented ways for addressing this kind of issue before, such as:
8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
But still can't cover all cases, recently James reports another such
kind of issue:
https://marc.info/?l=linux-scsi&m=155389088124782&w=2
This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.
Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.
This approach follows typical design pattern wrt. kobject's release handler.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:25 +08:00
|
|
|
/* tags can _not_ be used after returning from blk_mq_exit_queue */
|
|
|
|
void blk_mq_exit_queue(struct request_queue *q)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2021-05-14 01:15:29 +08:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
|
2021-05-14 01:15:29 +08:00
|
|
|
/* Checks hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED. */
|
2014-05-27 23:35:13 +08:00
|
|
|
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
|
2021-05-14 01:15:29 +08:00
|
|
|
/* May clear BLK_MQ_F_TAG_QUEUE_SHARED in hctx->flags. */
|
|
|
|
blk_mq_del_queue_tag_set(q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
}
|
|
|
|
|
2014-09-10 23:02:03 +08:00
|
|
|
static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
|
|
|
set->shared_tags = blk_mq_alloc_map_and_rqs(set,
|
2021-10-05 18:23:37 +08:00
|
|
|
BLK_MQ_NO_HCTX_IDX,
|
|
|
|
set->queue_depth);
|
2021-10-05 18:23:39 +08:00
|
|
|
if (!set->shared_tags)
|
2021-10-05 18:23:37 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
We found blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and anlysis as below.
Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;
set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);
2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
}
Test nvme hot-plugging with above debug code, we found it totally cost more
than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1023, each hw queue cost about 140-250us. The cost
time will be increased with hw queue number and queue depth increasing. And
in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try
"queue_depth >>= 1", more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1
So use this patch to trigger schedule between each hw queue init, to avoid
other threads getting stuck. It is not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-26 10:39:47 +08:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++) {
|
2021-10-05 18:23:35 +08:00
|
|
|
if (!__blk_mq_alloc_map_and_rqs(set, i))
|
2014-09-10 23:02:03 +08:00
|
|
|
goto out_unwind;
|
blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
We found blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and anlysis as below.
Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;
set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);
2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
}
Test nvme hot-plugging with above debug code, we found it totally cost more
than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1023, each hw queue cost about 140-250us. The cost
time will be increased with hw queue number and queue depth increasing. And
in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try
"queue_depth >>= 1", more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1
So use this patch to trigger schedule between each hw queue init, to avoid
other threads getting stuck. It is not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-26 10:39:47 +08:00
|
|
|
cond_resched();
|
|
|
|
}
|
2014-09-10 23:02:03 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_unwind:
|
|
|
|
while (--i >= 0)
|
2021-10-05 18:23:37 +08:00
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
|
|
|
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
|
|
|
blk_mq_free_map_and_rqs(set, set->shared_tags,
|
2021-10-05 18:23:37 +08:00
|
|
|
BLK_MQ_NO_HCTX_IDX);
|
2021-10-05 18:23:36 +08:00
|
|
|
}
|
2014-09-10 23:02:03 +08:00
|
|
|
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate the request maps associated with this tag_set. Note that this
|
|
|
|
* may reduce the depth asked for, if memory is tight. set->queue_depth
|
|
|
|
* will be updated to reflect the allocated depth.
|
|
|
|
*/
|
2021-10-05 18:23:35 +08:00
|
|
|
static int blk_mq_alloc_set_map_and_rqs(struct blk_mq_tag_set *set)
|
2014-09-10 23:02:03 +08:00
|
|
|
{
|
|
|
|
unsigned int depth;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
depth = set->queue_depth;
|
|
|
|
do {
|
|
|
|
err = __blk_mq_alloc_rq_maps(set);
|
|
|
|
if (!err)
|
|
|
|
break;
|
|
|
|
|
|
|
|
set->queue_depth >>= 1;
|
|
|
|
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
} while (set->queue_depth);
|
|
|
|
|
|
|
|
if (!set->queue_depth || err) {
|
|
|
|
pr_err("blk-mq: failed to allocate request map\n");
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (depth != set->queue_depth)
|
|
|
|
pr_info("blk-mq: reduced tag depth (%u -> %u)\n",
|
|
|
|
depth, set->queue_depth);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-08-16 01:00:43 +08:00
|
|
|
static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
|
2017-04-07 22:53:11 +08:00
|
|
|
{
|
2020-03-10 12:26:17 +08:00
|
|
|
/*
|
|
|
|
* blk_mq_map_queues() and multiple .map_queues() implementations
|
|
|
|
* expect that set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the
|
|
|
|
* number of hardware queues.
|
|
|
|
*/
|
|
|
|
if (set->nr_maps == 1)
|
|
|
|
set->map[HCTX_TYPE_DEFAULT].nr_queues = set->nr_hw_queues;
|
|
|
|
|
2018-12-07 11:03:53 +08:00
|
|
|
if (set->ops->map_queues && !is_kdump_kernel()) {
|
2018-10-31 00:36:06 +08:00
|
|
|
int i;
|
|
|
|
|
2018-01-06 16:27:39 +08:00
|
|
|
/*
|
|
|
|
* transport .map_queues is usually done in the following
|
|
|
|
* way:
|
|
|
|
*
|
|
|
|
* for (queue = 0; queue < set->nr_hw_queues; queue++) {
|
|
|
|
* mask = get_cpu_mask(queue)
|
|
|
|
* for_each_cpu(cpu, mask)
|
2018-10-31 00:36:06 +08:00
|
|
|
* set->map[x].mq_map[cpu] = queue;
|
2018-01-06 16:27:39 +08:00
|
|
|
* }
|
|
|
|
*
|
|
|
|
* When we need to remap, the table has to be cleared for
|
|
|
|
* killing stale mapping since one CPU may not be mapped
|
|
|
|
* to any hw queue.
|
|
|
|
*/
|
2018-10-31 00:36:06 +08:00
|
|
|
for (i = 0; i < set->nr_maps; i++)
|
|
|
|
blk_mq_clear_mq_map(&set->map[i]);
|
2018-01-06 16:27:39 +08:00
|
|
|
|
2022-08-16 01:00:43 +08:00
|
|
|
set->ops->map_queues(set);
|
2018-10-31 00:36:06 +08:00
|
|
|
} else {
|
|
|
|
BUG_ON(set->nr_maps > 1);
|
2022-08-16 01:00:43 +08:00
|
|
|
blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
|
2018-10-31 00:36:06 +08:00
|
|
|
}
|
2017-04-07 22:53:11 +08:00
|
|
|
}
|
|
|
|
|
2019-10-26 00:50:10 +08:00
|
|
|
static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
|
2022-11-09 18:08:11 +08:00
|
|
|
int new_nr_hw_queues)
|
2019-10-26 00:50:10 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_tags **new_tags;
|
|
|
|
|
2022-11-09 18:08:11 +08:00
|
|
|
if (set->nr_hw_queues >= new_nr_hw_queues)
|
2022-11-22 16:49:17 +08:00
|
|
|
goto done;
|
2019-10-26 00:50:10 +08:00
|
|
|
|
|
|
|
new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
|
|
|
|
GFP_KERNEL, set->numa_node);
|
|
|
|
if (!new_tags)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
if (set->tags)
|
2022-11-09 18:08:11 +08:00
|
|
|
memcpy(new_tags, set->tags, set->nr_hw_queues *
|
2019-10-26 00:50:10 +08:00
|
|
|
sizeof(*set->tags));
|
|
|
|
kfree(set->tags);
|
|
|
|
set->tags = new_tags;
|
2022-11-22 16:49:17 +08:00
|
|
|
done:
|
2019-10-26 00:50:10 +08:00
|
|
|
set->nr_hw_queues = new_nr_hw_queues;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-06-06 05:21:56 +08:00
|
|
|
/*
|
|
|
|
* Alloc a tag set to be associated with one or more request queues.
|
|
|
|
* May fail with EINVAL for various error conditions. May adjust the
|
2018-06-30 21:12:41 +08:00
|
|
|
* requested depth down, if it's too large. In that case, the set
|
2014-06-06 05:21:56 +08:00
|
|
|
* value will be stored in set->queue_depth.
|
|
|
|
*/
|
2014-04-16 04:14:00 +08:00
|
|
|
int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
|
|
|
|
{
|
2018-10-31 00:36:06 +08:00
|
|
|
int i, ret;
|
2016-09-14 22:18:55 +08:00
|
|
|
|
2014-10-30 21:45:11 +08:00
|
|
|
BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
|
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
if (!set->nr_hw_queues)
|
|
|
|
return -EINVAL;
|
2014-06-06 05:21:56 +08:00
|
|
|
if (!set->queue_depth)
|
2014-04-16 04:14:00 +08:00
|
|
|
return -EINVAL;
|
|
|
|
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2016-09-14 22:18:54 +08:00
|
|
|
if (!set->ops->queue_rq)
|
2014-04-16 04:14:00 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2017-10-14 17:22:29 +08:00
|
|
|
if (!set->ops->get_budget ^ !set->ops->put_budget)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2014-06-06 05:21:56 +08:00
|
|
|
if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
|
|
|
|
pr_info("blk-mq: reduced tag depth to %u\n",
|
|
|
|
BLK_MQ_MAX_DEPTH);
|
|
|
|
set->queue_depth = BLK_MQ_MAX_DEPTH;
|
|
|
|
}
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2018-10-31 00:36:06 +08:00
|
|
|
if (!set->nr_maps)
|
|
|
|
set->nr_maps = 1;
|
|
|
|
else if (set->nr_maps > HCTX_MAX_TYPES)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2014-12-01 08:00:58 +08:00
|
|
|
/*
|
|
|
|
* If a crashdump is active, then we are potentially in a very
|
|
|
|
* memory constrained environment. Limit us to 1 queue and
|
|
|
|
* 64 tags to prevent using too much memory.
|
|
|
|
*/
|
|
|
|
if (is_kdump_kernel()) {
|
|
|
|
set->nr_hw_queues = 1;
|
2018-12-07 11:03:53 +08:00
|
|
|
set->nr_maps = 1;
|
2014-12-01 08:00:58 +08:00
|
|
|
set->queue_depth = min(64U, set->queue_depth);
|
|
|
|
}
|
2015-12-18 08:08:14 +08:00
|
|
|
/*
|
2018-10-30 03:25:27 +08:00
|
|
|
* There is no use for more h/w queues than cpus if we just have
|
|
|
|
* a single map
|
2015-12-18 08:08:14 +08:00
|
|
|
*/
|
2018-10-30 03:25:27 +08:00
|
|
|
if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
|
2015-12-18 08:08:14 +08:00
|
|
|
set->nr_hw_queues = nr_cpu_ids;
|
2014-12-01 08:00:58 +08:00
|
|
|
|
2022-11-01 23:00:47 +08:00
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING) {
|
|
|
|
set->srcu = kmalloc(sizeof(*set->srcu), GFP_KERNEL);
|
|
|
|
if (!set->srcu)
|
|
|
|
return -ENOMEM;
|
|
|
|
ret = init_srcu_struct(set->srcu);
|
|
|
|
if (ret)
|
|
|
|
goto out_free_srcu;
|
|
|
|
}
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2016-09-14 22:18:55 +08:00
|
|
|
ret = -ENOMEM;
|
2022-11-09 18:08:10 +08:00
|
|
|
set->tags = kcalloc_node(set->nr_hw_queues,
|
|
|
|
sizeof(struct blk_mq_tags *), GFP_KERNEL,
|
|
|
|
set->numa_node);
|
|
|
|
if (!set->tags)
|
2022-11-01 23:00:47 +08:00
|
|
|
goto out_cleanup_srcu;
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2018-10-31 00:36:06 +08:00
|
|
|
for (i = 0; i < set->nr_maps; i++) {
|
|
|
|
set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
|
2018-12-17 18:42:45 +08:00
|
|
|
sizeof(set->map[i].mq_map[0]),
|
2018-10-31 00:36:06 +08:00
|
|
|
GFP_KERNEL, set->numa_node);
|
|
|
|
if (!set->map[i].mq_map)
|
|
|
|
goto out_free_mq_map;
|
2018-12-07 11:03:53 +08:00
|
|
|
set->map[i].nr_queues = is_kdump_kernel() ? 1 : set->nr_hw_queues;
|
2018-10-31 00:36:06 +08:00
|
|
|
}
|
2016-09-14 22:18:53 +08:00
|
|
|
|
2022-08-16 01:00:43 +08:00
|
|
|
blk_mq_update_queue_map(set);
|
2016-09-14 22:18:55 +08:00
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
ret = blk_mq_alloc_set_map_and_rqs(set);
|
2016-09-14 22:18:55 +08:00
|
|
|
if (ret)
|
2016-09-14 22:18:53 +08:00
|
|
|
goto out_free_mq_map;
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2014-05-14 05:10:52 +08:00
|
|
|
mutex_init(&set->tag_list_lock);
|
|
|
|
INIT_LIST_HEAD(&set->tag_list);
|
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
return 0;
|
2016-09-14 22:18:53 +08:00
|
|
|
|
|
|
|
out_free_mq_map:
|
2018-10-31 00:36:06 +08:00
|
|
|
for (i = 0; i < set->nr_maps; i++) {
|
|
|
|
kfree(set->map[i].mq_map);
|
|
|
|
set->map[i].mq_map = NULL;
|
|
|
|
}
|
2014-09-03 00:38:44 +08:00
|
|
|
kfree(set->tags);
|
|
|
|
set->tags = NULL;
|
2022-11-01 23:00:47 +08:00
|
|
|
out_cleanup_srcu:
|
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING)
|
|
|
|
cleanup_srcu_struct(set->srcu);
|
|
|
|
out_free_srcu:
|
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING)
|
|
|
|
kfree(set->srcu);
|
2016-09-14 22:18:55 +08:00
|
|
|
return ret;
|
2014-04-16 04:14:00 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_tag_set);
|
|
|
|
|
2021-06-02 14:53:16 +08:00
|
|
|
/* allocate and initialize a tagset for a simple single-queue device */
|
|
|
|
int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
|
|
|
|
const struct blk_mq_ops *ops, unsigned int queue_depth,
|
|
|
|
unsigned int set_flags)
|
|
|
|
{
|
|
|
|
memset(set, 0, sizeof(*set));
|
|
|
|
set->ops = ops;
|
|
|
|
set->nr_hw_queues = 1;
|
|
|
|
set->nr_maps = 1;
|
|
|
|
set->queue_depth = queue_depth;
|
|
|
|
set->numa_node = NUMA_NO_NODE;
|
|
|
|
set->flags = set_flags;
|
|
|
|
return blk_mq_alloc_tag_set(set);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_alloc_sq_tag_set);
|
|
|
|
|
2014-04-16 04:14:00 +08:00
|
|
|
void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
|
|
|
|
{
|
2018-10-31 00:36:06 +08:00
|
|
|
int i, j;
|
2014-04-16 04:14:00 +08:00
|
|
|
|
2019-10-26 00:50:10 +08:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++)
|
2021-10-05 18:23:37 +08:00
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
2014-05-22 04:01:15 +08:00
|
|
|
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
|
|
|
blk_mq_free_map_and_rqs(set, set->shared_tags,
|
2021-10-05 18:23:37 +08:00
|
|
|
BLK_MQ_NO_HCTX_IDX);
|
|
|
|
}
|
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 23:20:24 +08:00
|
|
|
|
2018-10-31 00:36:06 +08:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
|
|
|
kfree(set->map[j].mq_map);
|
|
|
|
set->map[j].mq_map = NULL;
|
|
|
|
}
|
2016-09-14 22:18:53 +08:00
|
|
|
|
2014-04-24 00:07:34 +08:00
|
|
|
kfree(set->tags);
|
2014-09-03 00:38:44 +08:00
|
|
|
set->tags = NULL;
|
2022-11-01 23:00:47 +08:00
|
|
|
if (set->flags & BLK_MQ_F_BLOCKING) {
|
|
|
|
cleanup_srcu_struct(set->srcu);
|
|
|
|
kfree(set->srcu);
|
|
|
|
}
|
2014-04-16 04:14:00 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_free_tag_set);
|
|
|
|
|
2014-05-21 01:49:02 +08:00
|
|
|
int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
|
|
|
|
{
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2022-03-08 15:32:18 +08:00
|
|
|
int ret;
|
|
|
|
unsigned long i;
|
2014-05-21 01:49:02 +08:00
|
|
|
|
2017-01-17 21:03:22 +08:00
|
|
|
if (!set)
|
2014-05-21 01:49:02 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-02-09 00:14:05 +08:00
|
|
|
if (q->nr_requests == nr)
|
|
|
|
return 0;
|
|
|
|
|
2017-01-20 01:59:07 +08:00
|
|
|
blk_mq_freeze_queue(q);
|
2018-01-06 16:27:38 +08:00
|
|
|
blk_mq_quiesce_queue(q);
|
2017-01-20 01:59:07 +08:00
|
|
|
|
2014-05-21 01:49:02 +08:00
|
|
|
ret = 0;
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2016-02-19 05:56:35 +08:00
|
|
|
if (!hctx->tags)
|
|
|
|
continue;
|
2017-01-17 21:03:22 +08:00
|
|
|
/*
|
|
|
|
* If we're using an MQ scheduler, just update the scheduler
|
|
|
|
* queue depth. This is similar to what the old code would do.
|
|
|
|
*/
|
2021-10-05 18:23:29 +08:00
|
|
|
if (hctx->sched_tags) {
|
2017-01-20 01:59:07 +08:00
|
|
|
ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
|
2021-10-05 18:23:29 +08:00
|
|
|
nr, true);
|
|
|
|
} else {
|
|
|
|
ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
|
|
|
|
false);
|
2017-01-20 01:59:07 +08:00
|
|
|
}
|
2014-05-21 01:49:02 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
2019-01-19 01:34:16 +08:00
|
|
|
if (q->elevator && q->elevator->type->ops.depth_updated)
|
|
|
|
q->elevator->type->ops.depth_updated(hctx);
|
2014-05-21 01:49:02 +08:00
|
|
|
}
|
blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-13 20:00:58 +08:00
|
|
|
if (!ret) {
|
2014-05-21 01:49:02 +08:00
|
|
|
q->nr_requests = nr;
|
2021-10-05 18:23:39 +08:00
|
|
|
if (blk_mq_is_shared_tags(set->flags)) {
|
2021-10-05 18:23:28 +08:00
|
|
|
if (q->elevator)
|
2021-10-05 18:23:39 +08:00
|
|
|
blk_mq_tag_update_sched_shared_tags(q);
|
2021-10-05 18:23:28 +08:00
|
|
|
else
|
2021-10-05 18:23:39 +08:00
|
|
|
blk_mq_tag_resize_shared_tags(set, nr);
|
2021-10-05 18:23:28 +08:00
|
|
|
}
|
blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-13 20:00:58 +08:00
|
|
|
}
|
2014-05-21 01:49:02 +08:00
|
|
|
|
2018-01-06 16:27:38 +08:00
|
|
|
blk_mq_unquiesce_queue(q);
|
2017-01-20 01:59:07 +08:00
|
|
|
blk_mq_unfreeze_queue(q);
|
|
|
|
|
2014-05-21 01:49:02 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-08-21 15:15:03 +08:00
|
|
|
/*
|
|
|
|
* request_queue and elevator_type pair.
|
|
|
|
* It is just used by __blk_mq_update_nr_hw_queues to cache
|
|
|
|
* the elevator_type associated with a request_queue.
|
|
|
|
*/
|
|
|
|
struct blk_mq_qe_pair {
|
|
|
|
struct list_head node;
|
|
|
|
struct request_queue *q;
|
|
|
|
struct elevator_type *type;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cache the elevator_type in qe pair list and switch the
|
|
|
|
* io scheduler to 'none'
|
|
|
|
*/
|
|
|
|
static bool blk_mq_elv_switch_none(struct list_head *head,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_qe_pair *qe;
|
|
|
|
|
|
|
|
if (!q->elevator)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
qe = kmalloc(sizeof(*qe), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
|
|
|
|
if (!qe)
|
|
|
|
return false;
|
|
|
|
|
2022-06-16 09:43:59 +08:00
|
|
|
/* q->elevator needs protection from ->sysfs_lock */
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
|
|
|
|
2018-08-21 15:15:03 +08:00
|
|
|
INIT_LIST_HEAD(&qe->node);
|
|
|
|
qe->q = q;
|
|
|
|
qe->type = q->elevator->type;
|
2022-10-20 14:48:16 +08:00
|
|
|
/* keep a reference to the elevator module as we'll switch back */
|
|
|
|
__elevator_get(qe->type);
|
2018-08-21 15:15:03 +08:00
|
|
|
list_add(&qe->node, head);
|
2022-10-30 18:07:14 +08:00
|
|
|
elevator_disable(q);
|
2018-08-21 15:15:03 +08:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2022-03-31 17:12:18 +08:00
|
|
|
static struct blk_mq_qe_pair *blk_lookup_qe_pair(struct list_head *head,
|
|
|
|
struct request_queue *q)
|
2018-08-21 15:15:03 +08:00
|
|
|
{
|
|
|
|
struct blk_mq_qe_pair *qe;
|
|
|
|
|
|
|
|
list_for_each_entry(qe, head, node)
|
2022-03-31 17:12:18 +08:00
|
|
|
if (qe->q == q)
|
|
|
|
return qe;
|
2018-08-21 15:15:03 +08:00
|
|
|
|
2022-03-31 17:12:18 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
2018-08-21 15:15:03 +08:00
|
|
|
|
2022-03-31 17:12:18 +08:00
|
|
|
static void blk_mq_elv_switch_back(struct list_head *head,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_qe_pair *qe;
|
|
|
|
struct elevator_type *t;
|
|
|
|
|
|
|
|
qe = blk_lookup_qe_pair(head, q);
|
|
|
|
if (!qe)
|
|
|
|
return;
|
|
|
|
t = qe->type;
|
2018-08-21 15:15:03 +08:00
|
|
|
list_del(&qe->node);
|
|
|
|
kfree(qe);
|
|
|
|
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
2022-09-27 23:56:52 +08:00
|
|
|
elevator_switch(q, t);
|
block: fix up elevator_type refcounting
The current reference management logic of io scheduler modules contains
refcnt problems. For example, blk_mq_init_sched may fail before or after
the calling of e->ops.init_sched. If it fails before the calling, it does
nothing to the reference to the io scheduler module. But if it fails after
the calling, it releases the reference by calling kobject_put(&eq->kobj).
As the callers of blk_mq_init_sched can't know exactly where the failure
happens, they can't handle the reference to the io scheduler module
properly: releasing the reference on failure results in double-release if
blk_mq_init_sched has released it, and not releasing the reference results
in ghost reference if blk_mq_init_sched did not release it either.
The same problem also exists in io schedulers' init_sched implementations.
We can address the problem by adding releasing statements to the error
handling procedures of blk_mq_init_sched and init_sched implementations.
But that is counterintuitive and requires modifications to existing io
schedulers.
Instead, We make elevator_alloc get the io scheduler module references
that will be released by elevator_release. And then, we match each
elevator_get with an elevator_put. Therefore, each reference to an io
scheduler module explicitly has its own getter and releaser, and we no
longer need to worry about the refcnt problems.
The bugs and the patch can be validated with tools here:
https://github.com/nickyc975/linux_elv_refcnt_bug.git
[hch: split out a few bits into separate patches, use a non-try
module_get in elevator_alloc]
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20 14:48:19 +08:00
|
|
|
/* drop the reference acquired in blk_mq_elv_switch_none */
|
|
|
|
elevator_put(t);
|
2018-08-21 15:15:03 +08:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
}
|
|
|
|
|
2017-05-31 02:39:11 +08:00
|
|
|
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
|
|
|
|
int nr_hw_queues)
|
2015-12-18 08:08:14 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2018-08-21 15:15:03 +08:00
|
|
|
LIST_HEAD(head);
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
int prev_nr_hw_queues;
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2017-04-08 02:16:49 +08:00
|
|
|
lockdep_assert_held(&set->tag_list_lock);
|
|
|
|
|
2018-10-30 03:25:27 +08:00
|
|
|
if (set->nr_maps == 1 && nr_hw_queues > nr_cpu_ids)
|
2015-12-18 08:08:14 +08:00
|
|
|
nr_hw_queues = nr_cpu_ids;
|
2020-06-17 14:18:37 +08:00
|
|
|
if (nr_hw_queues < 1)
|
|
|
|
return;
|
|
|
|
if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
|
2015-12-18 08:08:14 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_freeze_queue(q);
|
2018-08-21 15:15:03 +08:00
|
|
|
/*
|
|
|
|
* Switch IO scheduler to 'none', cleaning up the data associated
|
|
|
|
* with the previous scheduler. We will switch back once we are done
|
|
|
|
* updating the new sw to hw queue mappings.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
if (!blk_mq_elv_switch_none(&head, q))
|
|
|
|
goto switch_back;
|
2015-12-18 08:08:14 +08:00
|
|
|
|
2018-10-12 18:07:25 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_debugfs_unregister_hctxs(q);
|
2022-06-29 01:18:49 +08:00
|
|
|
blk_mq_sysfs_unregister_hctxs(q);
|
2018-10-12 18:07:25 +08:00
|
|
|
}
|
|
|
|
|
2020-05-07 21:03:56 +08:00
|
|
|
prev_nr_hw_queues = set->nr_hw_queues;
|
2022-11-09 18:08:11 +08:00
|
|
|
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
|
2019-10-26 00:50:10 +08:00
|
|
|
goto reregister;
|
|
|
|
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
fallback:
|
2020-05-13 08:44:05 +08:00
|
|
|
blk_mq_update_queue_map(set);
|
2015-12-18 08:08:14 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_realloc_hw_ctxs(set, q);
|
2022-03-08 15:32:16 +08:00
|
|
|
blk_mq_update_poll_flag(q);
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
if (q->nr_hw_queues != set->nr_hw_queues) {
|
2021-11-08 15:40:19 +08:00
|
|
|
int i = prev_nr_hw_queues;
|
|
|
|
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
|
|
|
|
nr_hw_queues, prev_nr_hw_queues);
|
2021-11-08 15:40:19 +08:00
|
|
|
for (; i < set->nr_hw_queues; i++)
|
|
|
|
__blk_mq_free_map_and_rqs(set, i);
|
|
|
|
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
set->nr_hw_queues = prev_nr_hw_queues;
|
2019-02-27 21:35:01 +08:00
|
|
|
blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 18:07:28 +08:00
|
|
|
goto fallback;
|
|
|
|
}
|
2018-10-12 18:07:25 +08:00
|
|
|
blk_mq_map_swqueue(q);
|
|
|
|
}
|
|
|
|
|
2019-10-26 00:50:10 +08:00
|
|
|
reregister:
|
2018-10-12 18:07:25 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
2022-06-29 01:18:49 +08:00
|
|
|
blk_mq_sysfs_register_hctxs(q);
|
2018-10-12 18:07:25 +08:00
|
|
|
blk_mq_debugfs_register_hctxs(q);
|
2015-12-18 08:08:14 +08:00
|
|
|
}
|
|
|
|
|
2018-08-21 15:15:03 +08:00
|
|
|
switch_back:
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_elv_switch_back(&head, q);
|
|
|
|
|
2015-12-18 08:08:14 +08:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_unfreeze_queue(q);
|
|
|
|
}
|
2017-05-31 02:39:11 +08:00
|
|
|
|
|
|
|
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
|
|
|
|
{
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
|
|
|
__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
|
|
|
|
mutex_unlock(&set->tag_list_lock);
|
|
|
|
}
|
2015-12-18 08:08:14 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
|
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
/* Enable polling stats and return whether they were already enabled. */
|
|
|
|
static bool blk_poll_stats_enable(struct request_queue *q)
|
|
|
|
{
|
2021-11-14 05:03:26 +08:00
|
|
|
if (q->poll_stat)
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
return true;
|
2021-11-14 05:03:26 +08:00
|
|
|
|
|
|
|
return blk_stats_alloc_enable(q);
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_poll_stats_start(struct request_queue *q)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We don't arm the callback if polling stats are not enabled or the
|
|
|
|
* callback is already active.
|
|
|
|
*/
|
2021-11-14 05:03:26 +08:00
|
|
|
if (!q->poll_stat || blk_stat_is_active(q->poll_cb))
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
blk_stat_activate_msecs(q->poll_cb, 100);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
|
|
|
|
{
|
|
|
|
struct request_queue *q = cb->data;
|
2017-04-07 20:24:03 +08:00
|
|
|
int bucket;
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
|
2017-04-07 20:24:03 +08:00
|
|
|
for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS; bucket++) {
|
|
|
|
if (cb->stat[bucket].nr_samples)
|
|
|
|
q->poll_stat[bucket] = cb->stat[bucket];
|
|
|
|
}
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
}
|
|
|
|
|
2016-11-15 04:03:03 +08:00
|
|
|
static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
|
|
|
|
struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned long ret = 0;
|
2017-04-07 20:24:03 +08:00
|
|
|
int bucket;
|
2016-11-15 04:03:03 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If stats collection isn't on, don't sleep but turn it on for
|
|
|
|
* future users
|
|
|
|
*/
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
if (!blk_poll_stats_enable(q))
|
2016-11-15 04:03:03 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* As an optimistic guess, use half of the mean service time
|
|
|
|
* for this type of request. We can (and should) make this smarter.
|
|
|
|
* For instance, if the completion latencies are tight, we can
|
|
|
|
* get closer than just half the mean. This is especially
|
|
|
|
* important on devices where the completion latencies are longer
|
2017-04-07 20:24:03 +08:00
|
|
|
* than ~10 usec. We do use the stats for the relevant IO size
|
|
|
|
* if available which does lead to better estimates.
|
2016-11-15 04:03:03 +08:00
|
|
|
*/
|
2017-04-07 20:24:03 +08:00
|
|
|
bucket = blk_mq_poll_stats_bkt(rq);
|
|
|
|
if (bucket < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (q->poll_stat[bucket].nr_samples)
|
|
|
|
ret = (q->poll_stat[bucket].mean + 1) / 2;
|
2016-11-15 04:03:03 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-10-12 19:12:16 +08:00
|
|
|
static bool blk_mq_poll_hybrid(struct request_queue *q, blk_qc_t qc)
|
2016-11-15 04:01:59 +08:00
|
|
|
{
|
2021-10-12 19:12:16 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = blk_qc_to_hctx(q, qc);
|
|
|
|
struct request *rq = blk_qc_to_rq(hctx, qc);
|
2016-11-15 04:01:59 +08:00
|
|
|
struct hrtimer_sleeper hs;
|
|
|
|
enum hrtimer_mode mode;
|
2016-11-15 04:03:03 +08:00
|
|
|
unsigned int nsecs;
|
2016-11-15 04:01:59 +08:00
|
|
|
ktime_t kt;
|
|
|
|
|
2021-10-12 19:12:16 +08:00
|
|
|
/*
|
|
|
|
* If a request has completed on queue that uses an I/O scheduler, we
|
|
|
|
* won't get back a request from blk_qc_to_rq.
|
|
|
|
*/
|
|
|
|
if (!rq || (rq->rq_flags & RQF_MQ_POLL_SLEPT))
|
2016-11-15 04:03:03 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
2018-11-26 23:21:49 +08:00
|
|
|
* If we get here, hybrid polling is enabled. Hence poll_nsec can be:
|
2016-11-15 04:03:03 +08:00
|
|
|
*
|
|
|
|
* 0: use half of prev avg
|
|
|
|
* >0: use this specific value
|
|
|
|
*/
|
2018-11-26 23:21:49 +08:00
|
|
|
if (q->poll_nsec > 0)
|
2016-11-15 04:03:03 +08:00
|
|
|
nsecs = q->poll_nsec;
|
|
|
|
else
|
2020-02-26 20:10:15 +08:00
|
|
|
nsecs = blk_mq_poll_nsecs(q, rq);
|
2016-11-15 04:03:03 +08:00
|
|
|
|
|
|
|
if (!nsecs)
|
2016-11-15 04:01:59 +08:00
|
|
|
return false;
|
|
|
|
|
2018-01-11 02:30:56 +08:00
|
|
|
rq->rq_flags |= RQF_MQ_POLL_SLEPT;
|
2016-11-15 04:01:59 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This will be replaced with the stats tracking code, using
|
|
|
|
* 'avg_completion_time / 2' as the pre-sleep target.
|
|
|
|
*/
|
2016-12-25 19:30:41 +08:00
|
|
|
kt = nsecs;
|
2016-11-15 04:01:59 +08:00
|
|
|
|
|
|
|
mode = HRTIMER_MODE_REL;
|
2019-07-27 02:30:50 +08:00
|
|
|
hrtimer_init_sleeper_on_stack(&hs, CLOCK_MONOTONIC, mode);
|
2016-11-15 04:01:59 +08:00
|
|
|
hrtimer_set_expires(&hs.timer, kt);
|
|
|
|
|
|
|
|
do {
|
2018-01-10 00:29:52 +08:00
|
|
|
if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
|
2016-11-15 04:01:59 +08:00
|
|
|
break;
|
|
|
|
set_current_state(TASK_UNINTERRUPTIBLE);
|
2019-07-31 03:16:55 +08:00
|
|
|
hrtimer_sleeper_start_expires(&hs, mode);
|
2016-11-15 04:01:59 +08:00
|
|
|
if (hs.task)
|
|
|
|
io_schedule();
|
|
|
|
hrtimer_cancel(&hs.timer);
|
|
|
|
mode = HRTIMER_MODE_ABS;
|
|
|
|
} while (hs.task && !signal_pending(current));
|
|
|
|
|
|
|
|
__set_current_state(TASK_RUNNING);
|
|
|
|
destroy_hrtimer_on_stack(&hs.timer);
|
2018-11-26 23:21:49 +08:00
|
|
|
|
2016-11-15 04:01:59 +08:00
|
|
|
/*
|
2021-10-12 19:12:16 +08:00
|
|
|
* If we sleep, have the caller restart the poll loop to reset the
|
|
|
|
* state. Like for the other success return cases, the caller is
|
|
|
|
* responsible for checking if the IO completed. If the IO isn't
|
|
|
|
* complete, we'll get called again and will go straight to the busy
|
|
|
|
* poll loop.
|
2016-11-15 04:01:59 +08:00
|
|
|
*/
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2021-10-12 19:12:16 +08:00
|
|
|
static int blk_mq_poll_classic(struct request_queue *q, blk_qc_t cookie,
|
2021-10-12 23:24:29 +08:00
|
|
|
struct io_comp_batch *iob, unsigned int flags)
|
2016-11-04 23:34:34 +08:00
|
|
|
{
|
2021-10-12 19:12:16 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx = blk_qc_to_hctx(q, cookie);
|
|
|
|
long state = get_current_state();
|
|
|
|
int ret;
|
2016-11-04 23:34:34 +08:00
|
|
|
|
2018-11-14 12:32:10 +08:00
|
|
|
do {
|
2021-10-12 23:24:29 +08:00
|
|
|
ret = q->mq_ops->poll(hctx, iob);
|
2016-11-04 23:34:34 +08:00
|
|
|
if (ret > 0) {
|
2018-11-16 23:37:34 +08:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2018-11-07 04:30:55 +08:00
|
|
|
return ret;
|
2016-11-04 23:34:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (signal_pending_state(state, current))
|
2018-11-16 23:37:34 +08:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2021-06-11 16:28:12 +08:00
|
|
|
if (task_is_running(current))
|
2018-11-07 04:30:55 +08:00
|
|
|
return 1;
|
2021-10-12 19:12:16 +08:00
|
|
|
|
2021-10-12 19:12:19 +08:00
|
|
|
if (ret < 0 || (flags & BLK_POLL_ONESHOT))
|
2016-11-04 23:34:34 +08:00
|
|
|
break;
|
|
|
|
cpu_relax();
|
2018-11-14 12:32:10 +08:00
|
|
|
} while (!need_resched());
|
2016-11-04 23:34:34 +08:00
|
|
|
|
2018-02-13 23:48:12 +08:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2018-11-07 04:30:55 +08:00
|
|
|
return 0;
|
2016-11-04 23:34:34 +08:00
|
|
|
}
|
2018-11-26 23:21:49 +08:00
|
|
|
|
2021-10-12 23:24:29 +08:00
|
|
|
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
|
|
|
|
unsigned int flags)
|
2018-11-26 23:21:49 +08:00
|
|
|
{
|
2021-10-12 19:12:20 +08:00
|
|
|
if (!(flags & BLK_POLL_NOSLEEP) &&
|
|
|
|
q->poll_nsec != BLK_MQ_POLL_CLASSIC) {
|
2021-10-12 19:12:16 +08:00
|
|
|
if (blk_mq_poll_hybrid(q, cookie))
|
2018-11-07 04:30:55 +08:00
|
|
|
return 1;
|
2021-10-12 19:12:16 +08:00
|
|
|
}
|
2021-10-12 23:24:29 +08:00
|
|
|
return blk_mq_poll_classic(q, cookie, iob, flags);
|
2016-11-04 23:34:34 +08:00
|
|
|
}
|
|
|
|
|
2018-11-01 07:01:22 +08:00
|
|
|
unsigned int blk_mq_rq_cpu(struct request *rq)
|
|
|
|
{
|
|
|
|
return rq->mq_ctx->cpu;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_rq_cpu);
|
|
|
|
|
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().
However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.
Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():
1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.
2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.
[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-16 09:43:43 +08:00
|
|
|
void blk_mq_cancel_work_sync(struct request_queue *q)
|
|
|
|
{
|
2022-10-30 17:47:30 +08:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
unsigned long i;
|
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().
However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.
Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():
1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.
2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.
[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-16 09:43:43 +08:00
|
|
|
|
2022-10-30 17:47:30 +08:00
|
|
|
cancel_delayed_work_sync(&q->requeue_work);
|
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().
However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.
Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():
1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.
2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.
[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-16 09:43:43 +08:00
|
|
|
|
2022-10-30 17:47:30 +08:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
cancel_delayed_work_sync(&hctx->run_work);
|
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().
However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.
Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():
1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.
2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.
[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-16 09:43:43 +08:00
|
|
|
}
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
static int __init blk_mq_init(void)
|
|
|
|
{
|
2020-06-11 14:44:41 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for_each_possible_cpu(i)
|
2021-01-24 04:10:27 +08:00
|
|
|
init_llist_head(&per_cpu(blk_cpu_done, i));
|
2020-06-11 14:44:41 +08:00
|
|
|
open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
|
|
|
|
|
|
|
|
cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
|
|
|
|
"block/softirq:dead", NULL,
|
|
|
|
blk_softirq_cpu_dead);
|
2016-09-22 22:05:17 +08:00
|
|
|
cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
|
|
|
|
blk_mq_hctx_notify_dead);
|
2020-05-29 21:53:15 +08:00
|
|
|
cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
|
|
|
|
blk_mq_hctx_notify_online,
|
|
|
|
blk_mq_hctx_notify_offline);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
subsys_initcall(blk_mq_init);
|