License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0                                                11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
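The default-license heuristic and the per-file-type comment style described in this commit message can be sketched as follows. This is only an illustration, not the script Thomas and Greg used: the function names are hypothetical, the `*/uapi/*` match is approximated with a substring test, and the comment styles (`/* */` for headers, `//` for .c files) follow the kernel's documented SPDX convention rather than anything stated in the message itself.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: default identifier for a file with no license text. */
static const char *sketch_default_spdx_id(const char *path)
{
	/* Assumption: a "/uapi/" path component stands in for the glob. */
	return strstr(path, "/uapi/") ? "GPL-2.0 WITH Linux-syscall-note"
				      : "GPL-2.0";
}

/* Hypothetical helper: true if the path ends in ".h". */
static int sketch_is_header(const char *path)
{
	size_t n = strlen(path);

	return n >= 2 && strcmp(path + n - 2, ".h") == 0;
}

/* Format the SPDX tag line in the comment style the file type expects. */
static int sketch_format_spdx_line(const char *path, char *buf, size_t len)
{
	const char *id = sketch_default_spdx_id(path);

	if (sketch_is_header(path))
		return snprintf(buf, len, "/* SPDX-License-Identifier: %s */", id);
	return snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
}
```

For this header, the sketch produces the same first line that the blame view shows below.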
2017-11-01 22:07:57 +08:00
/* SPDX-License-Identifier: GPL-2.0 */
2005-04-17 06:20:36 +08:00
#ifndef _LINUX_BLKDEV_H
#define _LINUX_BLKDEV_H

2012-05-14 14:29:23 +08:00
#include <linux/sched.h>
2017-02-01 23:36:40 +08:00
#include <linux/sched/clock.h>
2012-05-14 14:29:23 +08:00

2007-09-21 15:19:54 +08:00
#ifdef CONFIG_BLOCK

2005-04-17 06:20:36 +08:00
#include <linux/major.h>
#include <linux/genhd.h>
#include <linux/list.h>
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
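The relationship between per-cpu submission queues and hardware queues described in this commit message can be illustrated with a very small sketch. The modulo spreading below is an assumption chosen for brevity, and `sketch_map_cpu_to_hwq` is a hypothetical name; the message only says that the mapping may be 1:1 or N:M depending on the hardware.

```c
#include <assert.h>

/*
 * Hypothetical mapping from a submitting CPU to a hardware queue index.
 * With as many hardware queues as CPUs this is 1:1; with fewer hardware
 * queues several CPUs funnel into the same one (N:M).
 */
static unsigned int sketch_map_cpu_to_hwq(unsigned int cpu,
					  unsigned int nr_hw_queues)
{
	assert(nr_hw_queues > 0);
	return cpu % nr_hw_queues;
}
```

A real mapping would likely take CPU topology into account, so this should be read as the simplest possible instance of the idea.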
2013-10-24 16:20:05 +08:00
#include <linux/llist.h>
2005-04-17 06:20:36 +08:00
#include <linux/timer.h>
#include <linux/workqueue.h>
#include <linux/pagemap.h>
2015-05-23 05:13:32 +08:00
#include <linux/backing-dev-defs.h>
2005-04-17 06:20:36 +08:00
#include <linux/wait.h>
#include <linux/mempool.h>
2016-01-16 08:56:14 +08:00
#include <linux/pfn.h>
2005-04-17 06:20:36 +08:00
#include <linux/bio.h>
#include <linux/stringify.h>
2008-09-11 16:57:55 +08:00
#include <linux/gfp.h>
2007-07-09 18:40:35 +08:00
#include <linux/bsg.h>
2008-09-14 02:26:01 +08:00
#include <linux/smp.h>
2013-01-10 00:05:13 +08:00
#include <linux/rcupdate.h>
2014-07-02 00:34:38 +08:00
#include <linux/percpu-refcount.h>
2015-05-01 18:46:15 +08:00
#include <linux/scatterlist.h>
2016-10-18 14:40:33 +08:00
#include <linux/blkzoned.h>
2005-04-17 06:20:36 +08:00

2011-05-27 01:46:22 +08:00
struct module;
2006-03-23 00:52:04 +08:00
struct scsi_ioctl_command;

2005-04-17 06:20:36 +08:00
struct request_queue;
struct elevator_queue;
2006-03-24 03:00:26 +08:00
struct blk_trace;
2007-07-09 18:38:05 +08:00
struct request;
struct sg_io_hdr;
2011-08-01 04:05:09 +08:00
struct bsg_job;
2012-04-17 04:57:25 +08:00
struct blkcg_gq;
2014-09-25 23:23:43 +08:00
struct blk_flush_queue;
2015-10-15 20:10:48 +08:00
struct pr_ops;
2018-07-03 23:32:35 +08:00
struct rq_qos;
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
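The callback-based reporting described in this commit message can be sketched in user-space C as below. All names are hypothetical and the sketch leaves out the RCU list and the percpu buffers the message mentions; it only shows the shape of the idea: samples are bucketed (here read vs. write), the window's statistics are handed to a registered callback, and the window is then cleared so no stale statistics remain.

```c
#include <stdint.h>
#include <string.h>

enum { SKETCH_READ = 0, SKETCH_WRITE = 1, SKETCH_NR_BUCKETS = 2 };

struct sketch_stat {
	uint64_t nr_samples;
	uint64_t sum_ns;
};

struct sketch_stat_callback {
	struct sketch_stat buckets[SKETCH_NR_BUCKETS];
	/* Called once per window with the statistics gathered so far. */
	void (*timer_fn)(struct sketch_stat_callback *cb);
	void *data;
};

/* I/O completion side: account one sample into its bucket. */
static void sketch_stat_add(struct sketch_stat_callback *cb, int is_write,
			    uint64_t latency_ns)
{
	struct sketch_stat *s = &cb->buckets[is_write ? SKETCH_WRITE : SKETCH_READ];

	s->nr_samples++;
	s->sum_ns += latency_ns;
}

/* Timer side: report the window to the user, then reset it. */
static void sketch_stat_window_end(struct sketch_stat_callback *cb)
{
	cb->timer_fn(cb);
	memset(cb->buckets, 0, sizeof(cb->buckets));
}

/* Example user: cache the previous window, as the polling case does. */
static void sketch_cache_window(struct sketch_stat_callback *cb)
{
	memcpy(cb->data, cb->buckets, sizeof(cb->buckets));
}
```

Because the timer side is the only place that clears the buckets, a consumer such as `sketch_cache_window` always sees one complete window.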
2017-03-21 23:56:08 +08:00
struct blk_queue_stats;
struct blk_stat_callback;
2005-04-17 06:20:36 +08:00

#define BLKDEV_MIN_RQ 4
#define BLKDEV_MAX_RQ 128 /* Default maximum */

2018-02-15 22:53:17 +08:00
/* Must be consistent with blk_mq_poll_stats_bkt() */
2017-04-21 06:59:11 +08:00
#define BLK_MQ_POLL_STATS_BKTS 16

2019-03-18 22:44:41 +08:00
/* Doing classic polling */
#define BLK_MQ_POLL_CLASSIC -1

2012-04-14 04:11:28 +08:00
/*
 * Maximum number of blkcg policies allowed to be registered concurrently.
 * Defined here to simplify include dependency.
 */
2018-09-12 00:59:53 +08:00
#define BLKCG_MAX_POLS 5
2012-04-14 04:11:28 +08:00

2017-06-03 15:38:04 +08:00
typedef void (rq_end_io_fn)(struct request *, blk_status_t);
2005-04-17 06:20:36 +08:00

2016-10-20 21:12:13 +08:00
/*
 * request flags */
typedef __u32 __bitwise req_flags_t;

/* elevator knows about this request */
#define RQF_SORTED ((__force req_flags_t)(1 << 0))
/* drive already may have started this one */
#define RQF_STARTED ((__force req_flags_t)(1 << 1))
/* may not be passed by ioscheduler */
#define RQF_SOFTBARRIER ((__force req_flags_t)(1 << 3))
/* request for flush sequence */
#define RQF_FLUSH_SEQ ((__force req_flags_t)(1 << 4))
/* merge of different types, fail separately */
#define RQF_MIXED_MERGE ((__force req_flags_t)(1 << 5))
/* track inflight for MQ */
#define RQF_MQ_INFLIGHT ((__force req_flags_t)(1 << 6))
/* don't call prep for this one */
#define RQF_DONTPREP ((__force req_flags_t)(1 << 7))
/* set for "ide_preempt" requests and also for requests for which the SCSI
   "quiesce" state must be ignored. */
#define RQF_PREEMPT ((__force req_flags_t)(1 << 8))
/* contains copies of user pages */
#define RQF_COPY_USER ((__force req_flags_t)(1 << 9))
/* vaguely specified driver internal error. Ignored by the block layer */
#define RQF_FAILED ((__force req_flags_t)(1 << 10))
/* don't warn about errors */
#define RQF_QUIET ((__force req_flags_t)(1 << 11))
/* elevator private data attached */
#define RQF_ELVPRIV ((__force req_flags_t)(1 << 12))
2018-10-11 15:07:06 +08:00
/* account into disk and partition IO statistics */
2016-10-20 21:12:13 +08:00
#define RQF_IO_STAT ((__force req_flags_t)(1 << 13))
/* request came from our alloc pool */
#define RQF_ALLOCED ((__force req_flags_t)(1 << 14))
/* runtime pm request */
#define RQF_PM ((__force req_flags_t)(1 << 15))
/* on IO scheduler merge hash */
#define RQF_HASHED ((__force req_flags_t)(1 << 16))
2018-10-11 15:07:06 +08:00
/* track IO completion time */
2016-11-08 12:32:37 +08:00
#define RQF_STATS ((__force req_flags_t)(1 << 17))
2016-12-09 06:20:32 +08:00
/* Look at ->special_vec for the actual data payload instead of the
   bio chain. */
#define RQF_SPECIAL_PAYLOAD ((__force req_flags_t)(1 << 18))
2017-12-21 14:43:38 +08:00
/* The per-zone write lock is held for this request */
#define RQF_ZONE_WRITE_LOCKED ((__force req_flags_t)(1 << 19))
2018-01-11 02:30:56 +08:00
/* already slept for hybrid poll */
2018-05-29 21:52:28 +08:00
#define RQF_MQ_POLL_SLEPT ((__force req_flags_t)(1 << 20))
2018-06-14 19:58:45 +08:00
/* ->timeout has been called, don't expire again */
#define RQF_TIMED_OUT ((__force req_flags_t)(1 << 21))
2016-10-20 21:12:13 +08:00

/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
2016-12-09 06:20:32 +08:00
	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
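As a rough illustration of how a mask like RQF_NOMERGE_FLAGS is meant to be used, the following user-space sketch mirrors a few of the bit positions listed above and tests a flag word against the mask. The `SKETCH_` names and `sketch_flags_allow_merge()` are hypothetical, and the kernel's `__bitwise`/`__force` sparse annotations are dropped so the code compiles outside the kernel; a real merge decision may involve further checks beyond these flags.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t sketch_req_flags_t;

/* Bit positions copied from the RQF_* definitions in this header. */
#define SKETCH_RQF_STARTED		(1u << 1)
#define SKETCH_RQF_SOFTBARRIER		(1u << 3)
#define SKETCH_RQF_FLUSH_SEQ		(1u << 4)
#define SKETCH_RQF_QUIET		(1u << 11)
#define SKETCH_RQF_SPECIAL_PAYLOAD	(1u << 18)

#define SKETCH_RQF_NOMERGE_FLAGS \
	(SKETCH_RQF_STARTED | SKETCH_RQF_SOFTBARRIER | \
	 SKETCH_RQF_FLUSH_SEQ | SKETCH_RQF_SPECIAL_PAYLOAD)

/* True if none of the merge-preventing flags is set. */
static bool sketch_flags_allow_merge(sketch_req_flags_t flags)
{
	return !(flags & SKETCH_RQF_NOMERGE_FLAGS);
}
```

Flags outside the mask, such as the quiet flag, do not affect the result.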
2016-10-20 21:12:13 +08:00

2018-05-29 21:52:28 +08:00
/*
 * Request state for blk-mq.
 */
enum mq_rq_state {
	MQ_RQ_IDLE = 0,
	MQ_RQ_IN_FLIGHT = 1,
	MQ_RQ_COMPLETE = 2,
};

2005-04-17 06:20:36 +08:00
/*
2014-05-06 18:12:45 +08:00
 * Try to put the fields that are referenced together in the same cacheline.
 *
 * If you modify this structure, make sure to update blk_rq_init() and
 * especially blk_mq_rq_ctx_init() to take care of the added fields.
2005-04-17 06:20:36 +08:00
 */
struct request {
2007-07-24 15:28:11 +08:00
	struct request_queue *q;
2013-10-24 16:20:05 +08:00
	struct blk_mq_ctx *mq_ctx;
2018-10-30 05:06:13 +08:00
	struct blk_mq_hw_ctx *mq_hctx;
2006-08-10 15:00:21 +08:00

2016-10-28 22:48:16 +08:00
	unsigned int cmd_flags; /* op and common flags */
2016-10-20 21:12:13 +08:00
	req_flags_t rq_flags;
2017-02-01 03:34:41 +08:00

2019-06-09 04:15:51 +08:00
	int tag;
2017-02-01 03:34:41 +08:00
	int internal_tag;

2009-05-07 21:24:44 +08:00
	/* the following two fields are internal, NEVER access directly */
	unsigned int __data_len; /* total data len */
2010-03-19 15:58:16 +08:00
	sector_t __sector; /* sector cursor */
2005-04-17 06:20:36 +08:00

	struct bio *bio;
	struct bio *biotail;

2018-01-11 02:46:39 +08:00
	struct list_head queuelist;

2014-04-10 10:27:01 +08:00
	/*
	 * The hash is used inside the scheduler, and killed once the
	 * request reaches the dispatch list. The ipi_list is only used
	 * to queue the request for softirq completion, which is long
	 * after the request has been unhashed (and even removed from
	 * the dispatch list).
	 */
	union {
		struct hlist_node hash; /* merge hash */
		struct list_head ipi_list;
	};

2006-08-10 15:00:21 +08:00
	/*
	 * The rb_node is only used inside the io scheduler, requests
	 * are pruned when moved to the dispatch queue. So let the
2011-02-11 18:08:00 +08:00
	 * completion_data share space with the rb_node.
2006-08-10 15:00:21 +08:00
	 */
	union {
		struct rb_node rb_node; /* sort/lookup */
2016-12-09 06:20:32 +08:00
		struct bio_vec special_vec;
2011-02-11 18:08:00 +08:00
		void *completion_data;
2017-04-20 22:03:11 +08:00
		int error_count; /* for legacy drivers, don't use */
2006-08-10 15:00:21 +08:00
	};
2006-07-28 15:23:08 +08:00

2006-07-12 20:04:37 +08:00
	/*
2010-04-21 23:44:16 +08:00
	 * Three pointers are available for the IO schedulers, if they need
2011-02-11 18:08:00 +08:00
	 * more they have to dynamically allocate it. Flush requests are
	 * never put on the IO scheduler. So let the flush fields share
2011-12-14 07:33:41 +08:00
	 * space with the elevator data.
2006-07-12 20:04:37 +08:00
	 */
2011-02-11 18:08:00 +08:00
	union {
2011-12-14 07:33:41 +08:00
		struct {
			struct io_cq *icq;
			void *priv[2];
		} elv;

2011-02-11 18:08:00 +08:00
		struct {
			unsigned int seq;
			struct list_head list;
block: fix flush machinery for stacking drivers with differing flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}
Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-16 03:37:25 +08:00
			rq_end_io_fn *saved_end_io;
2011-02-11 18:08:00 +08:00
		} flush;
	};
2006-07-12 20:04:37 +08:00

2006-06-13 15:02:34 +08:00
	struct gendisk *rq_disk;
2011-01-05 23:57:38 +08:00
	struct hd_struct *part;
2019-08-29 06:05:57 +08:00
#ifdef CONFIG_BLK_RQ_ALLOC_TIME
	/* Time that the first bio started allocating this request. */
	u64 alloc_time_ns;
#endif
	/* Time that this request was allocated for this IO. */
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
	u64 start_time_ns;
2018-05-09 17:08:50 +08:00
	/* Time that I/O was submitted to the device. */
	u64 io_start_time_ns;

#ifdef CONFIG_BLK_WBT
	unsigned short wbt_flags;
#endif
2019-05-21 15:59:03 +08:00
	/*
	 * rq sectors used for blk stats. It has the same value
	 * with blk_rq_sectors(rq), except that it never be zeroed
	 * by completion.
	 */
	unsigned short stats_sectors;
2018-05-09 17:08:50 +08:00

	/*
	 * Number of scatter-gather DMA addr+len pairs after
2005-04-17 06:20:36 +08:00
	 * physical address coalescing is performed.
	 */
	unsigned short nr_phys_segments;
2018-01-11 02:46:39 +08:00

2010-09-11 02:50:10 +08:00
#if defined(CONFIG_BLK_DEV_INTEGRITY)
	unsigned short nr_integrity_segments;
#endif
2005-04-17 06:20:36 +08:00

2018-01-11 02:46:39 +08:00
	unsigned short write_hint;
2006-06-13 15:02:34 +08:00
	unsigned short ioprio;

2008-03-04 18:17:11 +08:00
	unsigned int extra_len; /* length of alignment and padding */
2005-04-17 06:20:36 +08:00

2018-05-29 21:52:28 +08:00
	enum mq_rq_state state;
	refcount_t ref;
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after the RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely time out the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
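The four steps above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: the `sk_` names, the 2-bit state field, and the struct layout are hypothetical, and the RCU wait and the seqcount protecting the deadline are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low 2 bits hold the state, the rest the generation. */
enum sk_state { SK_IDLE = 0, SK_IN_FLIGHT = 1, SK_COMPLETE = 2 };

#define SK_STATE_BITS 2
#define SK_STATE_MASK ((UINT64_C(1) << SK_STATE_BITS) - 1)
#define SK_GEN_INC    (UINT64_C(1) << SK_STATE_BITS)

struct sk_rq {
	uint64_t gstate;         /* generation + state, written by the owner */
	uint64_t aborted_gstate; /* written by the timeout path only */
};

static enum sk_state sk_state(uint64_t gstate)
{
	return (enum sk_state)(gstate & SK_STATE_MASK);
}

static void sk_init(struct sk_rq *rq)
{
	rq->gstate = SK_IDLE;
	/* An IDLE value can never equal an in-flight gstate. */
	rq->aborted_gstate = SK_IDLE;
}

/* Step 1: becoming in-flight bumps the generation. */
static void sk_start(struct sk_rq *rq)
{
	uint64_t gen = rq->gstate & ~SK_STATE_MASK;

	rq->gstate = (gen + SK_GEN_INC) | SK_IN_FLIGHT;
}

/* Step 2: the timeout path records the instance it wants to abort. */
static void sk_mark_aborted(struct sk_rq *rq, uint64_t seen_gstate)
{
	assert(sk_state(seen_gstate) == SK_IN_FLIGHT);
	rq->aborted_gstate = seen_gstate;
}

/* Step 3: completion yields if the timeout path claimed this instance. */
static bool sk_complete(struct sk_rq *rq)
{
	if (rq->gstate == rq->aborted_gstate)
		return false;
	rq->gstate = (rq->gstate & ~SK_STATE_MASK) | SK_COMPLETE;
	return true;
}

/* Step 4: after the (omitted) RCU wait, terminate only a matching instance. */
static bool sk_should_terminate(const struct sk_rq *rq)
{
	return rq->gstate == rq->aborted_gstate;
}
```

In this sketch a stale verdict is harmless: if the request completed and was restarted between the timeout path's read and its second scan, the generation differs and `sk_should_terminate()` returns false.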
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
the BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 00:29:48 +08:00
|
|
|
|
2018-05-29 22:47:57 +08:00
|
|
|
unsigned int timeout;
|
2018-11-15 00:02:05 +08:00
|
|
|
unsigned long deadline;
|
2017-06-27 23:22:02 +08:00
|
|
|
|
2018-01-11 02:46:39 +08:00
|
|
|
union {
|
Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and removing
rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's always been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it got reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
2018-01-30 03:51:49 +08:00
|
|
|
struct __call_single_data csd;
|
2018-01-11 02:46:39 +08:00
|
|
|
u64 fifo_time;
|
|
|
|
};
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2006-10-01 02:29:12 +08:00
|
|
|
* completion callback.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
rq_end_io_fn *end_io;
|
|
|
|
void *end_io_data;
|
|
|
|
};
|
|
|
|
|
2017-12-18 15:40:43 +08:00
|
|
|
static inline bool blk_op_is_scsi(unsigned int op)
|
|
|
|
{
|
|
|
|
return op == REQ_OP_SCSI_IN || op == REQ_OP_SCSI_OUT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_op_is_private(unsigned int op)
|
|
|
|
{
|
|
|
|
return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT;
|
|
|
|
}
|
|
|
|
|
2017-01-31 23:57:31 +08:00
|
|
|
static inline bool blk_rq_is_scsi(struct request *rq)
|
|
|
|
{
|
2017-12-18 15:40:43 +08:00
|
|
|
return blk_op_is_scsi(req_op(rq));
|
2017-01-31 23:57:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_rq_is_private(struct request *rq)
|
|
|
|
{
|
2017-12-18 15:40:43 +08:00
|
|
|
return blk_op_is_private(req_op(rq));
|
2017-01-31 23:57:31 +08:00
|
|
|
}
|
|
|
|
|
2017-01-31 23:57:29 +08:00
|
|
|
static inline bool blk_rq_is_passthrough(struct request *rq)
|
|
|
|
{
|
2017-01-31 23:57:31 +08:00
|
|
|
return blk_rq_is_scsi(rq) || blk_rq_is_private(rq);
|
2017-01-31 23:57:29 +08:00
|
|
|
}
|
|
|
|
|
2017-12-18 15:40:43 +08:00
|
|
|
static inline bool bio_is_passthrough(struct bio *bio)
|
|
|
|
{
|
|
|
|
unsigned op = bio_op(bio);
|
|
|
|
|
|
|
|
return blk_op_is_scsi(op) || blk_op_is_private(op);
|
|
|
|
}
|
|
|
|
|
2008-08-14 15:59:13 +08:00
|
|
|
static inline unsigned short req_get_ioprio(struct request *req)
|
|
|
|
{
|
|
|
|
return req->ioprio;
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#include <linux/elevator.h>
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
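As a rough illustration of how per-cpu software queues can funnel into a smaller set of hardware queues, here is a minimal sketch. The helper name and the modulo policy are assumptions made for the example; the commit message does not say how blk-mq actually chooses the mapping.

```c
#include <assert.h>

/*
 * Hypothetical helper: pick the hardware submission queue that a
 * per-cpu software queue feeds. A plain modulo spreads CPUs evenly.
 */
static unsigned int sk_map_cpu_to_hwq(unsigned int cpu,
				      unsigned int nr_hw_queues)
{
	assert(nr_hw_queues > 0);
	return cpu % nr_hw_queues;
}
```

With as many hardware queues as CPUs this gives the 1:1 case; with fewer hardware queues, several CPUs share one, which is the N:M case.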
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
struct blk_queue_ctx;
|
|
|
|
|
2015-11-06 01:41:16 +08:00
|
|
|
typedef blk_qc_t (make_request_fn) (struct request_queue *q, struct bio *bio);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
struct bio_vec;
|
2008-02-19 18:36:53 +08:00
|
|
|
typedef int (dma_drain_needed_fn)(struct request *);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-09-14 20:55:09 +08:00
|
|
|
enum blk_eh_timer_return {
|
2018-05-29 21:52:38 +08:00
|
|
|
BLK_EH_DONE, /* driver has completed the command */
|
|
|
|
BLK_EH_RESET_TIMER, /* reset timer and try again */
|
2008-09-14 20:55:09 +08:00
|
|
|
};
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
enum blk_queue_state {
|
|
|
|
Queue_down,
|
|
|
|
Queue_up,
|
|
|
|
};
|
|
|
|
|
2015-01-16 09:32:25 +08:00
|
|
|
#define BLK_TAG_ALLOC_FIFO 0 /* allocate starting from 0 */
|
|
|
|
#define BLK_TAG_ALLOC_RR 1 /* allocate starting from last allocated tag */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-08-16 13:10:05 +08:00
|
|
|
#define BLK_SCSI_MAX_CMDS (256)
|
|
|
|
#define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
|
|
|
|
|
2016-10-18 14:40:29 +08:00
|
|
|
/*
|
|
|
|
* Zoned block device models (zoned limit).
|
|
|
|
*/
|
|
|
|
enum blk_zoned_model {
|
|
|
|
BLK_ZONED_NONE, /* Regular block device */
|
|
|
|
BLK_ZONED_HA, /* Host-aware zoned block device */
|
|
|
|
BLK_ZONED_HM, /* Host-managed zoned block device */
|
|
|
|
};
|
|
|
|
|
2009-05-23 05:17:51 +08:00
|
|
|
struct queue_limits {
|
|
|
|
unsigned long bounce_pfn;
|
|
|
|
unsigned long seg_boundary_mask;
|
2015-08-20 05:24:05 +08:00
|
|
|
unsigned long virt_boundary_mask;
|
2009-05-23 05:17:51 +08:00
|
|
|
|
|
|
|
unsigned int max_hw_sectors;
|
2015-11-14 05:46:48 +08:00
|
|
|
unsigned int max_dev_sectors;
|
2014-06-06 03:38:39 +08:00
|
|
|
unsigned int chunk_sectors;
|
2009-05-23 05:17:51 +08:00
|
|
|
unsigned int max_sectors;
|
|
|
|
unsigned int max_segment_size;
|
2009-05-23 05:17:53 +08:00
|
|
|
unsigned int physical_block_size;
|
2020-01-15 21:35:25 +08:00
|
|
|
unsigned int logical_block_size;
|
2009-05-23 05:17:53 +08:00
|
|
|
unsigned int alignment_offset;
|
|
|
|
unsigned int io_min;
|
|
|
|
unsigned int io_opt;
|
2009-09-30 19:54:20 +08:00
|
|
|
unsigned int max_discard_sectors;
|
2015-07-16 23:14:26 +08:00
|
|
|
unsigned int max_hw_discard_sectors;
|
2012-09-19 00:19:27 +08:00
|
|
|
unsigned int max_write_same_sectors;
|
2016-12-01 04:28:59 +08:00
|
|
|
unsigned int max_write_zeroes_sectors;
|
2009-11-10 18:50:21 +08:00
|
|
|
unsigned int discard_granularity;
|
|
|
|
unsigned int discard_alignment;
|
2009-05-23 05:17:51 +08:00
|
|
|
|
2010-02-26 13:20:39 +08:00
|
|
|
unsigned short max_segments;
|
2010-09-11 02:50:10 +08:00
|
|
|
unsigned short max_integrity_segments;
|
2017-02-08 21:46:49 +08:00
|
|
|
unsigned short max_discard_segments;
|
2009-05-23 05:17:51 +08:00
|
|
|
|
2009-05-23 05:17:53 +08:00
|
|
|
unsigned char misaligned;
|
2009-11-10 18:50:21 +08:00
|
|
|
unsigned char discard_misaligned;
|
2013-07-12 13:39:53 +08:00
|
|
|
unsigned char raid_partial_stripes_expensive;
|
2016-10-18 14:40:29 +08:00
|
|
|
enum blk_zoned_model zoned;
|
2009-05-23 05:17:51 +08:00
|
|
|
};
|
|
|
|
|
2019-11-11 10:39:30 +08:00
|
|
|
typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
|
|
|
|
void *data);
|
|
|
|
|
2016-10-18 14:40:33 +08:00
|
|
|
#ifdef CONFIG_BLK_DEV_ZONED
|
|
|
|
|
2019-11-11 10:39:30 +08:00
|
|
|
#define BLK_ALL_ZONES ((unsigned int)-1)
|
|
|
|
int blkdev_report_zones(struct block_device *bdev, sector_t sector,
|
|
|
|
unsigned int nr_zones, report_zones_cb cb, void *data);
|
2019-12-03 17:39:04 +08:00
|
|
|
unsigned int blkdev_nr_zones(struct gendisk *disk);
|
2019-10-27 22:05:45 +08:00
|
|
|
extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
|
|
|
|
sector_t sectors, sector_t nr_sectors,
|
|
|
|
gfp_t gfp_mask);
|
2018-10-12 18:08:50 +08:00
|
|
|
extern int blk_revalidate_disk_zones(struct gendisk *disk);
|
2016-10-18 14:40:33 +08:00
|
|
|
|
2016-10-18 14:40:35 +08:00
|
|
|
extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
|
|
|
|
unsigned int cmd, unsigned long arg);
|
2019-10-27 22:05:46 +08:00
|
|
|
extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
|
|
|
|
unsigned int cmd, unsigned long arg);
|
2016-10-18 14:40:35 +08:00
|
|
|
|
|
|
|
#else /* CONFIG_BLK_DEV_ZONED */
|
|
|
|
|
2019-12-03 17:39:04 +08:00
|
|
|
static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
|
2018-10-12 18:08:43 +08:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2018-10-12 18:08:50 +08:00
|
|
|
|
2016-10-18 14:40:35 +08:00
|
|
|
static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
|
|
|
|
fmode_t mode, unsigned int cmd,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
|
|
|
|
2019-10-27 22:05:46 +08:00
|
|
|
static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
|
|
|
|
fmode_t mode, unsigned int cmd,
|
|
|
|
unsigned long arg)
|
2016-10-18 14:40:35 +08:00
|
|
|
{
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
|
|
|
|
2016-10-18 14:40:33 +08:00
|
|
|
#endif /* CONFIG_BLK_DEV_ZONED */
|
|
|
|
|
2011-07-14 03:17:23 +08:00
|
|
|
struct request_queue {
|
2005-04-17 06:20:36 +08:00
|
|
|
struct request *last_merge;
|
2008-10-31 17:05:07 +08:00
|
|
|
struct elevator_queue *elevator;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
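The callback model described above might look roughly like the following userspace sketch. All names (`sk_stat_callback`, `bucket_fn`, `timer_fn`) are hypothetical stand-ins, and the RCU list and percpu buffers mentioned in the commit message are replaced here by a single plain struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_BUCKETS 4

struct sk_stat {
	uint64_t nr_samples;
	uint64_t total;
};

struct sk_stat_callback {
	/* Maps a sample to a bucket index, e.g. read vs. write. */
	unsigned int (*bucket_fn)(int is_write);
	/* Invoked at the end of a window with that window's statistics. */
	void (*timer_fn)(struct sk_stat_callback *cb);
	unsigned int nr_buckets;
	struct sk_stat stat[SK_MAX_BUCKETS];
	void *data; /* private to the user of the callback */
};

/* I/O completion side: account one sample in the user's bucket. */
static void sk_stat_add(struct sk_stat_callback *cb, int is_write,
			uint64_t latency)
{
	unsigned int b = cb->bucket_fn(is_write);

	assert(b < cb->nr_buckets);
	cb->stat[b].nr_samples++;
	cb->stat[b].total += latency;
}

/* Timer side: hand the window to the user, then clear it. */
static void sk_stat_window_expired(struct sk_stat_callback *cb)
{
	cb->timer_fn(cb);
	for (unsigned int i = 0; i < cb->nr_buckets; i++) {
		cb->stat[i].nr_samples = 0;
		cb->stat[i].total = 0;
	}
}

/* Example user: bucket 0 is reads, bucket 1 is writes. */
static unsigned int sk_rw_bucket(int is_write)
{
	return is_write ? 1 : 0;
}

/* Example user: cache the mean read latency of the finished window. */
static void sk_cache_read_mean(struct sk_stat_callback *cb)
{
	uint64_t *mean = cb->data;
	const struct sk_stat *rd = &cb->stat[0];

	*mean = rd->nr_samples ? rd->total / rd->nr_samples : 0;
}
```

Because the timer side clears the buckets right after the callback runs, a user in this sketch never sees samples from an earlier window.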
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 23:56:08 +08:00
|
|
|
struct blk_queue_stats *stats;
|
2018-07-03 23:32:35 +08:00
|
|
|
struct rq_qos *rq_qos;
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
smoother, and to have far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
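The defaults quoted above can be written down as a small sketch. The function names are hypothetical and only restate the numbers from the message (2 msec, 75 msec, and '0' meaning disabled), converted to the microsecond unit that 'wb_lat_usec' implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convert milliseconds to microseconds, guarding against overflow. */
static uint64_t sk_msec_to_usec(uint64_t msec)
{
	assert(msec <= UINT64_MAX / 1000);
	return msec * 1000;
}

/* Default latency target: 75 msec if rotational, 2 msec otherwise. */
static uint64_t sk_default_wb_lat_usec(bool rotational)
{
	return sk_msec_to_usec(rotational ? 75 : 2);
}

/* A target of 0 disables throttling. */
static bool sk_wb_enabled(uint64_t wb_lat_usec)
{
	return wb_lat_usec != 0;
}
```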
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 03:38:14 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
make_request_fn *make_request_fn;
|
2008-02-19 18:36:53 +08:00
|
|
|
dma_drain_needed_fn *dma_drain_needed;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-12-14 00:24:51 +08:00
|
|
|
const struct blk_mq_ops *mq_ops;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
/* sw queues */
|
2014-06-03 11:24:06 +08:00
|
|
|
struct blk_mq_ctx __percpu *queue_ctx;
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2016-03-31 00:21:08 +08:00
|
|
|
unsigned int queue_depth;
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
/* hw dispatch queues */
|
|
|
|
struct blk_mq_hw_ctx **queue_hw_ctx;
|
|
|
|
unsigned int nr_hw_queues;
|
|
|
|
|
2017-02-02 22:56:50 +08:00
|
|
|
struct backing_dev_info *backing_dev_info;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The queue owner gets to use this for whatever they like.
|
|
|
|
* ll_rw_blk doesn't touch it.
|
|
|
|
*/
|
|
|
|
void *queuedata;
|
|
|
|
|
|
|
|
/*
|
2011-07-14 03:17:23 +08:00
|
|
|
* various queue flags, see QUEUE_* below
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2011-07-14 03:17:23 +08:00
|
|
|
unsigned long queue_flags;
|
2018-09-27 05:01:04 +08:00
|
|
|
/*
|
|
|
|
* Number of contexts that have called blk_set_pm_only(). If this
|
|
|
|
* counter is above zero then only RQF_PM and RQF_PREEMPT requests are
|
|
|
|
* processed.
|
|
|
|
*/
|
|
|
|
atomic_t pm_only;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2011-12-14 07:33:37 +08:00
|
|
|
/*
|
|
|
|
* ida allocated id for this queue. Used to index queues from
|
|
|
|
* ioctx.
|
|
|
|
*/
|
|
|
|
int id;
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2011-07-14 03:17:23 +08:00
|
|
|
* queue needs bounce pages for pages above this limit
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2011-07-14 03:17:23 +08:00
|
|
|
gfp_t bounce_gfp;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-11-16 03:17:28 +08:00
|
|
|
spinlock_t queue_lock;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* queue kobject
|
|
|
|
*/
|
|
|
|
struct kobject kobj;
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking requests on a per-device
basis. The driver should be able to get a notification
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
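The mapping from per-CPU issue queues to hardware submission queues described in the commit message above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct sketch_queue_map`, `sketch_cpu_to_hwq()` and the modulo policy are hypothetical and are not the kernel's implementation or API.

```c
#include <assert.h>

/* Hypothetical types and names for illustration; not the kernel's API. */
struct sketch_queue_map {
	unsigned int nr_cpus;      /* one software (issue) queue per CPU */
	unsigned int nr_hw_queues; /* hardware submission queues */
};

/*
 * Pick the hardware queue that a CPU's software queue funnels into.
 * With nr_cpus == nr_hw_queues this is a 1:1 mapping; with fewer
 * hardware queues, several software queues share one (assumed policy:
 * simple modulo distribution).
 */
static unsigned int sketch_cpu_to_hwq(const struct sketch_queue_map *map,
				      unsigned int cpu)
{
	assert(map->nr_hw_queues > 0);
	assert(cpu < map->nr_cpus);
	return cpu % map->nr_hw_queues;
}
```

With 8 CPUs and 2 hardware queues, CPUs 0, 2, 4 and 6 share hardware queue 0 in this sketch; a real mapping would more likely follow CPU topology than a plain modulo.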
2013-10-24 16:20:05 +08:00
	/*
	 * mq queue kobject
	 */
2018-11-20 09:44:35 +08:00
	struct kobject *mq_kobj;
2013-10-24 16:20:05 +08:00

2015-10-22 01:20:18 +08:00
#ifdef CONFIG_BLK_DEV_INTEGRITY
	struct blk_integrity integrity;
#endif /* CONFIG_BLK_DEV_INTEGRITY */

2014-12-04 08:00:23 +08:00
#ifdef CONFIG_PM
2013-03-23 11:42:26 +08:00
	struct device *dev;
	int rpm_status;
	unsigned int nr_pending;
#endif

2005-04-17 06:20:36 +08:00
	/*
	 * queue settings
	 */
	unsigned long nr_requests; /* Max # of requests */

2008-01-11 01:30:36 +08:00
	unsigned int dma_drain_size;
2011-07-14 03:17:23 +08:00
	void *dma_drain_buffer;
2008-03-04 18:18:17 +08:00
	unsigned int dma_pad_mask;
2005-04-17 06:20:36 +08:00
	unsigned int dma_alignment;

2008-09-14 20:55:09 +08:00
	unsigned int rq_timeout;
2016-11-15 04:03:03 +08:00
	int poll_nsec;
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
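The callback model described in the commit message above (a bucket function, per-window accumulation, and a timer function that consumes and then clears the window) can be sketched as follows. This is a single-threaded illustration under stated assumptions: all `sketch_*` names are hypothetical, and the per-CPU buffers and RCU list mentioned in the message are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NR_BUCKETS 2 /* assumption: bucket 0 = reads, 1 = writes */

struct sketch_rq_stat {
	uint64_t nr_samples;
	uint64_t total_ns;
};

struct sketch_stat_callback {
	/* chooses a bucket for a completed I/O; out-of-range means "skip" */
	int (*bucket_fn)(int is_write);
	/* called once per window with the accumulated buckets */
	void (*timer_fn)(struct sketch_stat_callback *cb);
	struct sketch_rq_stat stat[SKETCH_NR_BUCKETS];
	void *data; /* owner-private pointer */
};

/* Example bucket function: split by read vs. write. */
static int sketch_rw_bucket(int is_write)
{
	return is_write ? 1 : 0;
}

/* Example timer function: cache the finished window in cb->data. */
static void sketch_cache_window(struct sketch_stat_callback *cb)
{
	memcpy(cb->data, cb->stat, sizeof(cb->stat));
}

/* Completion side: account one I/O in the current window. */
static void sketch_stat_add(struct sketch_stat_callback *cb, int is_write,
			    uint64_t latency_ns)
{
	int bucket = cb->bucket_fn(is_write);

	if (bucket < 0 || bucket >= SKETCH_NR_BUCKETS)
		return;
	cb->stat[bucket].nr_samples++;
	cb->stat[bucket].total_ns += latency_ns;
}

/* Timer side: hand the window to the user, then start a fresh one. */
static void sketch_window_end(struct sketch_stat_callback *cb)
{
	cb->timer_fn(cb);
	memset(cb->stat, 0, sizeof(cb->stat));
}
```

Because the window is cleared only after the callback has run, a consumer such as the polling heuristic always sees a complete previous window, which is the property the commit message is after.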
2017-03-21 23:56:08 +08:00

	struct blk_stat_callback *poll_cb;
2017-04-21 06:59:11 +08:00
	struct blk_rq_stat poll_stat[BLK_MQ_POLL_STATS_BKTS];
2017-03-21 23:56:08 +08:00

2008-09-14 20:55:09 +08:00
	struct timer_list timeout;
2015-10-30 20:57:30 +08:00
	struct work_struct timeout_work;
2008-09-14 20:55:09 +08:00

2011-12-14 07:33:41 +08:00
	struct list_head icq_list;
2012-03-06 05:15:18 +08:00
#ifdef CONFIG_BLK_CGROUP
2012-04-14 04:11:33 +08:00
	DECLARE_BITMAP(blkcg_pols, BLKCG_MAX_POLS);
2012-04-17 04:57:25 +08:00
	struct blkcg_gq *root_blkg;
2012-03-06 05:15:19 +08:00
	struct list_head blkg_list;
2012-03-06 05:15:18 +08:00
#endif
2011-12-14 07:33:41 +08:00

2009-05-23 05:17:51 +08:00
	struct queue_limits limits;

2019-09-05 17:51:31 +08:00
	unsigned int required_elevator_features;

2018-06-16 05:55:21 +08:00
#ifdef CONFIG_BLK_DEV_ZONED
2017-12-21 14:43:38 +08:00
	/*
	 * Zoned block device information for request dispatch control.
	 * nr_zones is the total number of zones of the device. This is always
2019-12-03 17:39:05 +08:00
	 * 0 for regular block devices. conv_zones_bitmap is a bitmap of nr_zones
	 * bits which indicates if a zone is conventional (bit set) or
	 * sequential (bit clear). seq_zones_wlock is a bitmap of nr_zones
2017-12-21 14:43:38 +08:00
	 * bits which indicates if a zone is write locked, that is, if a write
	 * request targeting the zone was dispatched. All three fields are
	 * initialized by the low level device driver (e.g. scsi/sd.c).
	 * Stacking drivers (device mappers) may or may not initialize
	 * these fields.
2018-04-17 09:04:41 +08:00
	 *
	 * Reads of this information must be protected with blk_queue_enter() /
	 * blk_queue_exit(). Modifying this information is only allowed while
	 * no requests are being processed. See also blk_mq_freeze_queue() and
	 * blk_mq_unfreeze_queue().
2017-12-21 14:43:38 +08:00
	 */
	unsigned int nr_zones;
2019-12-03 17:39:05 +08:00
	unsigned long *conv_zones_bitmap;
2017-12-21 14:43:38 +08:00
	unsigned long *seq_zones_wlock;
2018-06-16 05:55:21 +08:00
#endif /* CONFIG_BLK_DEV_ZONED */
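The two zone bitmaps documented in the comment above can be illustrated with a small user-space sketch. It assumes hypothetical `sketch_*` helpers and plain, non-atomic bit operations; kernel code would use the kernel bitmap helpers and would need the queue-enter/freeze protection that the comment describes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the three zoned fields of the queue. */
struct sketch_zoned_queue {
	unsigned int nr_zones;
	unsigned long *conv_zones_bitmap; /* bit set: conventional zone */
	unsigned long *seq_zones_wlock;   /* bit set: zone is write locked */
};

static bool sketch_test_bit(const unsigned long *map, unsigned int nr)
{
	return (map[nr / SKETCH_BITS_PER_LONG] >>
		(nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

/* Non-atomic test-and-set; returns the previous value of the bit. */
static bool sketch_test_and_set_bit(unsigned long *map, unsigned int nr)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long *word = &map[nr / SKETCH_BITS_PER_LONG];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/*
 * Returns true if a write to zone zno may be dispatched now.
 * Conventional zones need no write lock; a sequential zone accepts
 * one in-flight write at a time in this sketch.
 */
static bool sketch_zone_write_trylock(struct sketch_zoned_queue *q,
				      unsigned int zno)
{
	assert(zno < q->nr_zones);
	if (q->conv_zones_bitmap && sketch_test_bit(q->conv_zones_bitmap, zno))
		return true;
	return !sketch_test_and_set_bit(q->seq_zones_wlock, zno);
}
```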
2017-12-21 14:43:38 +08:00

2005-04-17 06:20:36 +08:00
	/*
	 * sg stuff
	 */
	unsigned int sg_timeout;
	unsigned int sg_reserved_size;
2005-06-23 15:08:19 +08:00
	int node;
2006-09-29 16:59:40 +08:00
#ifdef CONFIG_BLK_DEV_IO_TRACE
2006-03-24 03:00:26 +08:00
	struct blk_trace *blk_trace;
2017-09-21 03:12:20 +08:00
	struct mutex blk_trace_mutex;
2006-09-29 16:59:40 +08:00
#endif
2005-04-17 06:20:36 +08:00
	/*
2010-09-03 17:56:16 +08:00
	 * for flush operations
2005-04-17 06:20:36 +08:00
	 */
2014-09-25 23:23:43 +08:00
	struct blk_flush_queue *fq;
2006-03-19 07:34:37 +08:00

2014-05-28 22:08:02 +08:00
	struct list_head requeue_list;
	spinlock_t requeue_lock;
2016-09-15 01:28:30 +08:00
	struct delayed_work requeue_work;
2014-05-28 22:08:02 +08:00

2006-03-19 07:34:37 +08:00
	struct mutex sysfs_lock;
block: split .sysfs_lock into two locks
The kernfs built-in lock 'kn->count' is held in the sysfs .show/.store
path. Meanwhile, inside the block layer's .show/.store callbacks,
q->sysfs_lock is required.
However, when the mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock
'kn->count' is also required inside kobject_del(); see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
either blk_mq_unregister_dev() or elv_unregister_queue(), because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also, a sysfs write (store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named .sysfs_lock and
covers synchronous .store, the other is named .sysfs_dir_lock
and covers kobjects and related status changes.
sysfs itself can handle the race between adding/removing kobjects and
showing/storing attributes under those kobjects. For switching the scheduler
via a store to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED together with .sysfs_lock to avoid the race, so
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 19:01:48 +08:00
	struct mutex sysfs_dir_lock;
2007-07-09 18:40:35 +08:00

blk-mq: always free hctx after request queue is freed
In the normal queue cleanup path, hctx is released after the request queue
is freed; see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
the number of hw queues shrinks. This can easily cause a use-after-free:
one implicit rule is that it is safe to call almost all block
layer APIs while the request queue is alive, so an hctx may be retrieved
by one API and then freed by blk_mq_update_nr_hw_queues(), after which
the use-after-free is triggered.
Fix this issue by always freeing hctx after releasing the request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to reuse these hctxs if the NUMA
node matches.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
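The "park, then reuse on matching NUMA node" idea from the commit message above can be sketched with a plain singly linked list. This is a minimal illustration under stated assumptions: the `sketch_*` types and functions are hypothetical, and the locking that the real per-queue list needs (see `unused_hctx_lock`) is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's hctx or queue structures. */
struct sketch_hctx {
	int numa_node;
	struct sketch_hctx *next;
};

struct sketch_hctx_queue {
	struct sketch_hctx *unused; /* parked hctxs, freed only with the queue */
};

/* Instead of freeing a removed hctx, park it on the per-queue list. */
static void sketch_park_hctx(struct sketch_hctx_queue *q,
			     struct sketch_hctx *h)
{
	h->next = q->unused;
	q->unused = h;
}

/* Take a parked hctx for the given node, or NULL if none matches. */
static struct sketch_hctx *sketch_reuse_hctx(struct sketch_hctx_queue *q,
					     int node)
{
	struct sketch_hctx **pp;

	for (pp = &q->unused; *pp; pp = &(*pp)->next) {
		if ((*pp)->numa_node == node) {
			struct sketch_hctx *h = *pp;

			*pp = h->next; /* unlink from the unused list */
			h->next = NULL;
			return h;
		}
	}
	return NULL;
}
```

A caller that gets NULL back would allocate a fresh hctx; either way, nothing is freed while the queue is still alive.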
2019-04-30 09:52:27 +08:00
	/*
	 * for reusing dead hctx instance in case of updating
	 * nr_hw_queues
	 */
	struct list_head unused_hctx_list;
	spinlock_t unused_hctx_lock;

2019-05-21 11:25:55 +08:00
	int mq_freeze_depth;
2012-03-06 05:14:58 +08:00

2007-07-09 18:40:35 +08:00
#if defined(CONFIG_BLK_DEV_BSG)
	struct bsg_class_device bsg_dev;
#endif
2010-09-16 05:06:35 +08:00

#ifdef CONFIG_BLK_DEV_THROTTLING
	/* Throttle data */
	struct throtl_data *td;
#endif
2013-01-10 00:05:13 +08:00
	struct rcu_head rcu_head;
2013-10-24 16:20:05 +08:00
	wait_queue_head_t mq_freeze_wq;
2019-05-21 11:25:55 +08:00
	/*
	 * Protect concurrent access to q_usage_counter by
	 * percpu_ref_kill() and percpu_ref_reinit().
	 */
	struct mutex mq_freeze_lock;
2015-10-22 01:20:12 +08:00
	struct percpu_ref q_usage_counter;
2014-05-14 05:10:52 +08:00

	struct blk_mq_tag_set *tag_set;
	struct list_head tag_set_list;
2018-05-21 06:25:47 +08:00
	struct bio_set bio_split;
2015-09-27 01:09:20 +08:00

2017-02-01 06:53:18 +08:00
#ifdef CONFIG_BLK_DEBUG_FS
2017-01-26 00:06:40 +08:00
	struct dentry *debugfs_dir;
2017-05-04 22:24:40 +08:00
	struct dentry *sched_debugfs_dir;
2018-12-17 09:46:00 +08:00
	struct dentry *rqos_debugfs_dir;
2017-01-26 00:06:40 +08:00
#endif

2015-09-27 01:09:20 +08:00
	bool mq_sysfs_init_done;
2017-01-28 00:51:45 +08:00

	size_t cmd_size;
2017-06-15 03:27:50 +08:00

	struct work_struct release_work;
2017-06-26 22:15:27 +08:00

#define BLK_MAX_WRITE_HINTS 5
	u64 write_hints[BLK_MAX_WRITE_HINTS];
2005-04-17 06:20:36 +08:00
};

2019-02-10 06:42:07 +08:00
#define QUEUE_FLAG_STOPPED 0 /* queue is stopped */
#define QUEUE_FLAG_DYING 1 /* queue being torn down */
#define QUEUE_FLAG_NOMERGES 3 /* disable merge attempts */
#define QUEUE_FLAG_SAME_COMP 4 /* complete on same CPU-group */
#define QUEUE_FLAG_FAIL_IO 5 /* fake timeout */
#define QUEUE_FLAG_NONROT 6 /* non-rotational device (SSD) */
#define QUEUE_FLAG_VIRT QUEUE_FLAG_NONROT /* paravirt device */
#define QUEUE_FLAG_IO_STAT 7 /* do disk/partitions IO accounting */
#define QUEUE_FLAG_DISCARD 8 /* supports DISCARD */
#define QUEUE_FLAG_NOXMERGES 9 /* No extended merges */
#define QUEUE_FLAG_ADD_RANDOM 10 /* Contributes to random pool */
#define QUEUE_FLAG_SECERASE 11 /* supports secure erase */
#define QUEUE_FLAG_SAME_FORCE 12 /* force complete on same CPU */
#define QUEUE_FLAG_DEAD 13 /* queue tear-down finished */
#define QUEUE_FLAG_INIT_DONE 14 /* queue is initialized */
#define QUEUE_FLAG_POLL 16 /* IO polling enabled if set */
#define QUEUE_FLAG_WC 17 /* Write back caching */
#define QUEUE_FLAG_FUA 18 /* device supports FUA writes */
#define QUEUE_FLAG_DAX 19 /* device supports DAX */
#define QUEUE_FLAG_STATS 20 /* track IO start and completion times */
#define QUEUE_FLAG_POLL_STATS 21 /* collecting stats for hybrid polling */
#define QUEUE_FLAG_REGISTERED 22 /* queue has been registered to a disk */
#define QUEUE_FLAG_SCSI_PASSTHROUGH 23 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED 24 /* queue has been quiesced */
#define QUEUE_FLAG_PCI_P2PDMA 25 /* device supports PCI p2p requests */
2019-08-02 01:26:35 +08:00
#define QUEUE_FLAG_ZONE_RESETALL 26 /* supports Zone Reset All */
2019-08-29 06:05:57 +08:00
#define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */
2006-01-06 16:51:03 +08:00

2013-11-20 00:25:07 +08:00
#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
2018-12-05 21:50:40 +08:00
	(1 << QUEUE_FLAG_SAME_COMP))
2013-11-20 00:25:07 +08:00

2018-03-08 09:10:04 +08:00
void blk_queue_flag_set(unsigned int flag, struct request_queue *q);
void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
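Note that the QUEUE_FLAG_* values above are bit numbers, not masks, which is why QUEUE_FLAG_MQ_DEFAULT shifts them. A user-space sketch of the three helpers follows; the `sketch_*` names are hypothetical, and the plain (non-atomic) bit operations are an assumption of this sketch, not a statement about how the kernel implements the real helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers copied from the list above for two of the flags. */
#define SKETCH_FLAG_SAME_COMP 4
#define SKETCH_FLAG_IO_STAT 7

/* A default mask is built by shifting each bit number into place. */
#define SKETCH_MQ_DEFAULT ((1UL << SKETCH_FLAG_IO_STAT) | \
			   (1UL << SKETCH_FLAG_SAME_COMP))

struct sketch_queue {
	unsigned long queue_flags;
};

static bool sketch_flag_test(unsigned int flag, const struct sketch_queue *q)
{
	return (q->queue_flags >> flag) & 1UL;
}

static void sketch_flag_set(unsigned int flag, struct sketch_queue *q)
{
	q->queue_flags |= 1UL << flag;
}

static void sketch_flag_clear(unsigned int flag, struct sketch_queue *q)
{
	q->queue_flags &= ~(1UL << flag);
}

/* Sets the flag and reports whether it was already set. */
static bool sketch_flag_test_and_set(unsigned int flag, struct sketch_queue *q)
{
	bool was_set = sketch_flag_test(flag, q);

	sketch_flag_set(flag, q);
	return was_set;
}
```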

2005-04-17 06:20:36 +08:00
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
2012-11-28 20:42:38 +08:00
#define blk_queue_dying(q) test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
2012-12-06 21:32:01 +08:00
#define blk_queue_dead(q) test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)
2013-10-24 16:20:05 +08:00
#define blk_queue_init_done(q)	test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags)
#define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
#define blk_queue_noxmerges(q)	\
	test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
#define blk_queue_nonrot(q)	test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
#define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
#define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
#define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
#define blk_queue_zone_resetall(q)	\
	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
#define blk_queue_secure_erase(q) \
	(test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags))
#define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
#define blk_queue_scsi_passthrough(q)	\
	test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
#define blk_queue_pci_p2pdma(q)	\
	test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
#ifdef CONFIG_BLK_RQ_ALLOC_TIME
#define blk_queue_rq_alloc_time(q)	\
	test_bit(QUEUE_FLAG_RQ_ALLOC_TIME, &(q)->queue_flags)
#else
#define blk_queue_rq_alloc_time(q)	false
#endif

#define blk_noretry_request(rq) \
	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
			     REQ_FAILFAST_DRIVER))
#define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
#define blk_queue_pm_only(q)	atomic_read(&(q)->pm_only)
#define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
#define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)

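Most of the queue predicates above share one pattern: a flag index is tested against the queue's `queue_flags` word. The following is a minimal userspace sketch of that pattern, not kernel code. The `demo_queue` struct, the `DEMO_FLAG_*` indices and the non-atomic `demo_test_bit()`/`demo_set_bit()` helpers are all hypothetical stand-ins; the real `QUEUE_FLAG_*` values and the kernel's `test_bit()` are defined elsewhere.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical flag indices; the real QUEUE_FLAG_* values differ. */
enum { DEMO_FLAG_NOMERGES = 0, DEMO_FLAG_NONROT = 1, DEMO_FLAG_FUA = 2 };

struct demo_queue {
	unsigned long queue_flags;
};

/* Simplified, non-atomic stand-in for test_bit() on a single word. */
static inline bool demo_test_bit(unsigned int nr, const unsigned long *addr)
{
	assert(nr < sizeof(*addr) * CHAR_BIT);
	return (*addr >> nr) & 1UL;
}

/* Simplified, non-atomic stand-in for setting one flag bit. */
static inline void demo_set_bit(unsigned int nr, unsigned long *addr)
{
	assert(nr < sizeof(*addr) * CHAR_BIT);
	*addr |= 1UL << nr;
}

/* One predicate macro per flag, mirroring the shape of blk_queue_*(). */
#define demo_queue_nomerges(q)	demo_test_bit(DEMO_FLAG_NOMERGES, &(q)->queue_flags)
#define demo_queue_nonrot(q)	demo_test_bit(DEMO_FLAG_NONROT, &(q)->queue_flags)
#define demo_queue_fua(q)	demo_test_bit(DEMO_FLAG_FUA, &(q)->queue_flags)
```

Wrapping each flag in its own macro keeps call sites readable and leaves the flag storage free to change without touching callers.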
extern void blk_set_pm_only(struct request_queue *q);
extern void blk_clear_pm_only(struct request_queue *q);

static inline bool blk_account_rq(struct request *rq)
{
	return (rq->rq_flags & RQF_STARTED) && !blk_rq_is_passthrough(rq);
}

#define list_entry_rq(ptr)	list_entry((ptr), struct request, queuelist)

#define rq_data_dir(rq)		(op_is_write(req_op(rq)) ? WRITE : READ)

#define rq_dma_dir(rq) \
	(op_is_write(req_op(rq)) ? DMA_TO_DEVICE : DMA_FROM_DEVICE)

#define dma_map_bvec(dev, bv, dir, attrs) \
	dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \
	(dir), (attrs))

static inline bool queue_is_mq(struct request_queue *q)
{
	return q->mq_ops;
}

static inline enum blk_zoned_model
blk_queue_zoned_model(struct request_queue *q)
{
	return q->limits.zoned;
}

static inline bool blk_queue_is_zoned(struct request_queue *q)
{
	switch (blk_queue_zoned_model(q)) {
	case BLK_ZONED_HA:
	case BLK_ZONED_HM:
		return true;
	default:
		return false;
	}
}

static inline sector_t blk_queue_zone_sectors(struct request_queue *q)
{
	return blk_queue_is_zoned(q) ? q->limits.chunk_sectors : 0;
}

#ifdef CONFIG_BLK_DEV_ZONED
static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
{
	return blk_queue_is_zoned(q) ? q->nr_zones : 0;
}

static inline unsigned int blk_queue_zone_no(struct request_queue *q,
					     sector_t sector)
{
	if (!blk_queue_is_zoned(q))
		return 0;
	return sector >> ilog2(q->limits.chunk_sectors);
}

static inline bool blk_queue_zone_is_seq(struct request_queue *q,
					 sector_t sector)
{
	if (!blk_queue_is_zoned(q))
		return false;
	if (!q->conv_zones_bitmap)
		return true;
	return !test_bit(blk_queue_zone_no(q, sector), q->conv_zones_bitmap);
}
#else /* CONFIG_BLK_DEV_ZONED */
static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
{
	return 0;
}
#endif /* CONFIG_BLK_DEV_ZONED */

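The zone helpers above turn a sector into a zone index with a right shift by the log2 of the zone size, then consult a bitmap of conventional zones. The following is a rough userspace sketch of that arithmetic. It assumes a power-of-two zone size, as the shift implies, and uses hypothetical names (`demo_ilog2`, `demo_zone_no`, `demo_zone_is_seq`) plus a plain byte-array bitmap instead of the kernel's `test_bit()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_sector_t;

/* Integer log2 of a power-of-two value (stand-in for the kernel's ilog2()). */
static unsigned int demo_ilog2(uint64_t v)
{
	unsigned int shift = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v >>= 1)
		shift++;
	return shift;
}

/* Zone index of a sector when every zone spans zone_sectors sectors. */
static unsigned int demo_zone_no(demo_sector_t sector, uint64_t zone_sectors)
{
	return (unsigned int)(sector >> demo_ilog2(zone_sectors));
}

/*
 * A zone is sequential unless its bit is set in the conventional-zone
 * bitmap; a NULL bitmap means there are no conventional zones at all.
 */
static bool demo_zone_is_seq(demo_sector_t sector, uint64_t zone_sectors,
			     const unsigned char *conv_bitmap)
{
	unsigned int zno;

	if (!conv_bitmap)
		return true;
	zno = demo_zone_no(sector, zone_sectors);
	return !((conv_bitmap[zno / 8] >> (zno % 8)) & 1);
}
```

With 524288-sector zones (256 MiB at 512 bytes per sector), sector 524288 is the first sector of zone 1.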
static inline bool rq_is_sync(struct request *rq)
{
	return op_is_sync(rq->cmd_flags);
}

static inline bool rq_mergeable(struct request *rq)
{
	if (blk_rq_is_passthrough(rq))
		return false;

	if (req_op(rq) == REQ_OP_FLUSH)
		return false;

	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
		return false;

	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
		return false;
	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
		return false;

	return true;
}

static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
{
	if (bio_page(a) == bio_page(b) &&
	    bio_offset(a) == bio_offset(b))
		return true;

	return false;
}

static inline unsigned int blk_queue_depth(struct request_queue *q)
{
	if (q->queue_depth)
		return q->queue_depth;

	return q->nr_requests;
}

extern unsigned long blk_max_low_pfn, blk_max_pfn;

/*
 * standard bounce addresses:
 *
 * BLK_BOUNCE_HIGH	: bounce all highmem pages
 * BLK_BOUNCE_ANY	: don't bounce anything
 * BLK_BOUNCE_ISA	: bounce pages above ISA DMA boundary
 */

#if BITS_PER_LONG == 32
#define BLK_BOUNCE_HIGH		((u64)blk_max_low_pfn << PAGE_SHIFT)
#else
#define BLK_BOUNCE_HIGH		-1ULL
#endif
#define BLK_BOUNCE_ANY		(-1ULL)
#define BLK_BOUNCE_ISA		(DMA_BIT_MASK(24))

/*
 * default timeout for SG_IO if none specified
 */
#define BLK_DEFAULT_SG_TIMEOUT	(60 * HZ)
#define BLK_MIN_SG_TIMEOUT	(7 * HZ)

struct rq_map_data {
	struct page **pages;
	int page_order;
	int nr_entries;
	unsigned long offset;
	int null_mapped;
	int from_user;
};

struct req_iterator {
	struct bvec_iter iter;
	struct bio *bio;
};

/* This should not be used directly - use rq_for_each_segment */
#define for_each_bio(_bio)		\
	for (; _bio; _bio = _bio->bi_next)
#define __rq_for_each_bio(_bio, rq)	\
	if ((rq)->bio)			\
		for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)

#define rq_for_each_segment(bvl, _rq, _iter)			\
	__rq_for_each_bio(_iter.bio, _rq)			\
		bio_for_each_segment(bvl, _iter.bio, _iter.iter)

#define rq_for_each_bvec(bvl, _rq, _iter)			\
	__rq_for_each_bio(_iter.bio, _rq)			\
		bio_for_each_bvec(bvl, _iter.bio, _iter.iter)

#define rq_iter_last(bvec, _iter)				\
		(_iter.bio->bi_next == NULL &&			\
		 bio_iter_last(bvec, _iter.iter))

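The iteration macros above work by stacking two loop heads: an outer walk over the request's bio chain and an inner walk over each bio's segments, so that a single statement after the macro becomes the body of both. The following is a small userspace sketch of the same composition. Every `demo_*` type and macro is hypothetical, and the flat array of segment lengths is a deliberate simplification of the kernel's `bio_vec` iteration.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bio {
	struct demo_bio *bi_next;
	const unsigned int *seg_len;	/* lengths of this bio's segments */
	unsigned int nr_segs;
};

struct demo_request {
	struct demo_bio *bio;
};

struct demo_iter {
	struct demo_bio *bio;
	unsigned int idx;
};

/* Walk the bio chain of a request; does nothing when there is no bio. */
#define demo_rq_for_each_bio(_bio, rq)	\
	if ((rq)->bio)			\
		for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)

/* Inner loop over the segments of one bio. */
#define demo_bio_for_each_segment(len, _bio, _idx)			\
	for (_idx = 0;							\
	     _idx < (_bio)->nr_segs && ((len) = (_bio)->seg_len[_idx], 1); \
	     _idx++)

/* Outer and inner loop heads composed into a single statement prefix. */
#define demo_rq_for_each_segment(len, _rq, _iter)		\
	demo_rq_for_each_bio(_iter.bio, _rq)			\
		demo_bio_for_each_segment(len, _iter.bio, _iter.idx)

/* Sum the lengths of all segments of all bios in a request. */
static unsigned int demo_rq_total_bytes(struct demo_request *rq)
{
	struct demo_iter iter;
	unsigned int len, total = 0;

	demo_rq_for_each_segment(len, rq, iter)
		total += len;
	return total;
}
```

Keeping the iterator state in one struct lets the caller declare a single variable for both loop levels.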
#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
# error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform"
#endif
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
extern void rq_flush_dcache_pages(struct request *rq);
#else
static inline void rq_flush_dcache_pages(struct request *rq)
{
}
#endif

extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
extern blk_qc_t generic_make_request(struct bio *bio);
extern blk_qc_t direct_make_request(struct bio *bio);
extern void blk_rq_init(struct request_queue *q, struct request *rq);
extern void blk_put_request(struct request *);
extern struct request *blk_get_request(struct request_queue *, unsigned int op,
				       blk_mq_req_flags_t flags);
extern int blk_lld_busy(struct request_queue *q);
extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
			     struct bio_set *bs, gfp_t gfp_mask,
			     int (*bio_ctr)(struct bio *, struct bio *, void *),
			     void *data);
extern void blk_rq_unprep_clone(struct request *rq);
extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
				     struct request *rq);
extern int blk_rq_append_bio(struct request *rq, struct bio **bio);
extern void blk_queue_split(struct request_queue *, struct bio **);
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
			      unsigned int, void __user *);
extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
			  unsigned int, void __user *);
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
			 struct scsi_ioctl_command __user *);
extern int get_sg_io_hdr(struct sg_io_hdr *hdr, const void __user *argp);
extern int put_sg_io_hdr(const struct sg_io_hdr *hdr, void __user *argp);

extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
extern void blk_queue_exit(struct request_queue *q);
extern void blk_sync_queue(struct request_queue *q);
extern int blk_rq_map_user(struct request_queue *, struct request *,
			   struct rq_map_data *, void __user *, unsigned long,
			   gfp_t);
extern int blk_rq_unmap_user(struct bio *);
extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
			       struct rq_map_data *, const struct iov_iter *,
			       gfp_t);
extern void blk_execute_rq(struct request_queue *, struct gendisk *,
			  struct request *, int);
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
				  struct request *, int, rq_end_io_fn *);

/* Helper to convert REQ_OP_XXX to its string format XXX */
extern const char *blk_op_str(unsigned int op);

int blk_status_to_errno(blk_status_t status);
blk_status_t errno_to_blk_status(int errno);

int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin);

static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
	return bdev->bd_disk->queue;	/* this is never NULL */
}

/*
 * The basic unit of block I/O is a sector. It is used in a number of contexts
 * in Linux (blk, bio, genhd). The size of one sector is 512 = 2**9
 * bytes. Variables of type sector_t represent an offset or size that is a
 * multiple of 512 bytes. Hence these two constants.
 */
#ifndef SECTOR_SHIFT
#define SECTOR_SHIFT 9
#endif
#ifndef SECTOR_SIZE
#define SECTOR_SIZE (1 << SECTOR_SHIFT)
#endif

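Because a sector is 2**9 bytes, converting between byte counts and sector counts is a shift in either direction. The following is a minimal userspace sketch of both conversions. The `DEMO_*` constants mirror the two defines above, and the helper names are hypothetical rather than kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9
#define DEMO_SECTOR_SIZE  (1 << DEMO_SECTOR_SHIFT)

static_assert(DEMO_SECTOR_SIZE == 512, "one sector is 512 bytes");

/* Whole sectors contained in a byte count (rounds down). */
static inline uint64_t demo_bytes_to_sectors(uint64_t bytes)
{
	return bytes >> DEMO_SECTOR_SHIFT;
}

/* Byte offset of the first byte of a sector. */
static inline uint64_t demo_sectors_to_bytes(uint64_t sectors)
{
	return sectors << DEMO_SECTOR_SHIFT;
}
```

Note that the byte-to-sector direction truncates, so a 1023-byte length covers only one whole sector.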
/*
 * blk_rq_pos()			: the current sector
 * blk_rq_bytes()		: bytes left in the entire request
 * blk_rq_cur_bytes()		: bytes left in the current segment
 * blk_rq_err_bytes()		: bytes left till the next error boundary
 * blk_rq_sectors()		: sectors left in the entire request
 * blk_rq_cur_sectors()		: sectors left in the current segment
 * blk_rq_stats_sectors()	: sectors of the entire request used for stats
 */
static inline sector_t blk_rq_pos(const struct request *rq)
{
	return rq->__sector;
}

static inline unsigned int blk_rq_bytes(const struct request *rq)
{
	return rq->__data_len;
}

static inline int blk_rq_cur_bytes(const struct request *rq)
{
	return rq->bio ? bio_cur_bytes(rq->bio) : 0;
}

extern unsigned int blk_rq_err_bytes(const struct request *rq);

static inline unsigned int blk_rq_sectors(const struct request *rq)
{
	return blk_rq_bytes(rq) >> SECTOR_SHIFT;
}

static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
{
	return blk_rq_cur_bytes(rq) >> SECTOR_SHIFT;
}

static inline unsigned int blk_rq_stats_sectors(const struct request *rq)
{
	return rq->stats_sectors;
}

#ifdef CONFIG_BLK_DEV_ZONED
static inline unsigned int blk_rq_zone_no(struct request *rq)
{
	return blk_queue_zone_no(rq->q, blk_rq_pos(rq));
}

static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
{
	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
}
#endif /* CONFIG_BLK_DEV_ZONED */

/*
 * Some commands like WRITE SAME have a payload or data transfer size which
 * is different from the size of the request. Any driver that supports such
 * commands using the RQF_SPECIAL_PAYLOAD flag needs to use this helper to
 * calculate the data transfer size.
 */
static inline unsigned int blk_rq_payload_bytes(struct request *rq)
{
	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
		return rq->special_vec.bv_len;
	return blk_rq_bytes(rq);
}

/*
 * Return the first full biovec in the request. The caller needs to check that
 * there are any bvecs before calling this helper.
 */
static inline struct bio_vec req_bvec(struct request *rq)
{
	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
		return rq->special_vec;
	return mp_bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter);
}

static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
						     int op)
{
	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
		return min(q->limits.max_discard_sectors,
			   UINT_MAX >> SECTOR_SHIFT);

	if (unlikely(op == REQ_OP_WRITE_SAME))
		return q->limits.max_write_same_sectors;

	if (unlikely(op == REQ_OP_WRITE_ZEROES))
		return q->limits.max_write_zeroes_sectors;

	return q->limits.max_sectors;
}

/*
 * Return maximum size of a request at given offset. Only valid for
 * file system requests.
 */
static inline unsigned int blk_max_size_offset(struct request_queue *q,
					       sector_t offset)
{
	if (!q->limits.chunk_sectors)
		return q->limits.max_sectors;

	return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors -
			(offset & (q->limits.chunk_sectors - 1))));
}

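The chunk arithmetic in `blk_max_size_offset()` can be read as "how many sectors remain before the next chunk boundary, capped by the queue's general limit". The following is a userspace sketch of the same calculation on plain integers. The function and parameter names are hypothetical, and the power-of-two assertion is an assumption inferred from the `offset & (chunk_sectors - 1)` mask rather than something this header states.

```c
#include <assert.h>

/* Smaller of two unsigned values. */
static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Largest request, in sectors, that may start at `offset` without crossing
 * a chunk boundary. chunk_sectors == 0 means chunking is not in use.
 */
static unsigned int demo_max_size_offset(unsigned int max_sectors,
					 unsigned int chunk_sectors,
					 unsigned long long offset)
{
	if (!chunk_sectors)
		return max_sectors;

	/* The mask below only works for power-of-two chunk sizes. */
	assert((chunk_sectors & (chunk_sectors - 1)) == 0);

	return demo_min(max_sectors,
			(unsigned int)(chunk_sectors -
				       (offset & (chunk_sectors - 1))));
}
```

For example, with 256-sector chunks a request starting at sector 200 can cover at most 56 sectors before it would cross into the next chunk.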
static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
						  sector_t offset)
{
	struct request_queue *q = rq->q;

	if (blk_rq_is_passthrough(rq))
		return q->limits.max_hw_sectors;

	if (!q->limits.chunk_sectors ||
	    req_op(rq) == REQ_OP_DISCARD ||
	    req_op(rq) == REQ_OP_SECURE_ERASE)
		return blk_queue_get_max_sectors(q, req_op(rq));

	return min(blk_max_size_offset(q, offset),
			blk_queue_get_max_sectors(q, req_op(rq)));
}

static inline unsigned int blk_rq_count_bios(struct request *rq)
{
	unsigned int nr_bios = 0;
	struct bio *bio;

	__rq_for_each_bio(bio, rq)
		nr_bios++;

	return nr_bios;
}

void blk_steal_bios(struct bio_list *list, struct request *rq);

/*
 * Request completion related functions.
 *
 * blk_update_request() completes given number of bytes and updates
 * the request without completing it.
 */
extern bool blk_update_request(struct request *rq, blk_status_t error,
			       unsigned int nr_bytes);

extern void __blk_complete_request(struct request *);
extern void blk_abort_request(struct request *);

/*
 * Access functions for manipulating queue properties
 */
extern void blk_cleanup_queue(struct request_queue *);
extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
extern void blk_queue_bounce_limit(struct request_queue *, u64);
extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_discard_segments(struct request_queue *,
		unsigned short);
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
extern void blk_queue_max_discard_sectors(struct request_queue *q,
		unsigned int max_discard_sectors);
extern void blk_queue_max_write_same_sectors(struct request_queue *q,
		unsigned int max_write_same_sectors);
extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
		unsigned int max_write_zeroes_sectors);
extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
extern void blk_queue_alignment_offset(struct request_queue *q,
				       unsigned int alignment);
extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
extern void blk_set_default_limits(struct queue_limits *lim);
extern void blk_set_stacking_limits(struct queue_limits *lim);
extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
			    sector_t offset);
extern int bdev_stack_limits(struct queue_limits *t, struct block_device *bdev,
			    sector_t offset);
extern void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
			      sector_t offset);
extern void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b);
extern void blk_queue_update_dma_pad(struct request_queue *, unsigned int);
extern int blk_queue_dma_drain(struct request_queue *q,
			       dma_drain_needed_fn *dma_drain_needed,
			       void *buf, unsigned int size);
extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
extern void blk_queue_virt_boundary(struct request_queue *, unsigned long);
extern void blk_queue_dma_alignment(struct request_queue *, int);
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
extern void blk_queue_required_elevator_features(struct request_queue *q,
						 unsigned int features);
extern bool blk_queue_can_use_dma_map_merging(struct request_queue *q,
					      struct device *dev);

/*
 * Number of physical segments as sent to the device.
 *
 * Normally this is the number of discontiguous data segments sent by the
 * submitter. But for data-less commands like discard we might have no
 * actual data segments submitted, but the driver might have to add its
 * own special payload. In that case we still return 1 here so that this
 * special payload will be mapped.
 */
static inline unsigned short blk_rq_nr_phys_segments(struct request *rq)
{
	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
		return 1;
	return rq->nr_phys_segments;
}

/*
 * Number of discard segments (or ranges) the driver needs to fill in.
 * Each discard bio merged into a request is counted as one segment.
 */
static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
{
	return max_t(unsigned short, rq->nr_phys_segments, 1);
}

extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
extern void blk_dump_rq_flags(struct request *, char *);
extern long nr_blockdev_pages(void);

bool __must_check blk_get_queue(struct request_queue *);
struct request_queue *blk_alloc_queue(gfp_t);
struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id);
extern void blk_put_queue(struct request_queue *);
extern void blk_set_queue_dying(struct request_queue *);

/*
 * blk_plug permits building a queue of related requests by holding the I/O
 * fragments for a short period. This allows merging of sequential requests
 * into a single larger request. As the requests are moved from a per-task
 * list to the device's request_queue in a batch, this results in improved
 * scalability as the lock contention for request_queue lock is reduced.
 *
 * It is ok not to disable preemption when adding the request to the plug list
 * or when attempting a merge, because blk_schedule_flush_plug() will only flush
 * the plug list when the task sleeps by itself. For details, please see
 * schedule() where blk_schedule_flush_plug() is called.
 */
struct blk_plug {
	struct list_head mq_list; /* blk-mq requests */
	struct list_head cb_list; /* md requires an unplug callback */
	unsigned short rq_count;
	bool multiple_queues;
};
#define BLK_MAX_REQUEST_COUNT 16
#define BLK_PLUG_FLUSH_SIZE (128 * 1024)

struct blk_plug_cb;
typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool);
struct blk_plug_cb {
	struct list_head list;
	blk_plug_cb_fn callback;
	void *data;
};
extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
					     void *data, int size);
extern void blk_start_plug(struct blk_plug *);
extern void blk_finish_plug(struct blk_plug *);
extern void blk_flush_plug_list(struct blk_plug *, bool);

static inline void blk_flush_plug(struct task_struct *tsk)
{
	struct blk_plug *plug = tsk->plug;

	if (plug)
		blk_flush_plug_list(plug, false);
}

static inline void blk_schedule_flush_plug(struct task_struct *tsk)
{
	struct blk_plug *plug = tsk->plug;

	if (plug)
		blk_flush_plug_list(plug, true);
}

|
|
|
static inline bool blk_needs_flush_plug(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = tsk->plug;
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
return plug &&
|
block: remove dead elevator code
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestion. We also kill anything
related to request allocation, merging, retrieval, and completion.
Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.
Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-30 00:23:51 +08:00
|
|
|
(!list_empty(&plug->mq_list) ||
|
2013-10-24 16:20:05 +08:00
|
|
|
!list_empty(&plug->cb_list));
|
2011-03-08 20:19:51 +08:00
|
|
|
}
|
|
|
|
|
2017-04-06 01:21:08 +08:00
|
|
|
extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
|
|
|
|
extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, struct page *page);
|
2016-07-19 17:23:33 +08:00
|
|
|
|
|
|
|
#define BLKDEV_DISCARD_SECURE (1 << 0) /* issue a secure erase */
|
2010-09-17 02:51:46 +08:00
|
|
|
|
2010-04-28 21:55:06 +08:00
|
|
|
extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
|
2016-04-17 02:55:28 +08:00
|
|
|
extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
|
2016-06-09 22:00:36 +08:00
|
|
|
sector_t nr_sects, gfp_t gfp_mask, int flags,
|
2016-06-06 03:31:49 +08:00
|
|
|
struct bio **biop);
|
2017-04-06 01:21:08 +08:00
|
|
|
|
|
|
|
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
|
2017-04-06 01:21:10 +08:00
|
|
|
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
|
2017-04-06 01:21:08 +08:00
|
|
|
|
2016-12-01 04:28:58 +08:00
|
|
|
extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
|
|
|
|
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
|
2017-04-06 01:21:08 +08:00
|
|
|
unsigned flags);
|
2010-04-28 21:55:09 +08:00
|
|
|
extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
|
2017-04-06 01:21:08 +08:00
|
|
|
sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
|
|
|
|
|
2010-08-18 17:29:10 +08:00
|
|
|
static inline int sb_issue_discard(struct super_block *sb, sector_t block,
|
|
|
|
sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
|
2008-08-06 01:01:53 +08:00
|
|
|
{
|
2018-03-15 06:48:06 +08:00
|
|
|
return blkdev_issue_discard(sb->s_bdev,
|
|
|
|
block << (sb->s_blocksize_bits -
|
|
|
|
SECTOR_SHIFT),
|
|
|
|
nr_blocks << (sb->s_blocksize_bits -
|
|
|
|
SECTOR_SHIFT),
|
2010-08-18 17:29:10 +08:00
|
|
|
gfp_mask, flags);
|
2008-08-06 01:01:53 +08:00
|
|
|
}
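The two shifts in sb_issue_discard() above convert a filesystem block number and a block count into 512-byte sectors. A minimal userspace sketch of that arithmetic, assuming SECTOR_SHIFT is 9; the helper name fs_block_to_sector is hypothetical and not part of the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte sectors, matching the kernel's SECTOR_SHIFT. */
#define SECTOR_SHIFT 9

/*
 * Hypothetical helper (not a kernel function): convert a filesystem block
 * number (or block count) to 512-byte sectors, given log2 of the block size.
 */
static uint64_t fs_block_to_sector(uint64_t block, unsigned int blocksize_bits)
{
	/* A filesystem block is never smaller than one sector. */
	assert(blocksize_bits >= SECTOR_SHIFT);
	return block << (blocksize_bits - SECTOR_SHIFT);
}
```

With 4096-byte blocks (blocksize_bits of 12), each block covers 8 sectors, so block 10 starts at sector 80.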
|
2010-10-28 09:30:04 +08:00
|
|
|
static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
|
2010-10-28 11:44:47 +08:00
|
|
|
sector_t nr_blocks, gfp_t gfp_mask)
|
2010-10-28 09:30:04 +08:00
|
|
|
{
|
|
|
|
return blkdev_issue_zeroout(sb->s_bdev,
|
2018-03-15 06:48:06 +08:00
|
|
|
block << (sb->s_blocksize_bits -
|
|
|
|
SECTOR_SHIFT),
|
|
|
|
nr_blocks << (sb->s_blocksize_bits -
|
|
|
|
SECTOR_SHIFT),
|
2017-04-06 01:21:08 +08:00
|
|
|
gfp_mask, 0);
|
2010-10-28 09:30:04 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-11-05 15:36:31 +08:00
|
|
|
extern int blk_verify_command(unsigned char *cmd, fmode_t mode);
|
2008-06-26 19:48:27 +08:00
|
|
|
|
2010-02-26 13:20:37 +08:00
|
|
|
enum blk_default_limits {
|
|
|
|
BLK_MAX_SEGMENTS = 128,
|
|
|
|
BLK_SAFE_MAX_SECTORS = 255,
|
2015-08-14 02:57:57 +08:00
|
|
|
BLK_DEF_MAX_SECTORS = 2560,
|
2010-02-26 13:20:37 +08:00
|
|
|
BLK_MAX_SEGMENT_SIZE = 65536,
|
|
|
|
BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL,
|
|
|
|
};
|
2008-12-03 19:55:08 +08:00
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned long queue_segment_boundary(const struct request_queue *q)
|
2009-05-23 05:17:50 +08:00
|
|
|
{
|
2009-05-23 05:17:51 +08:00
|
|
|
return q->limits.seg_boundary_mask;
|
2009-05-23 05:17:50 +08:00
|
|
|
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned long queue_virt_boundary(const struct request_queue *q)
|
2015-08-20 05:24:05 +08:00
|
|
|
{
|
|
|
|
return q->limits.virt_boundary_mask;
|
|
|
|
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned int queue_max_sectors(const struct request_queue *q)
|
2009-05-23 05:17:50 +08:00
|
|
|
{
|
2009-05-23 05:17:51 +08:00
|
|
|
return q->limits.max_sectors;
|
2009-05-23 05:17:50 +08:00
|
|
|
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned int queue_max_hw_sectors(const struct request_queue *q)
|
2009-05-23 05:17:50 +08:00
|
|
|
{
|
2009-05-23 05:17:51 +08:00
|
|
|
return q->limits.max_hw_sectors;
|
2009-05-23 05:17:50 +08:00
|
|
|
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned short queue_max_segments(const struct request_queue *q)
|
2009-05-23 05:17:50 +08:00
|
|
|
{
|
2010-02-26 13:20:39 +08:00
|
|
|
return q->limits.max_segments;
|
2009-05-23 05:17:50 +08:00
|
|
|
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned short queue_max_discard_segments(const struct request_queue *q)
|
2017-02-08 21:46:49 +08:00
|
|
|
{
|
|
|
|
return q->limits.max_discard_segments;
|
|
|
|
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned int queue_max_segment_size(const struct request_queue *q)
|
2009-05-23 05:17:50 +08:00
|
|
|
{
|
2009-05-23 05:17:51 +08:00
|
|
|
return q->limits.max_segment_size;
|
2009-05-23 05:17:50 +08:00
|
|
|
}
|
|
|
|
|
2020-01-15 21:35:25 +08:00
|
|
|
static inline unsigned queue_logical_block_size(const struct request_queue *q)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
int retval = 512;
|
|
|
|
|
2009-05-23 05:17:51 +08:00
|
|
|
if (q && q->limits.logical_block_size)
|
|
|
|
retval = q->limits.logical_block_size;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2020-01-15 21:35:25 +08:00
|
|
|
static inline unsigned int bdev_logical_block_size(struct block_device *bdev)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2009-05-23 05:17:49 +08:00
|
|
|
return queue_logical_block_size(bdev_get_queue(bdev));
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned int queue_physical_block_size(const struct request_queue *q)
|
2009-05-23 05:17:53 +08:00
|
|
|
{
|
|
|
|
return q->limits.physical_block_size;
|
|
|
|
}
|
|
|
|
|
2010-10-14 03:18:03 +08:00
|
|
|
static inline unsigned int bdev_physical_block_size(struct block_device *bdev)
|
2009-10-04 02:52:01 +08:00
|
|
|
{
|
|
|
|
return queue_physical_block_size(bdev_get_queue(bdev));
|
|
|
|
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned int queue_io_min(const struct request_queue *q)
|
2009-05-23 05:17:53 +08:00
|
|
|
{
|
|
|
|
return q->limits.io_min;
|
|
|
|
}
|
|
|
|
|
2009-10-04 02:52:01 +08:00
|
|
|
static inline int bdev_io_min(struct block_device *bdev)
{
	return queue_io_min(bdev_get_queue(bdev));
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned int queue_io_opt(const struct request_queue *q)
|
2009-05-23 05:17:53 +08:00
|
|
|
{
|
|
|
|
return q->limits.io_opt;
|
|
|
|
}
|
|
|
|
|
2009-10-04 02:52:01 +08:00
|
|
|
static inline int bdev_io_opt(struct block_device *bdev)
{
	return queue_io_opt(bdev_get_queue(bdev));
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline int queue_alignment_offset(const struct request_queue *q)
|
2009-05-23 05:17:53 +08:00
|
|
|
{
|
2009-10-04 02:52:01 +08:00
|
|
|
if (q->limits.misaligned)
|
2009-05-23 05:17:53 +08:00
|
|
|
return -1;
|
|
|
|
|
2009-10-04 02:52:01 +08:00
|
|
|
return q->limits.alignment_offset;
|
2009-05-23 05:17:53 +08:00
|
|
|
}
|
|
|
|
|
2010-01-11 16:21:51 +08:00
|
|
|
static inline int queue_limit_alignment_offset(struct queue_limits *lim, sector_t sector)
|
2009-12-29 15:35:35 +08:00
|
|
|
{
|
|
|
|
unsigned int granularity = max(lim->physical_block_size, lim->io_min);
|
2018-03-15 06:48:06 +08:00
|
|
|
unsigned int alignment = sector_div(sector, granularity >> SECTOR_SHIFT)
|
|
|
|
<< SECTOR_SHIFT;
|
2009-12-29 15:35:35 +08:00
|
|
|
|
2014-10-09 06:26:13 +08:00
|
|
|
return (granularity + lim->alignment_offset - alignment) % granularity;
|
2009-05-23 05:17:53 +08:00
|
|
|
}
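The arithmetic in queue_limit_alignment_offset() above can be tried outside the kernel. The sketch below is a userspace model, not the kernel function: it takes the relevant queue_limits fields as plain parameters, replaces sector_div() with the % operator (sector_div() returns the remainder), and uses the hypothetical name limit_alignment_offset.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* assumption: 512-byte sectors */

/*
 * Userspace model (hypothetical name): number of bytes from the given start
 * sector to the next naturally aligned boundary of the device.
 */
static unsigned int limit_alignment_offset(unsigned int physical_block_size,
					   unsigned int io_min,
					   unsigned int alignment_offset,
					   uint64_t sector)
{
	unsigned int granularity = physical_block_size > io_min ?
				   physical_block_size : io_min;
	unsigned int alignment;

	/* Added precondition: the granularity must be at least one sector. */
	assert((granularity >> SECTOR_SHIFT) > 0);

	/* Position of the start sector inside one granule, in bytes. */
	alignment = (unsigned int)(sector % (granularity >> SECTOR_SHIFT))
		<< SECTOR_SHIFT;

	return (granularity + alignment_offset - alignment) % granularity;
}
```

For a device with 4096-byte physical blocks and no device-level offset, a partition starting at sector 63 is 512 bytes short of the next 4096-byte boundary, while one starting at sector 2048 is already aligned.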
|
|
|
|
|
2009-10-04 02:52:01 +08:00
|
|
|
static inline int bdev_alignment_offset(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	if (q->limits.misaligned)
		return -1;

	if (bdev != bdev->bd_contains)
		return bdev->bd_part->alignment_offset;

	return q->limits.alignment_offset;
}
|
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline int queue_discard_alignment(const struct request_queue *q)
|
2009-11-10 18:50:21 +08:00
|
|
|
{
|
|
|
|
if (q->limits.discard_misaligned)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
return q->limits.discard_alignment;
|
|
|
|
}
|
|
|
|
|
2010-01-11 16:21:51 +08:00
|
|
|
static inline int queue_limit_discard_alignment(struct queue_limits *lim, sector_t sector)
|
2009-11-10 18:50:21 +08:00
|
|
|
{
|
2012-12-19 23:18:35 +08:00
|
|
|
unsigned int alignment, granularity, offset;
|
2010-01-11 16:21:48 +08:00
|
|
|
|
2011-05-18 16:37:35 +08:00
|
|
|
if (!lim->max_discard_sectors)
|
|
|
|
return 0;
|
|
|
|
|
2012-12-19 23:18:35 +08:00
|
|
|
/* Why are these in bytes, not sectors? */
|
2018-03-15 06:48:06 +08:00
|
|
|
alignment = lim->discard_alignment >> SECTOR_SHIFT;
|
|
|
|
granularity = lim->discard_granularity >> SECTOR_SHIFT;
|
2012-12-19 23:18:35 +08:00
|
|
|
if (!granularity)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Offset of the partition start in 'granularity' sectors */
|
|
|
|
offset = sector_div(sector, granularity);
|
|
|
|
|
|
|
|
/* And why do we do this modulus *again* in blkdev_issue_discard()? */
|
|
|
|
offset = (granularity + alignment - offset) % granularity;
|
|
|
|
|
|
|
|
/* Turn it back into bytes, gaah */
|
2018-03-15 06:48:06 +08:00
|
|
|
return offset << SECTOR_SHIFT;
|
2009-11-10 18:50:21 +08:00
|
|
|
}
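The byte-to-sector-and-back conversion in queue_limit_discard_alignment() above is easier to follow with concrete numbers. The following is a userspace model under stated assumptions: the three limits are passed as plain parameters, sector_div() is replaced by %, and the name limit_discard_alignment is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* assumption: 512-byte sectors */

/*
 * Userspace model (hypothetical name): discard alignment, in bytes, for a
 * partition that starts at the given sector.
 */
static unsigned int limit_discard_alignment(unsigned int max_discard_sectors,
					    unsigned int discard_alignment,
					    unsigned int discard_granularity,
					    uint64_t sector)
{
	unsigned int alignment, granularity, offset;

	if (!max_discard_sectors)
		return 0;	/* device does not support discard */

	/* The limits are stored in bytes; work in sectors. */
	alignment = discard_alignment >> SECTOR_SHIFT;
	granularity = discard_granularity >> SECTOR_SHIFT;
	if (!granularity)
		return 0;	/* granularity below one sector */

	/* Offset of the partition start within one granule, in sectors. */
	offset = (unsigned int)(sector % granularity);
	offset = (granularity + alignment - offset) % granularity;

	/* Convert back to bytes. */
	return offset << SECTOR_SHIFT;
}
```

With a 64 KiB discard granularity (128 sectors) and no device alignment, a partition starting at sector 63 yields (128 - 63) sectors, that is 65 * 512 = 33280 bytes.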
|
|
|
|
|
block: split discard into aligned requests
When a disk has large discard_granularity and small max_discard_sectors,
discards are not split with optimal alignment. In the limit case of
discard_granularity == max_discard_sectors, no request could be aligned
correctly, so in fact you might end up with no discarded logical blocks
at all.
Another example that helps show the condition in the patch is with
discard_granularity == 64, max_discard_sectors == 128. A request that is
submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
However, only 2 aligned blocks out of 3 are included in the request;
128..191 may be left intact and not discarded. With this patch, the
first request will be truncated to ensure good alignment of what's left,
and the split will be 2..127, 128..255, 256..257. The patch will also
take into account the discard_alignment.
At most one extra request will be introduced, because the first request
will be reduced by at most granularity-1 sectors, and granularity
must be less than max_discard_sectors. Subsequent requests will run
on round_down(max_discard_sectors, granularity) sectors, as in the
current code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-02 15:48:50 +08:00
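The splitting rule described in the commit message above can be reproduced with a small userspace sketch. It is modeled on the description only (truncate a request that is followed by more work so that it ends on a granularity boundary); the struct and function names are hypothetical, and the in-kernel code may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one discard request, as an inclusive sector range. */
struct sector_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Split [sector, sector + nr_sects) into discard requests whose boundaries
 * fall on multiples of granularity (plus alignment). All values are in
 * sectors. Returns the number of ranges written to out.
 */
static size_t split_discard(uint64_t sector, uint64_t nr_sects,
			    uint64_t granularity, uint64_t alignment,
			    uint64_t max_discard_sectors,
			    struct sector_range *out, size_t max_out)
{
	size_t n = 0;

	assert(granularity > 0 && max_discard_sectors >= granularity);
	alignment %= granularity;
	/* Full-size requests cover a whole number of granules. */
	max_discard_sectors -= max_discard_sectors % granularity;

	while (nr_sects > 0 && n < max_out) {
		uint64_t req_sects = nr_sects < max_discard_sectors ?
				     nr_sects : max_discard_sectors;
		uint64_t end_sect = sector + req_sects;

		/* More to come and a misaligned end: round the end down. */
		if (req_sects < nr_sects &&
		    end_sect % granularity != alignment) {
			end_sect = (end_sect - alignment) / granularity
				   * granularity + alignment;
			req_sects = end_sect - sector;
		}

		out[n].first = sector;
		out[n].last = end_sect - 1;
		n++;

		nr_sects -= req_sects;
		sector = end_sect;
	}
	return n;
}
```

For the example in the commit message (granularity 64, max_discard_sectors 128, 256 sectors starting at sector 2), this produces the three ranges 2..127, 128..255 and 256..257.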
|
|
|
static inline int bdev_discard_alignment(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	if (bdev != bdev->bd_contains)
		return bdev->bd_part->discard_alignment;

	return q->limits.discard_alignment;
}
|
|
|
|
|
2012-09-19 00:19:27 +08:00
|
|
|
static inline unsigned int bdev_write_same(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	if (q)
		return q->limits.max_write_same_sectors;

	return 0;
}
|
|
|
|
|
2016-12-01 04:28:59 +08:00
|
|
|
static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	if (q)
		return q->limits.max_write_zeroes_sectors;

	return 0;
}
|
|
|
|
|
2016-10-18 14:40:29 +08:00
|
|
|
static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	if (q)
		return blk_queue_zoned_model(q);

	return BLK_ZONED_NONE;
}
|
|
|
|
|
|
|
|
static inline bool bdev_is_zoned(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	if (q)
		return blk_queue_is_zoned(q);

	return false;
}
|
|
|
|
|
2019-07-10 12:53:10 +08:00
|
|
|
static inline sector_t bdev_zone_sectors(struct block_device *bdev)
|
2016-10-18 14:40:33 +08:00
|
|
|
{
|
|
|
|
struct request_queue *q = bdev_get_queue(bdev);
|
|
|
|
|
|
|
|
if (q)
|
2017-01-12 22:58:32 +08:00
|
|
|
return blk_queue_zone_sectors(q);
|
2017-12-21 14:43:38 +08:00
|
|
|
return 0;
|
|
|
|
}
|
2016-10-18 14:40:33 +08:00
|
|
|
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline int queue_dma_alignment(const struct request_queue *q)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2008-01-01 23:23:02 +08:00
|
|
|
return q ? q->dma_alignment : 511;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2010-09-15 19:08:27 +08:00
|
|
|
static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
|
2008-08-28 14:05:58 +08:00
|
|
|
unsigned int len)
|
|
|
|
{
|
|
|
|
unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;
|
2010-09-15 19:08:27 +08:00
|
|
|
return !(addr & alignment) && !(len & alignment);
|
2008-08-28 14:05:58 +08:00
|
|
|
}
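blk_rq_aligned() above treats the DMA alignment and the pad mask as bit masks: a buffer qualifies only if neither its address nor its length has any of the masked low bits set. A userspace sketch of the same test, with the two queue fields passed as plain parameters and hypothetical function names:

```c
#include <assert.h>

/*
 * Userspace model (hypothetical name): returns 1 when both addr and len are
 * aligned to the combined mask, 0 otherwise.
 */
static int buffer_is_aligned(unsigned int dma_alignment,
			     unsigned int dma_pad_mask,
			     unsigned long addr, unsigned int len)
{
	unsigned int alignment = dma_alignment | dma_pad_mask;

	return !(addr & alignment) && !(len & alignment);
}

/* Illustrative checks; 511 is the fallback used by queue_dma_alignment(). */
static void buffer_is_aligned_examples(void)
{
	assert(buffer_is_aligned(511, 0, 0x1000, 4096));  /* both aligned */
	assert(!buffer_is_aligned(511, 0, 0x1004, 4096)); /* address is off */
	assert(!buffer_is_aligned(511, 0, 0x1000, 100));  /* length is off */
	assert(buffer_is_aligned(3, 0, 0x1004, 8));       /* 4-byte mask */
}
```

The masks are of the form 2^n - 1, so OR-ing them selects the stricter of the two requirements.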
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* assumes size > 256 */
static inline unsigned int blksize_bits(unsigned int size)
{
	unsigned int bits = 8;
	do {
		bits++;
		size >>= 1;
	} while (size > 256);
	return bits;
}
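For power-of-two block sizes, blksize_bits() above returns the base-2 logarithm of the size. The standalone copy below uses the hypothetical name blksize_bits_demo and adds an assert() for the "size > 256" precondition stated in the comment; that check is an addition for illustration and is not in the header.

```c
#include <assert.h>

/* Standalone userspace copy (hypothetical name) of the loop shown above. */
static unsigned int blksize_bits_demo(unsigned int size)
{
	unsigned int bits = 8;

	assert(size > 256);	/* added: the header only documents this */
	do {
		bits++;
		size >>= 1;
	} while (size > 256);
	return bits;
}
```

A 512-byte block gives 9 bits, 1024 bytes gives 10, and 4096 bytes gives 12.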
|
|
|
|
|
2005-09-10 15:27:17 +08:00
|
|
|
static inline unsigned int block_size(struct block_device *bdev)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return bdev->bd_block_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
typedef struct {struct page *v;} Sector;
|
|
|
|
|
|
|
|
unsigned char *read_dev_sector(struct block_device *, sector_t, Sector *);
|
|
|
|
|
|
|
|
static inline void put_dev_sector(Sector p)
|
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Switching globally to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(p.v);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2014-04-08 23:15:35 +08:00
|
|
|
int kblockd_schedule_work(struct work_struct *work);
|
2016-08-25 05:52:48 +08:00
|
|
|
int kblockd_schedule_work_on(int cpu, struct work_struct *work);
|
2017-04-10 23:54:55 +08:00
|
|
|
int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
#define MODULE_ALIAS_BLOCKDEV(major,minor) \
	MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
#define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \
	MODULE_ALIAS("block-major-" __stringify(major) "-*")
|
|
|
|
|
2008-07-01 02:04:41 +08:00
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
|
2014-09-27 07:20:02 +08:00
|
|
|
enum blk_integrity_flags {
|
|
|
|
BLK_INTEGRITY_VERIFY = 1 << 0,
|
|
|
|
BLK_INTEGRITY_GENERATE = 1 << 1,
|
2014-09-27 07:20:03 +08:00
|
|
|
BLK_INTEGRITY_DEVICE_CAPABLE = 1 << 2,
|
2014-09-27 07:20:05 +08:00
|
|
|
BLK_INTEGRITY_IP_CHECKSUM = 1 << 3,
|
2014-09-27 07:20:02 +08:00
|
|
|
};
|
2008-07-01 02:04:41 +08:00
|
|
|
|
2014-09-27 07:20:01 +08:00
|
|
|
struct blk_integrity_iter {
|
2008-07-01 02:04:41 +08:00
|
|
|
void *prot_buf;
|
|
|
|
void *data_buf;
|
2014-09-27 07:19:59 +08:00
|
|
|
sector_t seed;
|
2008-07-01 02:04:41 +08:00
|
|
|
unsigned int data_size;
|
2014-09-27 07:19:59 +08:00
|
|
|
unsigned short interval;
|
2008-07-01 02:04:41 +08:00
|
|
|
const char *disk_name;
|
|
|
|
};
|
|
|
|
|
2017-06-03 15:38:06 +08:00
|
|
|
typedef blk_status_t (integrity_processing_fn) (struct blk_integrity_iter *);
|
2019-09-16 23:44:29 +08:00
|
|
|
typedef void (integrity_prepare_fn) (struct request *);
|
|
|
|
typedef void (integrity_complete_fn) (struct request *, unsigned int);
|
2008-07-01 02:04:41 +08:00
|
|
|
|
2015-10-22 01:19:33 +08:00
|
|
|
struct blk_integrity_profile {
|
|
|
|
integrity_processing_fn *generate_fn;
|
|
|
|
integrity_processing_fn *verify_fn;
|
2019-09-16 23:44:29 +08:00
|
|
|
integrity_prepare_fn *prepare_fn;
|
|
|
|
integrity_complete_fn *complete_fn;
|
2015-10-22 01:19:33 +08:00
|
|
|
const char *name;
|
|
|
|
};
|
2008-07-01 02:04:41 +08:00
|
|
|
|
2015-10-22 01:19:49 +08:00
|
|
|
extern void blk_integrity_register(struct gendisk *, struct blk_integrity *);
|
2008-07-01 02:04:41 +08:00
|
|
|
extern void blk_integrity_unregister(struct gendisk *);
|
2008-10-01 15:38:39 +08:00
|
|
|
extern int blk_integrity_compare(struct gendisk *, struct gendisk *);
|
2010-09-11 02:50:10 +08:00
|
|
|
extern int blk_rq_map_integrity_sg(struct request_queue *, struct bio *,
|
|
|
|
struct scatterlist *);
|
|
|
|
extern int blk_rq_count_integrity_sg(struct request_queue *, struct bio *);
|
2014-09-27 07:20:06 +08:00
|
|
|
extern bool blk_integrity_merge_rq(struct request_queue *, struct request *,
|
|
|
|
struct request *);
|
|
|
|
extern bool blk_integrity_merge_bio(struct request_queue *, struct request *,
|
|
|
|
struct bio *);
|
2008-07-01 02:04:41 +08:00
|
|
|
|
2015-10-22 01:19:49 +08:00
|
|
|
static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
|
2008-10-02 18:53:22 +08:00
|
|
|
{
|
2015-10-22 01:20:18 +08:00
|
|
|
struct blk_integrity *bi = &disk->queue->integrity;
|
2015-10-22 01:19:49 +08:00
|
|
|
|
|
|
|
if (!bi->profile)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return bi;
|
2008-10-02 18:53:22 +08:00
|
|
|
}
|
|
|
|
|
2015-10-22 01:19:49 +08:00
|
|
|
static inline
|
|
|
|
struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
|
2008-10-03 00:47:49 +08:00
|
|
|
{
|
2015-10-22 01:19:49 +08:00
|
|
|
return blk_get_integrity(bdev->bd_disk);
|
2008-10-03 00:47:49 +08:00
|
|
|
}
|
|
|
|
|
2014-09-27 07:19:56 +08:00
|
|
|
static inline bool blk_integrity_rq(struct request *rq)
|
2008-07-01 02:04:41 +08:00
|
|
|
{
|
2014-09-27 07:19:56 +08:00
|
|
|
return rq->cmd_flags & REQ_INTEGRITY;
|
2008-07-01 02:04:41 +08:00
|
|
|
}
|
|
|
|
|
2010-09-11 02:50:10 +08:00
|
|
|
static inline void blk_queue_max_integrity_segments(struct request_queue *q,
						    unsigned int segs)
{
	q->limits.max_integrity_segments = segs;
}
|
|
|
|
|
|
|
|
static inline unsigned short
|
2019-08-02 06:50:40 +08:00
|
|
|
queue_max_integrity_segments(const struct request_queue *q)
|
2010-09-11 02:50:10 +08:00
|
|
|
{
|
|
|
|
return q->limits.max_integrity_segments;
|
|
|
|
}
|
|
|
|
|
2018-07-25 22:22:58 +08:00
|
|
|
/**
 * bio_integrity_intervals - Return number of integrity intervals for a bio
 * @bi:		blk_integrity profile for device
 * @sectors:	Size of the bio in 512-byte sectors
 *
 * Description: The block layer calculates everything in 512 byte
 * sectors but integrity metadata is done in terms of the data integrity
 * interval size of the storage device.  Convert the block layer sectors
 * to the appropriate number of integrity intervals.
 */
static inline unsigned int bio_integrity_intervals(struct blk_integrity *bi,
						   unsigned int sectors)
{
	return sectors >> (bi->interval_exp - 9);
}

static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi,
					       unsigned int sectors)
{
	return bio_integrity_intervals(bi, sectors) * bi->tuple_size;
}
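The conversion from 512-byte sectors to integrity intervals, and from intervals to metadata bytes, can be checked with plain numbers. In this userspace sketch the interval_exp and tuple_size fields of struct blk_integrity are passed as parameters, the function names are hypothetical, and the 8-byte tuple size in the examples is only an illustrative assumption.

```c
#include <assert.h>

/*
 * Userspace model (hypothetical name): number of integrity intervals covered
 * by the given number of 512-byte sectors. interval_exp is log2 of the
 * interval size in bytes.
 */
static unsigned int integrity_intervals(unsigned int interval_exp,
					unsigned int sectors)
{
	assert(interval_exp >= 9);	/* added: interval of at least 512 bytes */
	return sectors >> (interval_exp - 9);
}

/* Userspace model (hypothetical name): bytes of integrity metadata needed. */
static unsigned int integrity_bytes(unsigned int interval_exp,
				    unsigned int tuple_size,
				    unsigned int sectors)
{
	return integrity_intervals(interval_exp, sectors) * tuple_size;
}
```

A 4 KiB bio is 8 sectors. With 512-byte intervals (interval_exp of 9) that is 8 intervals and, at 8 bytes per tuple, 64 bytes of metadata; with 4096-byte intervals (interval_exp of 12) it is a single interval and 8 bytes.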
|
|
|
|
|
2019-03-03 23:38:29 +08:00
|
|
|
/*
 * Return the first bvec that contains integrity data.  Only drivers that are
 * limited to a single integrity segment should use this helper.
 */
static inline struct bio_vec *rq_integrity_vec(struct request *rq)
{
	if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1))
		return NULL;
	return rq->bio->bi_integrity->bip_vec;
}
|
|
|
|
|
2008-07-01 02:04:41 +08:00
|
|
|
#else /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2012-01-12 16:17:30 +08:00
|
|
|
struct bio;
struct block_device;
struct gendisk;
struct blk_integrity;

static inline int blk_integrity_rq(struct request *rq)
{
	return 0;
}
static inline int blk_rq_count_integrity_sg(struct request_queue *q,
					    struct bio *b)
{
	return 0;
}
static inline int blk_rq_map_integrity_sg(struct request_queue *q,
					  struct bio *b,
					  struct scatterlist *s)
{
	return 0;
}
|
|
|
|
static inline struct blk_integrity *bdev_get_integrity(struct block_device *b)
|
|
|
|
{
|
2014-10-10 06:30:17 +08:00
|
|
|
return NULL;
|
2012-01-12 16:17:30 +08:00
|
|
|
}
|
|
|
|
static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
{
	return NULL;
}
static inline int blk_integrity_compare(struct gendisk *a, struct gendisk *b)
{
	return 0;
}
|
2015-10-22 01:19:49 +08:00
|
|
|
static inline void blk_integrity_register(struct gendisk *d,
|
2012-01-12 16:17:30 +08:00
|
|
|
struct blk_integrity *b)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void blk_integrity_unregister(struct gendisk *d)
{
}
static inline void blk_queue_max_integrity_segments(struct request_queue *q,
						    unsigned int segs)
{
}
|
2019-08-02 06:50:40 +08:00
|
|
|
static inline unsigned short queue_max_integrity_segments(const struct request_queue *q)
|
2012-01-12 16:17:30 +08:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2014-09-27 07:20:06 +08:00
|
|
|
static inline bool blk_integrity_merge_rq(struct request_queue *rq,
|
|
|
|
struct request *r1,
|
|
|
|
struct request *r2)
|
2012-01-12 16:17:30 +08:00
|
|
|
{
|
2014-10-29 10:27:43 +08:00
|
|
|
return true;
|
2012-01-12 16:17:30 +08:00
|
|
|
}
|
2014-09-27 07:20:06 +08:00
|
|
|
static inline bool blk_integrity_merge_bio(struct request_queue *rq,
|
|
|
|
struct request *r,
|
|
|
|
struct bio *b)
|
2012-01-12 16:17:30 +08:00
|
|
|
{
|
2014-10-29 10:27:43 +08:00
|
|
|
return true;
|
2012-01-12 16:17:30 +08:00
|
|
|
}
|
2015-10-22 01:19:49 +08:00
|
|
|
|
2018-07-25 22:22:58 +08:00
static inline unsigned int bio_integrity_intervals(struct blk_integrity *bi,
						   unsigned int sectors)
{
	return 0;
}

static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi,
					       unsigned int sectors)
{
	return 0;
}
2019-03-03 23:38:29 +08:00
static inline struct bio_vec *rq_integrity_vec(struct request *rq)
{
	return NULL;
}
2008-07-01 02:04:41 +08:00
|
|
|
#endif /* CONFIG_BLK_DEV_INTEGRITY */
|
|
|
|
|
2007-10-09 01:26:20 +08:00
|
|
|
struct block_device_operations {
|
[PATCH] beginning of methods conversion
To keep the size of changesets sane we split the switch by drivers;
to keep the damn thing bisectable we do the following:
1) rename the affected methods, add ones with correct
prototypes, make (few) callers handle both. That's this changeset.
2) for each driver convert to new methods. *ALL* drivers
are converted in this series.
3) kill the old (renamed) methods.
Note that it _is_ a flagday; all in-tree drivers are converted and by the
end of this series no trace of old methods remains. The only reason why
we do that this way is to keep the damn thing bisectable and allow per-driver
debugging if anything goes wrong.
New methods:
open(bdev, mode)
release(disk, mode)
ioctl(bdev, mode, cmd, arg) /* Called without BKL */
compat_ioctl(bdev, mode, cmd, arg)
locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-03-02 22:09:22 +08:00
|
|
|
int (*open) (struct block_device *, fmode_t);
|
2013-05-06 09:52:57 +08:00
|
|
|
void (*release) (struct gendisk *, fmode_t);
|
2018-07-18 19:47:36 +08:00
|
|
|
int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
|
|
|
|
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
|
|
|
|
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
|
implement in-kernel gendisk events handling
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
a single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such command
sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there's any pending event and return if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called in parallel.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs controls polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-09 03:57:37 +08:00
|
|
|
unsigned int (*check_events) (struct gendisk *disk,
|
|
|
|
unsigned int clearing);
|
|
|
|
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
|
2007-10-09 01:26:20 +08:00
|
|
|
int (*media_changed) (struct gendisk *);
|
2010-05-16 02:09:29 +08:00
|
|
|
void (*unlock_native_capacity) (struct gendisk *);
|
2007-10-09 01:26:20 +08:00
|
|
|
int (*revalidate_disk) (struct gendisk *);
|
|
|
|
int (*getgeo)(struct block_device *, struct hd_geometry *);
|
2010-05-17 13:32:43 +08:00
|
|
|
/* this callback is with swap_lock and sometimes page table lock held */
|
|
|
|
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
|
2018-10-12 18:08:49 +08:00
|
|
|
int (*report_zones)(struct gendisk *, sector_t sector,
|
2019-11-11 10:39:30 +08:00
|
|
|
unsigned int nr_zones, report_zones_cb cb, void *data);
|
2007-10-09 01:26:20 +08:00
|
|
|
struct module *owner;
|
2015-10-15 20:10:48 +08:00
|
|
|
const struct pr_ops *pr_ops;
|
2007-10-09 01:26:20 +08:00
|
|
|
};
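The relationship between ->check_events() and the deprecated ->media_changed() can be illustrated with a small userspace sketch. Everything prefixed demo_ or DEMO_ is hypothetical and is not kernel API. The dispatcher assumes a caller that prefers ->check_events() and falls back to ->media_changed() for unconverted drivers, which is one reading of the commit message above rather than the kernel's actual code path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event bits, modelled on the two events named in the commit message. */
#define DEMO_EVENT_MEDIA_CHANGE  (1u << 0)
#define DEMO_EVENT_EJECT_REQUEST (1u << 1)

struct demo_disk {
	int media_swapped;	/* set by the pretend hardware when media changes */
};

/* Reduced method table holding only the two event-related callbacks. */
struct demo_ops {
	unsigned int (*check_events)(struct demo_disk *disk, unsigned int clearing);
	int (*media_changed)(struct demo_disk *disk);	/* deprecated path */
};

/* Prefer ->check_events(); fall back to ->media_changed() for old drivers. */
static unsigned int demo_poll_events(const struct demo_ops *ops,
				     struct demo_disk *disk,
				     unsigned int clearing)
{
	if (ops->check_events)
		return ops->check_events(disk, clearing);
	if (ops->media_changed && ops->media_changed(disk))
		return DEMO_EVENT_MEDIA_CHANGE;
	return 0;
}

/* New-style driver: reports a mask and honours the clearing hint. */
static unsigned int new_check_events(struct demo_disk *disk, unsigned int clearing)
{
	unsigned int ev = disk->media_swapped ? DEMO_EVENT_MEDIA_CHANGE : 0;

	if (clearing & DEMO_EVENT_MEDIA_CHANGE)
		disk->media_swapped = 0;
	return ev;
}

/* Old-style driver: only answers "did the media change?". */
static int old_media_changed(struct demo_disk *disk)
{
	return disk->media_swapped;
}

static const struct demo_ops new_style_ops = { .check_events = new_check_events };
static const struct demo_ops old_style_ops = { .media_changed = old_media_changed };
```

Leaving unused callbacks as NULL in the designated initializers mirrors how a driver fills in only the methods it implements.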
|
|
|
|
|
2019-11-28 22:48:10 +08:00
#ifdef CONFIG_COMPAT
extern int blkdev_compat_ptr_ioctl(struct block_device *, fmode_t,
				   unsigned int, unsigned long);
#else
#define blkdev_compat_ptr_ioctl NULL
#endif

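Defining blkdev_compat_ptr_ioctl as NULL when CONFIG_COMPAT is off lets a driver assign it to a method pointer unconditionally. A minimal userspace sketch of the same idiom follows; the demo_ names, the DEMO_CONFIG_COMPAT switch and the -1 return value are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct demo_compat_ops {
	int (*compat_ioctl)(unsigned int cmd, unsigned long arg);
};

#ifdef DEMO_CONFIG_COMPAT
/* Real helper, only built when the (hypothetical) option is enabled. */
static int demo_compat_ptr_ioctl(unsigned int cmd, unsigned long arg)
{
	(void)arg;
	return (int)cmd;	/* placeholder behaviour */
}
#else
/* Option disabled: the name still exists, but expands to a null pointer. */
#define demo_compat_ptr_ioctl NULL
#endif

/* The driver table needs no #ifdef of its own. */
static const struct demo_compat_ops demo_driver_ops = {
	.compat_ioctl = demo_compat_ptr_ioctl,
};

/* Callers test the pointer before using it. */
static int demo_call_compat(const struct demo_compat_ops *ops,
			    unsigned int cmd, unsigned long arg)
{
	if (!ops->compat_ioctl)
		return -1;	/* hypothetical "not supported" value */
	return ops->compat_ioctl(cmd, arg);
}
```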
2007-08-30 08:34:12 +08:00
extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
				 unsigned long);
2014-06-05 07:07:46 +08:00
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
			   struct writeback_control *);
2017-12-21 14:43:38 +08:00
#ifdef CONFIG_BLK_DEV_ZONED
bool blk_req_needs_zone_write_lock(struct request *rq);
void __blk_req_zone_write_lock(struct request *rq);
void __blk_req_zone_write_unlock(struct request *rq);

static inline void blk_req_zone_write_lock(struct request *rq)
{
	if (blk_req_needs_zone_write_lock(rq))
		__blk_req_zone_write_lock(rq);
}

static inline void blk_req_zone_write_unlock(struct request *rq)
{
	if (rq->rq_flags & RQF_ZONE_WRITE_LOCKED)
		__blk_req_zone_write_unlock(rq);
}

static inline bool blk_req_zone_is_write_locked(struct request *rq)
{
	return rq->q->seq_zones_wlock &&
		test_bit(blk_rq_zone_no(rq), rq->q->seq_zones_wlock);
}

static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
{
	if (!blk_req_needs_zone_write_lock(rq))
		return true;
	return !blk_req_zone_is_write_locked(rq);
}
#else
static inline bool blk_req_needs_zone_write_lock(struct request *rq)
{
	return false;
}

static inline void blk_req_zone_write_lock(struct request *rq)
{
}

static inline void blk_req_zone_write_unlock(struct request *rq)
{
}
static inline bool blk_req_zone_is_write_locked(struct request *rq)
{
	return false;
}

static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
{
	return true;
}
#endif /* CONFIG_BLK_DEV_ZONED */

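The zone write-lock helpers above combine a per-queue bitmap with a per-request flag. The following userspace sketch shows that interplay under simplifying assumptions: every demo_ name is hypothetical, the is_seq_write field stands in for blk_req_needs_zone_write_lock(), holds_lock stands in for RQF_ZONE_WRITE_LOCKED, and no atomicity or locking is modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_ZONES 64
#define DEMO_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS ((DEMO_NR_ZONES + DEMO_BITS_PER_WORD - 1) / DEMO_BITS_PER_WORD)

struct demo_queue {
	unsigned long zones_wlock[DEMO_WORDS];	/* one bit per zone */
};

struct demo_request {
	struct demo_queue *q;
	unsigned int zone_no;
	bool is_seq_write;	/* does this request need the zone lock? */
	bool holds_lock;	/* set while this request owns the zone lock */
};

static bool demo_zone_is_write_locked(const struct demo_request *rq)
{
	unsigned long word = rq->q->zones_wlock[rq->zone_no / DEMO_BITS_PER_WORD];

	return (word >> (rq->zone_no % DEMO_BITS_PER_WORD)) & 1UL;
}

/* Requests that do not need the lock are never held back. */
static bool demo_can_dispatch(const struct demo_request *rq)
{
	if (!rq->is_seq_write)
		return true;
	return !demo_zone_is_write_locked(rq);
}

static void demo_zone_write_lock(struct demo_request *rq)
{
	if (!rq->is_seq_write)
		return;
	rq->q->zones_wlock[rq->zone_no / DEMO_BITS_PER_WORD] |=
		1UL << (rq->zone_no % DEMO_BITS_PER_WORD);
	rq->holds_lock = true;
}

/* Only the request that took the lock may release it. */
static void demo_zone_write_unlock(struct demo_request *rq)
{
	if (!rq->holds_lock)
		return;
	rq->q->zones_wlock[rq->zone_no / DEMO_BITS_PER_WORD] &=
		~(1UL << (rq->zone_no % DEMO_BITS_PER_WORD));
	rq->holds_lock = false;
}
```

Keying the unlock on the per-request flag rather than on the bitmap means a request that never took the lock cannot release a zone owned by another request.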
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 02:45:40 +08:00
|
|
|
#else /* CONFIG_BLOCK */
|
2014-06-05 07:06:27 +08:00

struct block_device;

/*
 * stubs for when the block layer is configured out
 */
#define buffer_heads_over_limit 0

static inline long nr_blockdev_pages(void)
{
	return 0;
}

2011-03-12 03:17:08 +08:00
|
|
|
struct blk_plug {
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline void blk_start_plug(struct blk_plug *plug)
|
2011-03-08 20:19:51 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-03-12 03:17:08 +08:00
|
|
|
static inline void blk_finish_plug(struct blk_plug *plug)
|
2011-03-08 20:19:51 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-03-12 03:17:08 +08:00
|
|
|
static inline void blk_flush_plug(struct task_struct *task)
|
2011-03-08 20:19:51 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-04-16 19:27:55 +08:00
static inline void blk_schedule_flush_plug(struct task_struct *task)
{
}

2011-03-08 20:19:51 +08:00
static inline bool blk_needs_flush_plug(struct task_struct *tsk)
{
	return false;
}

2014-06-05 07:06:27 +08:00
static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
				     sector_t *error_sector)
{
	return 0;
}

|
|
|
#endif /* CONFIG_BLOCK */
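The #else branch above supplies empty static inline stubs so that callers compile unchanged when the block layer is configured out. A minimal sketch of that pattern, assuming a hypothetical DEMO_CONFIG_FEATURE switch and demo_ names that do not exist in the kernel:

```c
#include <assert.h>
#include <stdbool.h>

#ifdef DEMO_CONFIG_FEATURE

/* Feature built in: real state and real behaviour. */
struct demo_plug {
	int pending;
};

static inline void demo_start_plug(struct demo_plug *plug)
{
	plug->pending = 0;
}

static inline bool demo_needs_flush(const struct demo_plug *plug)
{
	return plug->pending > 0;
}

#else /* !DEMO_CONFIG_FEATURE */

/* Feature configured out: same names, no behaviour. */
struct demo_plug {
	char unused;	/* placeholder member; ISO C has no empty structs */
};

static inline void demo_start_plug(struct demo_plug *plug)
{
	(void)plug;
}

static inline bool demo_needs_flush(const struct demo_plug *plug)
{
	(void)plug;
	return false;
}

#endif /* DEMO_CONFIG_FEATURE */

/* The caller contains no #ifdef and builds in either configuration. */
static int demo_submit(struct demo_plug *plug)
{
	demo_start_plug(plug);
	return demo_needs_flush(plug) ? 1 : 0;
}
```

Because the stubs are static inline and do nothing, a compiler can typically drop the calls entirely in the configured-out build.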
|
|
|
|
|
2018-11-14 12:16:54 +08:00
static inline void blk_wake_io_task(struct task_struct *waiter)
{
	/*
	 * If we're polling, the task itself is doing the completions. For
	 * that case, we don't need to signal a wakeup, it's enough to just
	 * mark us as RUNNING.
	 */
	if (waiter == current)
		__set_current_state(TASK_RUNNING);
	else
		wake_up_process(waiter);
}

2005-04-17 06:20:36 +08:00
|
|
|
#endif
|