License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                            # files
---------------------------------------------------|-------
GPL-2.0                                              11139
and resulted in the first patch in this series.
If that file was on a */uapi/* path, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results were:
SPDX license identifier                            # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                        930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                            # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                        270
GPL-2.0+ WITH Linux-syscall-note                       169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)     21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)     17
LGPL-2.1+ WITH Linux-syscall-note                       15
GPL-1.0+ WITH Linux-syscall-note                        14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)     5
LGPL-2.0+ WITH Linux-syscall-note                        4
LGPL-2.1 WITH Linux-syscall-note                         3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)               3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)              1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
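As an illustration of that last step, the choice of identifier and comment
style could be sketched as below. This is not the script Thomas and Greg
used; the helper names are hypothetical. It encodes only the rules stated
above (uapi paths get the syscall note, headers and .c files need different
comment types), with the `//` form for .c files and the `/* */` form for
headers taken from the kernel's documented SPDX convention.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: default identifier for a file with no license text. */
static const char *default_spdx_expr(const char *path)
{
	/* Files on a uapi path carry the syscall note, per the heuristics. */
	if (strstr(path, "/uapi/"))
		return "GPL-2.0 WITH Linux-syscall-note";
	return "GPL-2.0";
}

/* Return non-zero if s ends with suffix. */
static int ends_with(const char *s, const char *suffix)
{
	size_t ls = strlen(s), lx = strlen(suffix);

	return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/* Hypothetical helper: format the tag line in the style the file expects. */
static int format_spdx_line(const char *path, char *buf, size_t size)
{
	const char *expr;

	assert(path && buf);
	expr = default_spdx_expr(path);

	/* Headers use a block comment, .c sources a line comment. */
	if (ends_with(path, ".h"))
		return snprintf(buf, size, "/* SPDX-License-Identifier: %s */", expr);
	return snprintf(buf, size, "// SPDX-License-Identifier: %s", expr);
}
```

Calling `format_spdx_line("block/blk-merge.c", buf, sizeof(buf))` yields the
same first line that this file carries.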
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
2008-01-29 21:04:06 +08:00
/*
 * Functions related to segment and merge handling
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
2021-09-20 20:33:27 +08:00
#include <linux/blk-integrity.h>
2008-01-29 21:04:06 +08:00
#include <linux/scatterlist.h>
2021-11-24 02:53:12 +08:00
#include <linux/part_stat.h>
2022-03-15 08:30:11 +08:00
#include <linux/blk-cgroup.h>
2008-01-29 21:04:06 +08:00

2015-12-03 22:32:30 +08:00
#include <trace/events/block.h>

2008-01-29 21:04:06 +08:00
#include "blk.h"
2021-11-24 02:53:08 +08:00
#include "blk-mq-sched.h"
2020-08-28 10:52:54 +08:00
#include "blk-rq-qos.h"
2021-10-05 23:11:56 +08:00
#include "blk-throttle.h"
2008-01-29 21:04:06 +08:00

2021-10-13 00:18:03 +08:00
static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
{
	*bv = mp_bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
}

static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
{
	struct bvec_iter iter = bio->bi_iter;
	int idx;

	bio_get_first_bvec(bio, bv);
	if (bv->bv_len == bio->bi_iter.bi_size)
		return;		/* this bio only has a single bvec */

	bio_advance_iter(bio, &iter, iter.bi_size);

	if (!iter.bi_bvec_done)
		idx = iter.bi_idx - 1;
	else	/* in the middle of bvec */
		idx = iter.bi_idx;

	*bv = bio->bi_io_vec[idx];

	/*
	 * iter.bi_bvec_done records actual length of the last bvec
	 * if this bio ends in the middle of one io vector
	 */
	if (iter.bi_bvec_done)
		bv->bv_len = iter.bi_bvec_done;
}

2018-09-24 15:43:48 +08:00
static inline bool bio_will_gap(struct request_queue *q,
		struct request *prev_rq, struct bio *prev, struct bio *next)
{
	struct bio_vec pb, nb;

	if (!bio_has_data(prev) || !queue_virt_boundary(q))
		return false;

	/*
	 * Don't merge if the 1st bio starts with non-zero offset, otherwise it
	 * is quite difficult to respect the sg gap limit.  We work hard to
	 * merge a huge number of small single bios in case of mkfs.
	 */
	if (prev_rq)
		bio_get_first_bvec(prev_rq->bio, &pb);
	else
		bio_get_first_bvec(prev, &pb);
2018-11-07 21:58:14 +08:00
	if (pb.bv_offset & queue_virt_boundary(q))
2018-09-24 15:43:48 +08:00
		return true;

	/*
	 * We don't need to worry about the situation that the merged segment
	 * ends in unaligned virt boundary:
	 *
	 * - if 'pb' ends aligned, the merged segment ends aligned
	 * - if 'pb' ends unaligned, the next bio must include
	 *   one single bvec of 'nb', otherwise the 'nb' can't
	 *   merge with 'pb'
	 */
	bio_get_last_bvec(prev, &pb);
	bio_get_first_bvec(next, &nb);
2019-05-21 15:01:42 +08:00
	if (biovec_phys_mergeable(q, &pb, &nb))
2018-09-24 15:43:48 +08:00
		return false;
2022-07-28 00:23:00 +08:00
	return __bvec_gap_to_prev(&q->limits, &pb, nb.bv_offset);
2018-09-24 15:43:48 +08:00
}

static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
{
	return bio_will_gap(req->q, req, req->biotail, bio);
}

static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
{
	return bio_will_gap(req->q, NULL, bio, req->bio);
}
|
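The gap test that bio_will_gap() ends with can be modelled in user space as
follows. This is a sketch under assumptions: the body of
__bvec_gap_to_prev() is not shown in this file, so the check below (the next
vector must start on the boundary and the previous one must end on it) is
what that helper is understood to test, and `struct vec` and `gap_to_prev()`
are hypothetical names. `mask` stands for the virt boundary mask, that is,
the boundary size minus one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the offset/length pair of a bio_vec. */
struct vec {
	unsigned int offset;
	unsigned int len;
};

/*
 * Return true if placing a vector that starts at next_offset right after
 * prev would leave a gap with respect to the boundary described by mask.
 */
static bool gap_to_prev(unsigned long mask, const struct vec *prev,
			unsigned int next_offset)
{
	/* mask must have the form 2^n - 1 */
	assert((mask & (mask + 1)) == 0);

	return (next_offset & mask) || ((prev->offset + prev->len) & mask);
}
```

With a 4 KiB boundary (`mask == 0xfff`), a previous vector covering a full
page followed by a vector at offset 0 has no gap, while a previous vector of
512 bytes does.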
2022-07-28 00:22:59 +08:00
/*
 * The max size one bio can handle is UINT_MAX because bvec_iter.bi_size
 * is defined as 'unsigned int', meantime it has to be aligned with the
 * logical block size, which is the minimum accepted unit by hardware.
 */
2022-10-26 03:17:54 +08:00
static unsigned int bio_allowed_max_sectors(const struct queue_limits *lim)
2022-07-28 00:22:59 +08:00
{
2022-07-28 00:23:00 +08:00
	return round_down(UINT_MAX, lim->logical_block_size) >> SECTOR_SHIFT;
2022-07-28 00:22:59 +08:00
}
|
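To make the arithmetic in bio_allowed_max_sectors() concrete, here is a
small user-space sketch. It assumes a 32-bit `unsigned int`, a sector shift
of 9, and a power-of-two logical block size; `allowed_max_sectors()` is a
hypothetical name, and the mask stands in for the kernel's round_down().

```c
#include <assert.h>
#include <limits.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Largest bio size in sectors that is still a whole number of blocks. */
static unsigned int allowed_max_sectors(unsigned int logical_block_size)
{
	/* the mask below only rounds down correctly for powers of two */
	assert(logical_block_size &&
	       (logical_block_size & (logical_block_size - 1)) == 0);

	return (UINT_MAX & ~(logical_block_size - 1)) >> SECTOR_SHIFT;
}
```

For 512-byte blocks this gives 8388607 sectors; for 4096-byte blocks it
gives 8388600, which is a multiple of 8 sectors as the comment requires.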
2022-10-26 03:17:54 +08:00
static struct bio *bio_split_discard(struct bio *bio,
				     const struct queue_limits *lim,
				     unsigned *nsegs, struct bio_set *bs)
2015-04-24 13:37:18 +08:00
{
	unsigned int max_discard_sectors, granularity;
	sector_t tmp;
	unsigned split_sectors;

2015-10-20 23:13:52 +08:00
	*nsegs = 1;

2022-07-28 00:23:00 +08:00
	granularity = max(lim->discard_granularity >> 9, 1U);
2015-04-24 13:37:18 +08:00

2022-07-28 00:23:00 +08:00
	max_discard_sectors =
		min(lim->max_discard_sectors, bio_allowed_max_sectors(lim));
2015-04-24 13:37:18 +08:00
	max_discard_sectors -= max_discard_sectors % granularity;
2023-12-28 15:55:37 +08:00
	if (unlikely(!max_discard_sectors))
2015-04-24 13:37:18 +08:00
		return NULL;

	if (bio_sectors(bio) <= max_discard_sectors)
		return NULL;

	split_sectors = max_discard_sectors;

	/*
	 * If the next starting sector would be misaligned, stop the discard at
	 * the previous aligned sector.
	 */
2022-07-28 00:23:00 +08:00
	tmp = bio->bi_iter.bi_sector + split_sectors -
		((lim->discard_alignment >> 9) % granularity);
2015-04-24 13:37:18 +08:00
	tmp = sector_div(tmp, granularity);

	if (split_sectors > tmp)
		split_sectors -= tmp;

	return bio_split(bio, split_sectors, GFP_NOIO, bs);
}
|
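The split-point arithmetic in bio_split_discard() can be followed with plain
integers. The sketch below is a user-space model, not kernel code:
`discard_split_sectors()` is a hypothetical name, it returns 0 where the
kernel function returns NULL (no split needed), and it leaves out the clamp
to bio_allowed_max_sectors(). Note that sector_div() leaves the quotient in
its first argument and returns the remainder, so `tmp` ends up holding the
remainder, which is what the `%` below models.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the number of sectors to split off the front of a discard, or 0
 * if the discard fits in one request.
 */
static unsigned int discard_split_sectors(uint64_t start_sector,
					  unsigned int bio_sectors,
					  unsigned int max_discard_sectors,
					  unsigned int granularity_bytes,
					  unsigned int alignment_bytes)
{
	unsigned int granularity = granularity_bytes >> 9;
	unsigned int split_sectors;
	uint64_t rem;

	if (!granularity)
		granularity = 1;

	/* a request must cover a whole number of granules */
	max_discard_sectors -= max_discard_sectors % granularity;
	if (!max_discard_sectors || bio_sectors <= max_discard_sectors)
		return 0;

	split_sectors = max_discard_sectors;

	/* distance of the next start sector past the last aligned sector */
	rem = (start_sector + split_sectors -
	       ((alignment_bytes >> 9) % granularity)) % granularity;
	if (split_sectors > rem)
		split_sectors -= (unsigned int)rem;

	return split_sectors;
}
```

For a discard of 100 sectors starting at sector 5, with a 64-sector limit
and 4096-byte granularity, the first piece is 59 sectors, so the remainder
starts at sector 64, which is granule aligned.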
2022-07-28 00:22:55 +08:00
static struct bio *bio_split_write_zeroes(struct bio *bio,
2022-10-26 03:17:54 +08:00
		const struct queue_limits *lim,
		unsigned *nsegs, struct bio_set *bs)
2017-04-06 01:21:01 +08:00
{
2019-07-03 20:24:35 +08:00
	*nsegs = 0;
2022-07-28 00:23:00 +08:00
	if (!lim->max_write_zeroes_sectors)
2017-04-06 01:21:01 +08:00
		return NULL;
2022-07-28 00:23:00 +08:00
	if (bio_sectors(bio) <= lim->max_write_zeroes_sectors)
2017-04-06 01:21:01 +08:00
		return NULL;
2022-07-28 00:23:00 +08:00
	return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
2017-04-06 01:21:01 +08:00
}

block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary; an atomic write which straddles it is
no longer atomically executed by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic write limits are by default set to 0 to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
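The size reasoning above can be sketched as follows. This is a rough model
of the paragraph, not the kernel's code: the function names are
hypothetical, the values of BIO_MAX_VECS and PAGE_SIZE are assumed for the
example, and rounding the result down to a power of two is an inference
from the requirement that the unit limits be powers of two.

```c
#include <assert.h>

#define BIO_MAX_VECS 256u	/* assumed value for this sketch */
#define PAGE_SIZE 4096u		/* assumed page size */

/* Largest BIO length in bytes that is guaranteed not to need a split. */
static unsigned int max_guaranteed_bio_bytes(unsigned int max_segments,
					     unsigned int logical_block_size)
{
	unsigned int segs = max_segments < BIO_MAX_VECS ?
			    max_segments : BIO_MAX_VECS;
	unsigned int length;

	assert(logical_block_size > 0);

	/* first and last segment: only one logical block each is certain */
	length = (segs < 2 ? segs : 2) * logical_block_size;

	/* every segment in between holds a full page */
	if (segs > 2)
		length += (segs - 2) * PAGE_SIZE;

	return length;
}

/* Largest power of two that is not greater than v (0 for v == 0). */
static unsigned int rounddown_pow2(unsigned int v)
{
	unsigned int p = 1;

	if (!v)
		return 0;
	while (p <= v / 2)
		p *= 2;
	return p;
}
```

With 128 segments and 512-byte logical blocks this gives 2 * 512 + 126 *
4096 = 517120 bytes, which rounds down to 262144 bytes.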
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
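The two merge conditions can be written down as a small predicate. This is
a minimal sketch, assuming byte units and a power-of-two boundary;
`atomic_merge_ok()` and its parameters are hypothetical names and do not
come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if an atomic write of len_a bytes starting at byte offset
 * start may be merged with an adjacent atomic write of len_b bytes.
 * boundary == 0 means the device reports no boundary.
 */
static bool atomic_merge_ok(uint64_t start, uint64_t len_a, uint64_t len_b,
			    uint64_t max_bytes, uint64_t boundary)
{
	uint64_t total = len_a + len_b;

	/* condition 1: total resultant length <= atomic_write_max_bytes */
	if (total > max_bytes)
		return false;

	/* condition 2: first and last byte lie in the same boundary window */
	if (boundary) {
		uint64_t mask = ~(boundary - 1);

		assert((boundary & (boundary - 1)) == 0);
		if ((start & mask) != ((start + total - 1) & mask))
			return false;
	}

	return true;
}
```

Two 4 KiB writes starting at offset 0 merge under an 8 KiB boundary; the
same pair starting at offset 4096 would straddle the boundary at 8192 and
is rejected.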
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 20:53:54 +08:00
static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim,
						bool is_atomic)
2024-06-20 20:53:51 +08:00
{
2024-06-20 20:53:54 +08:00
	/*
	 * chunk_sectors must be a multiple of atomic_write_boundary_sectors if
	 * both non-zero.
	 */
	if (is_atomic && lim->atomic_write_boundary_sectors)
		return lim->atomic_write_boundary_sectors;

2024-06-20 20:53:51 +08:00
	return lim->chunk_sectors;
}

2019-08-02 06:50:44 +08:00
/*
 * Return the maximum number of sectors from the start of a bio that may be
 * submitted as a single request to a block device. If enough sectors remain,
 * align the end to the physical block size. Otherwise align the end to the
 * logical block size. This approach minimizes the number of non-aligned
 * requests that are submitted to a block device if the start of a bio is not
 * aligned to a physical block boundary.
 */
2022-07-28 00:22:55 +08:00
static inline unsigned get_max_io_size(struct bio *bio,
2022-10-26 03:17:54 +08:00
				       const struct queue_limits *lim)
2016-01-23 08:05:33 +08:00
{
2022-07-28 00:23:00 +08:00
	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
2024-06-20 20:53:54 +08:00
	bool is_atomic = bio->bi_opf & REQ_ATOMIC;
	unsigned boundary_sectors = blk_boundary_sectors(lim, is_atomic);
	unsigned max_sectors, start, end;

	/*
	 * We ignore lim->max_sectors for atomic writes because it may be less
	 * than the actual bio size, which we cannot tolerate.
	 */
	if (is_atomic)
		max_sectors = lim->atomic_write_max_sectors;
	else
		max_sectors = lim->max_sectors;
2016-01-23 08:05:33 +08:00

2024-06-20 20:53:51 +08:00
	if (boundary_sectors) {
2022-06-14 17:09:33 +08:00
		max_sectors = min(max_sectors,
2024-06-20 20:53:51 +08:00
			blk_boundary_sectors_left(bio->bi_iter.bi_sector,
					      boundary_sectors));
2022-06-14 17:09:33 +08:00
	}
2016-01-23 08:05:33 +08:00

2022-06-14 17:09:32 +08:00
	start = bio->bi_iter.bi_sector & (pbs - 1);
	end = (start + max_sectors) & ~(pbs - 1);
	if (end > start)
		return end - start;
	return max_sectors & ~(lbs - 1);
2016-01-23 08:05:33 +08:00
}
|
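The alignment logic of get_max_io_size() can be traced with plain integers.
The sketch below is a user-space model under assumptions:
`max_io_sectors()` and `sectors_left()` are hypothetical names, all sizes
are in sectors, and the boundary and block sizes are assumed to be powers of
two (the kernel helper blk_boundary_sectors_left() is not shown in this
file, so the version here only covers that case).

```c
#include <assert.h>
#include <stdint.h>

/* Sectors from sector up to the next multiple of boundary. */
static unsigned int sectors_left(uint64_t sector, unsigned int boundary)
{
	return boundary - (unsigned int)(sector & (boundary - 1));
}

/*
 * Maximum number of sectors to take from a bio starting at sector.
 * pbs and lbs are the physical and logical block sizes in sectors.
 */
static unsigned int max_io_sectors(uint64_t sector, unsigned int max_sectors,
				   unsigned int boundary, unsigned int pbs,
				   unsigned int lbs)
{
	unsigned int start, end;

	assert(pbs && (pbs & (pbs - 1)) == 0);
	assert(lbs && (lbs & (lbs - 1)) == 0);

	/* never cross a chunk or atomic write boundary */
	if (boundary) {
		unsigned int left = sectors_left(sector, boundary);

		if (left < max_sectors)
			max_sectors = left;
	}

	/* prefer an end that is aligned to the physical block size */
	start = (unsigned int)(sector & (pbs - 1));
	end = (start + max_sectors) & ~(pbs - 1);
	if (end > start)
		return end - start;

	/* too short for that: align to the logical block size instead */
	return max_sectors & ~(lbs - 1);
}
```

A bio starting at sector 3 with a 256-sector limit and 8-sector physical
blocks is cut to 253 sectors, so the next request starts at sector 256 on a
physical block boundary.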
block: Micro-optimize get_max_segment_size()
This patch removes a conditional jump from get_max_segment_size(). The
x86-64 assembler code for this function without this patch is as follows:
206 return min_not_zero(mask - offset + 1,
0x0000000000000118 <+72>: not %rax
0x000000000000011b <+75>: and 0x8(%r10),%rax
0x000000000000011f <+79>: add $0x1,%rax
0x0000000000000123 <+83>: je 0x138 <bvec_split_segs+104>
0x0000000000000125 <+85>: cmp %rdx,%rax
0x0000000000000128 <+88>: mov %rdx,%r12
0x000000000000012b <+91>: cmovbe %rax,%r12
0x000000000000012f <+95>: test %rdx,%rdx
0x0000000000000132 <+98>: mov %eax,%edx
0x0000000000000134 <+100>: cmovne %r12d,%edx
With this patch applied:
206 return min(mask - offset, (unsigned long)lim->max_segment_size - 1) + 1;
0x000000000000003f <+63>: mov 0x28(%rdi),%ebp
0x0000000000000042 <+66>: not %rax
0x0000000000000045 <+69>: and 0x8(%rdi),%rax
0x0000000000000049 <+73>: sub $0x1,%rbp
0x000000000000004d <+77>: cmp %rbp,%rax
0x0000000000000050 <+80>: cmova %rbp,%rax
0x0000000000000054 <+84>: add $0x1,%eax
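The two return expressions quoted above can be compared in a small
user-space sketch. Assumptions: `offset` is the part of the address below
the segment boundary, so `offset <= mask`; the function names are
hypothetical; the old form spells out what min_not_zero() is understood to
do (take the minimum while ignoring a zero operand).

```c
#include <assert.h>
#include <limits.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Old form: needs a zero test, hence the conditional jump. */
static unsigned long seg_size_old(unsigned long mask, unsigned long offset,
				  unsigned long max_segment_size)
{
	unsigned long to_boundary;

	assert(offset <= mask);

	/* wraps to 0 when mask is ULONG_MAX and offset is 0 */
	to_boundary = mask - offset + 1;
	if (!to_boundary)
		return max_segment_size;
	if (!max_segment_size)
		return to_boundary;
	return min_ul(to_boundary, max_segment_size);
}

/* New form: subtract 1 before the min and add it back afterwards. */
static unsigned long seg_size_new(unsigned long mask, unsigned long offset,
				  unsigned long max_segment_size)
{
	assert(offset <= mask);

	return min_ul(mask - offset, max_segment_size - 1) + 1;
}
```

Moving the `+ 1` outside the min keeps `mask - offset` from wrapping, which
is why the zero check, and with it the branch, is no longer needed.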
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-4-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-26 03:17:55 +08:00
/**
 * get_max_segment_size() - maximum number of bytes to add as a single segment
 * @lim: Request queue limits.
2024-07-06 15:52:18 +08:00
 * @paddr: address of the range to add
2024-07-09 12:54:32 +08:00
 * @len: maximum length available to add at @paddr
2022-10-26 03:17:55 +08:00
|
|
|
*
|
2024-07-06 15:52:18 +08:00
|
|
|
* Returns the maximum number of bytes of the range starting at @paddr that can
|
|
|
|
* be added to a single segment.
|
2022-10-26 03:17:55 +08:00
|
|
|
*/
|
2022-10-26 03:17:54 +08:00
|
|
|
static inline unsigned get_max_segment_size(const struct queue_limits *lim,
|
2024-07-06 15:52:18 +08:00
|
|
|
phys_addr_t paddr, unsigned int len)
|
2019-02-15 19:13:12 +08:00
|
|
|
{
|
2020-01-11 20:57:43 +08:00
|
|
|
/*
|
2022-10-26 03:17:55 +08:00
|
|
|
* Prevent an overflow if mask = ULONG_MAX and offset = 0 by adding 1
|
|
|
|
* after having calculated the minimum.
|
2020-01-11 20:57:43 +08:00
|
|
|
*/
|
2024-07-06 15:52:18 +08:00
|
|
|
return min_t(unsigned long, len,
|
|
|
|
min(lim->seg_boundary_mask - (lim->seg_boundary_mask & paddr),
|
|
|
|
(unsigned long)lim->max_segment_size - 1) + 1);
|
2019-02-15 19:13:12 +08:00
|
|
|
}
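The overflow-avoidance trick in get_max_segment_size() can be tried out in userspace. The following is only a minimal sketch under stated assumptions: `struct seg_limits`, `min_ul()` and `max_segment_bytes()` are hypothetical stand-ins for the kernel's `struct queue_limits`, `min()` and the function above, and `paddr` is modelled as a plain `unsigned long`.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical userspace stand-in for the two queue limits used here. */
struct seg_limits {
	unsigned long seg_boundary_mask;	/* e.g. 0xfff for a 4 KiB boundary */
	unsigned int max_segment_size;		/* assumed to be non-zero */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Number of bytes that may be added at paddr without crossing the segment
 * boundary or exceeding max_segment_size, capped at len.
 */
static unsigned int max_segment_bytes(const struct seg_limits *lim,
				      unsigned long paddr, unsigned int len)
{
	unsigned long mask = lim->seg_boundary_mask;
	/* Bytes left up to the boundary, minus one. */
	unsigned long to_boundary = mask - (mask & paddr);

	assert(lim->max_segment_size != 0);
	/*
	 * Take the minimum of the "minus one" values first and add 1 last.
	 * Adding 1 to to_boundary up front would wrap to 0 when
	 * mask == ULONG_MAX and (mask & paddr) == 0.
	 */
	return (unsigned int)min_ul(len,
		min_ul(to_boundary,
		       (unsigned long)lim->max_segment_size - 1) + 1);
}
```

With a 4 KiB boundary mask (0xfff) and an address 256 bytes short of the boundary, the sketch returns 256 even if far more bytes are requested. With a mask of ULONG_MAX the result is bounded by max_segment_size rather than wrapping to zero.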
|
|
|
|
|
2019-08-02 06:50:43 +08:00
|
|
|
/**
|
|
|
|
* bvec_split_segs - verify whether or not a bvec should be split in the middle
|
2022-07-28 00:23:00 +08:00
|
|
|
* @lim: [in] queue limits to split based on
|
2019-08-02 06:50:43 +08:00
|
|
|
* @bv: [in] bvec to examine
|
|
|
|
* @nsegs: [in,out] Number of segments in the bio being built. Incremented
|
|
|
|
* by the number of segments from @bv that may be appended to that
|
|
|
|
* bio without exceeding @max_segs
|
2022-06-11 03:58:25 +08:00
|
|
|
* @bytes: [in,out] Number of bytes in the bio being built. Incremented
|
|
|
|
* by the number of bytes from @bv that may be appended to that
|
|
|
|
* bio without exceeding @max_bytes
|
2019-08-02 06:50:43 +08:00
|
|
|
* @max_segs: [in] upper bound for *@nsegs
|
2022-06-11 03:58:25 +08:00
|
|
|
* @max_bytes: [in] upper bound for *@bytes
|
2019-08-02 06:50:43 +08:00
|
|
|
*
|
|
|
|
* When splitting a bio, it can happen that a bvec is encountered that is too
|
|
|
|
* big to fit in a single segment and hence that it has to be split in the
|
|
|
|
* middle. This function verifies whether or not that should happen. The value
|
|
|
|
* %true is returned if and only if appending the entire @bv to a bio with
|
|
|
|
* *@nsegs segments and *@bytes bytes would make that bio unacceptable for
|
|
|
|
* the block driver.
|
2019-02-15 19:13:12 +08:00
|
|
|
*/
|
2022-10-26 03:17:54 +08:00
|
|
|
static bool bvec_split_segs(const struct queue_limits *lim,
|
|
|
|
const struct bio_vec *bv, unsigned *nsegs, unsigned *bytes,
|
|
|
|
unsigned max_segs, unsigned max_bytes)
|
2019-02-15 19:13:12 +08:00
|
|
|
{
|
2022-06-11 03:58:25 +08:00
|
|
|
unsigned max_len = min(max_bytes, UINT_MAX) - *bytes;
|
2019-08-02 06:50:43 +08:00
|
|
|
unsigned len = min(bv->bv_len, max_len);
|
2019-02-15 19:13:12 +08:00
|
|
|
unsigned total_len = 0;
|
2019-08-02 06:50:42 +08:00
|
|
|
unsigned seg_size = 0;
|
2019-02-15 19:13:12 +08:00
|
|
|
|
2019-08-02 06:50:42 +08:00
|
|
|
while (len && *nsegs < max_segs) {
|
2024-07-06 15:52:18 +08:00
|
|
|
seg_size = get_max_segment_size(lim, bvec_phys(bv) + total_len, len);
|
2019-02-15 19:13:12 +08:00
|
|
|
|
2019-08-02 06:50:42 +08:00
|
|
|
(*nsegs)++;
|
2019-02-15 19:13:12 +08:00
|
|
|
total_len += seg_size;
|
|
|
|
len -= seg_size;
|
|
|
|
|
2022-07-28 00:23:00 +08:00
|
|
|
if ((bv->bv_offset + total_len) & lim->virt_boundary_mask)
|
2019-02-15 19:13:12 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2022-06-11 03:58:25 +08:00
|
|
|
*bytes += total_len;
|
2019-02-15 19:13:12 +08:00
|
|
|
|
2019-08-02 06:50:43 +08:00
|
|
|
/* tell the caller to split the bvec if it is too big to fit */
|
|
|
|
return len > 0 || bv->bv_len > max_len;
|
2019-02-15 19:13:12 +08:00
|
|
|
}
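The accounting done by bvec_split_segs() can be illustrated with a simplified userspace model. This is a sketch only, assuming a fixed maximum segment size and ignoring the segment and virt boundary masks; `split_segs()` and its `seg_size_max` parameter are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the loop above: carve bv_len bytes into segments of
 * at most seg_size_max bytes until either the segment budget (max_segs) or
 * the byte budget (max_bytes) is exhausted.
 */
static bool split_segs(unsigned int bv_len, unsigned int seg_size_max,
		       unsigned int *nsegs, unsigned int *bytes,
		       unsigned int max_segs, unsigned int max_bytes)
{
	unsigned int max_len, len, total_len = 0;

	assert(seg_size_max != 0 && *bytes <= max_bytes);
	max_len = max_bytes - *bytes;			/* remaining byte budget */
	len = bv_len < max_len ? bv_len : max_len;

	while (len && *nsegs < max_segs) {
		unsigned int seg = len < seg_size_max ? len : seg_size_max;

		(*nsegs)++;
		total_len += seg;
		len -= seg;
	}
	*bytes += total_len;

	/* true: the vector did not fit completely and has to be split */
	return len > 0 || bv_len > max_len;
}
```

A 10000-byte vector with 4096-byte segments consumes three segments and needs no split when eight segments are allowed. With only two segments allowed, 8192 bytes are accounted and the function reports that a split is required.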
|
|
|
|
|
2019-08-02 06:50:41 +08:00
|
|
|
/**
|
2022-07-28 00:22:55 +08:00
|
|
|
* bio_split_rw - split a bio in two bios
|
2019-08-02 06:50:41 +08:00
|
|
|
* @bio: [in] bio to be split
|
2022-07-28 00:23:00 +08:00
|
|
|
* @lim: [in] queue limits to split based on
|
2019-08-02 06:50:41 +08:00
|
|
|
* @segs: [out] number of segments in the bio with the first half of the sectors
|
2022-07-28 00:22:55 +08:00
|
|
|
* @bs: [in] bio set to allocate the clone from
|
2022-07-28 00:22:58 +08:00
|
|
|
* @max_bytes: [in] maximum number of bytes per bio
|
2019-08-02 06:50:41 +08:00
|
|
|
*
|
|
|
|
* Clone @bio, update the bi_iter of the clone to represent the first sectors
|
|
|
|
* of @bio and update @bio->bi_iter to represent the remaining sectors. The
|
|
|
|
* following is guaranteed for the cloned bio:
|
2022-07-28 00:22:58 +08:00
|
|
|
* - That it has at most @max_bytes worth of data
|
2019-08-02 06:50:41 +08:00
|
|
|
* - That it has at most @lim->max_segments segments.
|
|
|
|
*
|
|
|
|
* Except for discard requests the cloned bio will point at the bi_io_vec of
|
|
|
|
* the original bio. It is the responsibility of the caller to ensure that the
|
|
|
|
* original bio is not freed before the cloned bio. The caller is also
|
|
|
|
* responsible for ensuring that @bs is only destroyed after processing of the
|
|
|
|
* split bio has finished.
|
|
|
|
*/
|
2023-01-21 14:49:58 +08:00
|
|
|
struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
|
2022-07-28 00:22:58 +08:00
|
|
|
unsigned *segs, struct bio_set *bs, unsigned max_bytes)
|
2015-04-24 13:37:18 +08:00
|
|
|
{
|
2015-09-03 06:46:02 +08:00
|
|
|
struct bio_vec bv, bvprv, *bvprvp = NULL;
|
2015-04-24 13:37:18 +08:00
|
|
|
struct bvec_iter iter;
|
2022-06-11 03:58:25 +08:00
|
|
|
unsigned nsegs = 0, bytes = 0;
|
2015-04-24 13:37:18 +08:00
|
|
|
|
2019-02-15 19:13:12 +08:00
|
|
|
bio_for_each_bvec(bv, bio, iter) {
|
2015-04-24 13:37:18 +08:00
|
|
|
/*
|
|
|
|
* If the queue doesn't support SG gaps and adding this
|
|
|
|
* offset would create a gap, disallow it.
|
|
|
|
*/
|
2022-07-28 00:23:00 +08:00
|
|
|
if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
|
2015-04-24 13:37:18 +08:00
|
|
|
goto split;
|
|
|
|
|
2022-07-28 00:23:00 +08:00
|
|
|
if (nsegs < lim->max_segments &&
|
2022-06-11 03:58:25 +08:00
|
|
|
bytes + bv.bv_len <= max_bytes &&
|
2019-08-02 06:50:43 +08:00
|
|
|
bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
|
|
|
|
nsegs++;
|
2022-06-11 03:58:25 +08:00
|
|
|
bytes += bv.bv_len;
|
2022-07-28 00:23:00 +08:00
|
|
|
} else {
|
|
|
|
if (bvec_split_segs(lim, &bv, &nsegs, &bytes,
|
|
|
|
lim->max_segments, max_bytes))
|
|
|
|
goto split;
|
2016-01-13 06:08:39 +08:00
|
|
|
}
|
|
|
|
|
2015-04-24 13:37:18 +08:00
|
|
|
bvprv = bv;
|
2015-11-24 10:35:29 +08:00
|
|
|
bvprvp = &bvprv;
|
2015-04-24 13:37:18 +08:00
|
|
|
}
|
|
|
|
|
2019-06-06 18:29:03 +08:00
|
|
|
*segs = nsegs;
|
|
|
|
return NULL;
|
2015-04-24 13:37:18 +08:00
|
|
|
split:
|
block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary; an atomic write that straddles it is no
longer executed atomically by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic write limits are set to 0 by default to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are not supported for now either.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
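The two merge conditions above can be written as a small predicate. This is a hedged sketch, not the kernel's merge code: `atomic_merge_ok()` is a hypothetical helper, all values are taken in bytes, and a boundary of 0 is assumed to mean that the device reports no boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * start:     byte offset of the first of two adjacent atomic writes
 * len_a/b:   lengths of the two writes to be merged
 * max_bytes: stand-in for atomic_write_max_bytes
 * boundary:  stand-in for atomic_write_boundary_bytes (0 or a power of 2)
 */
static bool atomic_merge_ok(uint64_t start, uint64_t len_a, uint64_t len_b,
			    uint64_t max_bytes, uint64_t boundary)
{
	uint64_t total = len_a + len_b;

	assert(total != 0);
	assert(boundary == 0 || (boundary & (boundary - 1)) == 0);

	/* Condition 1: total resultant request length <= the maximum. */
	if (total > max_bytes)
		return false;
	if (boundary == 0)
		return true;
	/*
	 * Condition 2: no straddling. The first and the last byte must fall
	 * into the same boundary-sized window.
	 */
	return start / boundary == (start + total - 1) / boundary;
}
```

For example, two 4 KiB writes starting at offset 0 may be merged under an 8 KiB boundary, while the same pair starting at offset 4096 would straddle the boundary and is rejected.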
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 20:53:54 +08:00
|
|
|
if (bio->bi_opf & REQ_ATOMIC) {
|
|
|
|
bio->bi_status = BLK_STS_INVAL;
|
|
|
|
bio_endio(bio);
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
2023-01-04 23:52:06 +08:00
|
|
|
/*
|
|
|
|
* We can't sanely support splitting for a REQ_NOWAIT bio. End it
|
|
|
|
* with EAGAIN if splitting is required and return an error pointer.
|
|
|
|
*/
|
|
|
|
if (bio->bi_opf & REQ_NOWAIT) {
|
|
|
|
bio->bi_status = BLK_STS_AGAIN;
|
|
|
|
bio_endio(bio);
|
|
|
|
return ERR_PTR(-EAGAIN);
|
|
|
|
}
|
|
|
|
|
2015-10-20 23:13:52 +08:00
|
|
|
*segs = nsegs;
|
block: disable iopoll for split bio
iopoll was originally intended for small, latency-sensitive IO. It doesn't
work well for big IO, especially when it needs to be split into multiple
bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
indeed the cookie of the last split bio. The completion of *this* last
split bio done by iopoll doesn't mean the whole original bio has
completed. Callers of iopoll still need to wait for completion of other
split bios.
Besides, bio splitting may cause more trouble for iopoll, which isn't
supposed to be used in case of big IO.
iopoll for split bio may cause potential race if CPU migration happens
during bio submission. Since the returned cookie is that of the last
split bio, polling on the corresponding hardware queue doesn't help
complete other split bios, if these split bios are enqueued into
different hardware queues. Since interrupts are disabled for polling
queues, the completion of these other split bios depends on timeout
mechanism, thus causing a potential hang.
iopoll for split bio may also cause hang for sync polling. Currently
both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling
in direct IO routine. These routines will submit bio without REQ_NOWAIT
flag set, and then start sync polling in current process context. The
process may hang in blk_mq_get_tag() if the submitted bio has to be
split into multiple bios, which can rapidly exhaust the queue depth. The
process is then waiting for the completion of the previously allocated
requests, which should be reaped by the following polling, thus
causing a deadlock.
To avoid the subtle trouble described above, just disable iopoll for
split bio and return BLK_QC_T_NONE in this case. The side effect is that
non-HIPRI IO also returns BLK_QC_T_NONE now. It should be acceptable
since the returned cookie is never used for non-HIPRI IO.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-26 17:18:52 +08:00
|
|
|
|
2022-06-11 03:58:25 +08:00
|
|
|
/*
|
|
|
|
* Individual bvecs might not be logical block aligned. Round down the
|
|
|
|
* split size so that each bio is properly block size aligned, even if
|
|
|
|
* we do not use the full hardware limits.
|
|
|
|
*/
|
2022-07-28 00:23:00 +08:00
|
|
|
bytes = ALIGN_DOWN(bytes, lim->logical_block_size);
|
2022-06-11 03:58:25 +08:00
|
|
|
|
2020-11-26 17:18:52 +08:00
|
|
|
/*
|
|
|
|
* Bio splitting may cause subtle trouble such as hang when doing sync
|
|
|
|
* iopoll in direct IO routine. Given performance gain of iopoll for
|
|
|
|
* big IO can be trivial, disable iopoll when split needed.
|
|
|
|
*/
|
2021-10-12 19:12:21 +08:00
|
|
|
bio_clear_polled(bio);
|
2022-06-11 03:58:25 +08:00
|
|
|
return bio_split(bio, bytes >> SECTOR_SHIFT, GFP_NOIO, bs);
|
2015-04-24 13:37:18 +08:00
|
|
|
}
|
2023-01-21 14:49:58 +08:00
|
|
|
EXPORT_SYMBOL_GPL(bio_split_rw);
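The rounding step at the end of bio_split_rw() is easy to check by hand. The sketch below assumes a power-of-2 logical block size and 512-byte sectors; `align_down_pow2()` and `split_sectors()` are hypothetical userspace helpers that mimic ALIGN_DOWN() followed by the shift by SECTOR_SHIFT.

```c
#include <assert.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Round x down to a multiple of a, where a must be a power of 2. */
static unsigned int align_down_pow2(unsigned int x, unsigned int a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return x & ~(a - 1);
}

/* Sectors handed to the split after rounding down to the block size. */
static unsigned int split_sectors(unsigned int bytes,
				  unsigned int logical_block_size)
{
	return align_down_pow2(bytes, logical_block_size) >> SECTOR_SHIFT;
}
```

With 10000 accumulated bytes and a 4096-byte logical block size, the split point is rounded down to 8192 bytes, that is 16 sectors. With 512-byte blocks the same input gives 9728 bytes, or 19 sectors.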
|
2015-04-24 13:37:18 +08:00
|
|
|
|
2019-08-02 06:50:41 +08:00
|
|
|
/**
|
2022-07-28 00:22:55 +08:00
|
|
|
* __bio_split_to_limits - split a bio to fit the queue limits
|
|
|
|
* @bio: bio to be split
|
2022-07-28 00:23:00 +08:00
|
|
|
* @lim: queue limits to split based on
|
2022-07-28 00:22:55 +08:00
|
|
|
* @nr_segs: returns the number of segments in the returned bio
|
|
|
|
*
|
|
|
|
* Check if @bio needs splitting based on the queue limits, and if so split off
|
|
|
|
* a bio fitting the limits from the beginning of @bio and return it. @bio is
|
|
|
|
* shortened to the remainder and re-submitted.
|
2019-08-02 06:50:41 +08:00
|
|
|
*
|
2022-07-28 00:22:55 +08:00
|
|
|
* The split bio is allocated from @q->bio_split, which is provided by the
|
|
|
|
* block layer.
|
2019-08-02 06:50:41 +08:00
|
|
|
*/
|
2022-10-26 03:17:54 +08:00
|
|
|
struct bio *__bio_split_to_limits(struct bio *bio,
|
|
|
|
const struct queue_limits *lim,
|
|
|
|
unsigned int *nr_segs)
|
2015-04-24 13:37:18 +08:00
|
|
|
{
|
2022-07-28 00:22:57 +08:00
|
|
|
struct bio_set *bs = &bio->bi_bdev->bd_disk->bio_split;
|
2022-07-28 00:22:55 +08:00
|
|
|
struct bio *split;
|
2015-04-24 13:37:18 +08:00
|
|
|
|
2022-07-28 00:22:55 +08:00
|
|
|
switch (bio_op(bio)) {
|
2016-08-16 15:59:35 +08:00
|
|
|
case REQ_OP_DISCARD:
|
|
|
|
case REQ_OP_SECURE_ERASE:
|
2022-07-28 00:23:00 +08:00
|
|
|
split = bio_split_discard(bio, lim, nr_segs, bs);
|
2016-08-16 15:59:35 +08:00
|
|
|
break;
|
2016-12-01 04:28:59 +08:00
|
|
|
case REQ_OP_WRITE_ZEROES:
|
2022-07-28 00:23:00 +08:00
|
|
|
split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
|
2016-12-01 04:28:59 +08:00
|
|
|
break;
|
2016-08-16 15:59:35 +08:00
|
|
|
default:
|
2022-07-28 00:23:00 +08:00
|
|
|
split = bio_split_rw(bio, lim, nr_segs, bs,
|
|
|
|
get_max_io_size(bio, lim) << SECTOR_SHIFT);
|
2023-01-04 23:51:19 +08:00
|
|
|
if (IS_ERR(split))
|
|
|
|
return NULL;
|
2016-08-16 15:59:35 +08:00
|
|
|
break;
|
|
|
|
}
|
2015-10-20 23:13:52 +08:00
|
|
|
|
2015-04-24 13:37:18 +08:00
|
|
|
if (split) {
|
2023-01-04 23:51:19 +08:00
|
|
|
/* there is no chance to merge the split bio */
|
2016-08-06 05:35:16 +08:00
|
|
|
split->bi_opf |= REQ_NOMERGE;
|
2015-10-20 23:13:53 +08:00
|
|
|
|
2022-07-13 22:02:26 +08:00
|
|
|
blkcg_bio_issue_init(split);
|
2022-07-28 00:22:55 +08:00
|
|
|
bio_chain(split, bio);
|
|
|
|
trace_block_split(split, bio->bi_iter.bi_sector);
|
block: Introduce zone write plugging
Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most only one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.
Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.
This mechanism has the following benefits:
- Untangle zone write ordering from block IO schedulers. This allows
removing the restriction on using mq-deadline for writing to zoned
block devices. Any block IO scheduler, including "none" can be used.
- Zone write plugging operates on BIOs instead of requests. Plugged
BIOs waiting for execution thus do not hold scheduling tags and thus
are not preventing other BIOs from executing (reads or writes to
other zones). Depending on the workload, this can significantly
improve the device use (higher queue depth operation) and
performance.
- Both blk-mq (request based) zoned devices and BIO-based zoned devices
(e.g. device mapper) can use zone write plugging. It is mandatory
for the former but optional for the latter. BIO-based drivers can
use zone write plugging to implement write ordering guarantees, or
the drivers can implement their own if needed.
- The code is less invasive in the block layer and is mostly limited to
blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
bio.c.
Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plug structures are
managed using a per-disk hash table.
Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio() which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugs. This change enables zone write plugging by default for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by explicitly calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and not
straddling zone boundaries.
Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of flagged BIOs and requests triggers calls to the
functions blk_zone_write_bio_endio() and
blk_zone_write_complete_request(), respectively. The latter function is used to
trigger submission of the next plugged BIO using the zone plug work.
blk_zone_write_bio_endio() does the same for BIO-based devices.
This ensures that at any time, at most one request (blk-mq devices) or
one BIO (BIO-based devices) is being executed for any zone. The
handling of zone write plugs using a per-zone plug spinlock maximizes
parallelism and device usage by allowing multiple zones to be written
simultaneously without lock contention.
Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.
Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bio_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.
When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.
Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
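The write pointer offset bookkeeping described above can be modelled in a few lines. This is an illustrative sketch only: `struct zone_state` and the helpers below are hypothetical, and the real struct blk_zone_wplug carries more state (spinlock, BIO list, flags) that also influences when a plug is freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone state, tracked in sectors. */
struct zone_state {
	unsigned int wp_offset;	/* write pointer offset from the zone start */
	unsigned int zone_size;	/* zone size */
};

/* Write operations always advance the write pointer offset. */
static void zone_write(struct zone_state *z, unsigned int nr_sectors)
{
	assert(nr_sectors <= z->zone_size - z->wp_offset);
	z->wp_offset += nr_sectors;
}

/* Zone reset sets the offset to 0. */
static void zone_reset(struct zone_state *z)
{
	z->wp_offset = 0;
}

/* Zone finish sets the offset to the zone size. */
static void zone_finish(struct zone_state *z)
{
	z->wp_offset = z->zone_size;
}

/* Assumed criterion: an empty or full zone no longer needs its plug. */
static bool zone_plug_unneeded(const struct zone_state *z)
{
	return z->wp_offset == 0 || z->wp_offset == z->zone_size;
}
```

In this model a partially written zone keeps its plug, while a zone that has been fully written, reset or finished satisfies the predicate.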
If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plugs work keeps running
until all zone errors are handled.
To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plugs resources when a gendisk
is allocated.
In order to guarantee that the user can simultaneously write up to a
number of zones equal to a device max active zone limit or max open zone
limit, zone write plugs are allocated using a mempool sized to the
maximum of these 2 device limits. For a device that does not have
active and open zone limits, 128 is used as the default mempool size.
If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.
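The mempool sizing rule from the paragraph above amounts to a maximum with a fallback. A minimal sketch, assuming that a limit of 0 means "no limit reported"; `wplug_pool_size()` and `DEFAULT_WPLUG_POOL_SIZE` are hypothetical names, only the default of 128 comes from the text.

```c
#include <assert.h>

#define DEFAULT_WPLUG_POOL_SIZE 128	/* default stated in the commit message */

/* Pool size: the larger of the two zone limits, or the default if both are 0. */
static unsigned int wplug_pool_size(unsigned int max_active_zones,
				    unsigned int max_open_zones)
{
	unsigned int n = max_active_zones > max_open_zones ?
			 max_active_zones : max_open_zones;

	if (n == 0)
		n = DEFAULT_WPLUG_POOL_SIZE;
	assert(n > 0);	/* the pool is never sized to zero */
	return n;
}
```

A device reporting 14 active and 8 open zones would get a pool of 14 entries under this rule, and a device without either limit would get 128.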
This commit contains contributions from Christoph Hellwig <hch@lst.de>.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-08 09:41:07 +08:00
|
|
|
WARN_ON_ONCE(bio_zone_write_plugging(bio));
|
2022-07-28 00:22:55 +08:00
|
|
|
submit_bio_noacct(bio);
|
|
|
|
return split;
|
2015-04-24 13:37:18 +08:00
|
|
|
}
|
2022-07-28 00:22:55 +08:00
|
|
|
return bio;
|
2015-04-24 13:37:18 +08:00
|
|
|
}
|
2019-06-06 18:29:01 +08:00
|
|
|
|
2019-08-02 06:50:41 +08:00
|
|
|
/**
|
2022-07-28 00:22:55 +08:00
|
|
|
* bio_split_to_limits - split a bio to fit the queue limits
|
|
|
|
* @bio: bio to be split
|
|
|
|
*
|
|
|
|
* Check if @bio needs splitting based on the queue limits of @bio->bi_bdev, and
|
|
|
|
* if so split off a bio fitting the limits from the beginning of @bio and
|
|
|
|
* return it. @bio is shortened to the remainder and re-submitted.
|
2019-08-02 06:50:41 +08:00
|
|
|
*
|
2022-07-28 00:22:55 +08:00
|
|
|
* The split bio is allocated from @q->bio_split, which is provided by the
|
|
|
|
* block layer.
|
2019-08-02 06:50:41 +08:00
|
|
|
*/
|
2022-07-28 00:22:55 +08:00
|
|
|
struct bio *bio_split_to_limits(struct bio *bio)
|
2019-06-06 18:29:01 +08:00
|
|
|
{
|
2022-10-26 03:17:54 +08:00
|
|
|
const struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)->limits;
|
2019-06-06 18:29:01 +08:00
|
|
|
unsigned int nr_segs;
|
|
|
|
|
2022-07-28 00:23:00 +08:00
|
|
|
if (bio_may_exceed_limits(bio, lim))
|
|
|
|
return __bio_split_to_limits(bio, lim, &nr_segs);
|
2022-07-28 00:22:55 +08:00
|
|
|
return bio;
|
2019-06-06 18:29:01 +08:00
|
|
|
}
|
2022-07-28 00:22:55 +08:00
|
|
|
EXPORT_SYMBOL(bio_split_to_limits);
|
2015-04-24 13:37:18 +08:00
|
|
|
|
2019-06-06 18:29:02 +08:00
|
|
|
unsigned int blk_recalc_rq_segments(struct request *rq)
|
2008-01-29 21:04:06 +08:00
|
|
|
{
|
2019-05-21 15:01:43 +08:00
|
|
|
unsigned int nr_phys_segs = 0;
|
2022-06-11 03:58:25 +08:00
|
|
|
unsigned int bytes = 0;
|
2019-06-06 18:29:02 +08:00
|
|
|
struct req_iterator iter;
|
2019-05-21 15:01:43 +08:00
|
|
|
struct bio_vec bv;
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2019-06-06 18:29:02 +08:00
|
|
|
if (!rq->bio)
|
2009-02-23 16:03:10 +08:00
|
|
|
return 0;
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2019-06-06 18:29:02 +08:00
|
|
|
switch (bio_op(rq->bio)) {
|
2016-12-01 04:28:59 +08:00
|
|
|
case REQ_OP_DISCARD:
|
|
|
|
case REQ_OP_SECURE_ERASE:
|
2021-02-11 22:38:07 +08:00
|
|
|
if (queue_max_discard_segments(rq->q) > 1) {
|
|
|
|
struct bio *bio = rq->bio;
|
|
|
|
|
|
|
|
for_each_bio(bio)
|
|
|
|
nr_phys_segs++;
|
|
|
|
return nr_phys_segs;
|
|
|
|
}
|
|
|
|
return 1;
|
2016-12-01 04:28:59 +08:00
|
|
|
case REQ_OP_WRITE_ZEROES:
|
2016-12-09 06:20:32 +08:00
|
|
|
return 0;
|
2022-07-15 02:06:30 +08:00
|
|
|
default:
|
|
|
|
break;
|
2016-12-01 04:28:59 +08:00
|
|
|
}
|
2014-02-08 04:53:46 +08:00
|
|
|
|
2019-06-06 18:29:02 +08:00
|
|
|
rq_for_each_bvec(bv, rq, iter)
|
2022-07-28 00:23:00 +08:00
|
|
|
bvec_split_segs(&rq->q->limits, &bv, &nr_phys_segs, &bytes,
|
2019-08-02 06:50:43 +08:00
|
|
|
UINT_MAX, UINT_MAX);
|
2009-02-23 16:03:10 +08:00
|
|
|
return nr_phys_segs;
|
|
|
|
}
|
|
|
|
|
2019-02-27 20:40:11 +08:00
|
|
|
static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
|
2019-02-15 19:13:13 +08:00
|
|
|
struct scatterlist *sglist)
|
|
|
|
{
|
|
|
|
if (!*sg)
|
|
|
|
return sglist;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the driver previously mapped a shorter list, we could see a
|
|
|
|
* termination bit prematurely unless it fully inits the sg table
|
|
|
|
* on each mapping. We KNOW that there must be more entries here
|
|
|
|
* or the driver would be buggy, so force clear the termination bit
|
|
|
|
* to avoid doing a full sg_init_table() in drivers for each command.
|
|
|
|
*/
|
|
|
|
sg_unmark_end(*sg);
|
|
|
|
return sg_next(*sg);
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned blk_bvec_map_sg(struct request_queue *q,
|
|
|
|
struct bio_vec *bvec, struct scatterlist *sglist,
|
|
|
|
struct scatterlist **sg)
|
|
|
|
{
|
|
|
|
unsigned nbytes = bvec->bv_len;
|
2019-04-11 14:23:27 +08:00
|
|
|
unsigned nsegs = 0, total = 0;
|
2019-02-15 19:13:13 +08:00
|
|
|
|
|
|
|
while (nbytes > 0) {
|
2019-04-11 14:23:27 +08:00
|
|
|
unsigned offset = bvec->bv_offset + total;
|
2024-07-09 15:01:25 +08:00
|
|
|
unsigned len = get_max_segment_size(&q->limits,
|
|
|
|
bvec_phys(bvec) + total, nbytes);
|
2019-04-19 14:56:24 +08:00
|
|
|
struct page *page = bvec->bv_page;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Unfortunately a fair number of drivers barf on scatterlists
|
|
|
|
* that have an offset larger than PAGE_SIZE, despite other
|
|
|
|
* subsystems dealing with that invariant just fine. For now
|
|
|
|
* stick to the legacy format where we never present those from
|
|
|
|
* the block layer, but the code below should be removed once
|
|
|
|
* these offenders (mostly MMC/SD drivers) are fixed.
|
|
|
|
*/
|
|
|
|
page += (offset >> PAGE_SHIFT);
|
|
|
|
offset &= ~PAGE_MASK;
|
2019-02-15 19:13:13 +08:00
|
|
|
|
|
|
|
*sg = blk_next_sg(sg, sglist);
|
2019-04-19 14:56:24 +08:00
|
|
|
sg_set_page(*sg, page, len, offset);
|
2019-02-15 19:13:13 +08:00
|
|
|
|
2019-04-11 14:23:27 +08:00
|
|
|
total += len;
|
|
|
|
nbytes -= len;
|
2019-02-15 19:13:13 +08:00
|
|
|
nsegs++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return nsegs;
|
|
|
|
}
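The loop in blk_bvec_map_sg() above can be illustrated with a small userspace sketch. It assumes a 4 KiB page, replaces `struct page *` arithmetic with a plain page index, and uses a fixed maximum segment size instead of get_max_segment_size(), which in the kernel also honours the queue's segment boundary. All `sk_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12u
#define SK_PAGE_SIZE  (1u << SK_PAGE_SHIFT)

/* One output segment: which page it starts in, where, and how long it is. */
struct sk_seg {
	size_t page_idx;
	unsigned int offset;
	unsigned int len;
};

/*
 * Split a multi-page vector into segments of at most max_seg bytes.
 * As in the kernel loop, the running offset is folded back into a
 * (page, offset-within-page) pair so no segment offset reaches PAGE_SIZE.
 */
static unsigned int sk_split_bvec(size_t first_page, unsigned int bv_offset,
				  unsigned int bv_len, unsigned int max_seg,
				  struct sk_seg *out, unsigned int out_cap)
{
	unsigned int total = 0, nsegs = 0;

	assert(max_seg > 0);
	while (total < bv_len && nsegs < out_cap) {
		unsigned int offset = bv_offset + total;
		unsigned int len = bv_len - total;

		if (len > max_seg)
			len = max_seg;

		out[nsegs].page_idx = first_page + (offset >> SK_PAGE_SHIFT);
		out[nsegs].offset = offset & (SK_PAGE_SIZE - 1);
		out[nsegs].len = len;

		total += len;
		nsegs++;
	}
	return nsegs;
}
```

The mask `offset & (SK_PAGE_SIZE - 1)` is the same operation as `offset &= ~PAGE_MASK` in the kernel, where PAGE_MASK is the complement of `PAGE_SIZE - 1`.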
|
|
|
|
|
2019-03-17 18:01:11 +08:00
|
|
|
static inline int __blk_bvec_map_sg(struct bio_vec bv,
|
|
|
|
struct scatterlist *sglist, struct scatterlist **sg)
|
|
|
|
{
|
|
|
|
*sg = blk_next_sg(sg, sglist);
|
|
|
|
sg_set_page(*sg, bv.bv_page, bv.bv_len, bv.bv_offset);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2019-03-17 18:01:12 +08:00
|
|
|
/* only try to merge bvecs into one sg if they are from two bios */
|
|
|
|
static inline bool
|
|
|
|
__blk_segment_map_sg_merge(struct request_queue *q, struct bio_vec *bvec,
|
|
|
|
struct bio_vec *bvprv, struct scatterlist **sg)
|
2012-08-03 05:42:03 +08:00
|
|
|
{
|
|
|
|
|
|
|
|
int nbytes = bvec->bv_len;
|
|
|
|
|
2019-03-17 18:01:12 +08:00
|
|
|
if (!*sg)
|
|
|
|
return false;
|
2012-08-03 05:42:03 +08:00
|
|
|
|
2019-03-17 18:01:12 +08:00
|
|
|
if ((*sg)->length + nbytes > queue_max_segment_size(q))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (!biovec_phys_mergeable(q, bvprv, bvec))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
(*sg)->length += nbytes;
|
|
|
|
|
|
|
|
return true;
|
2012-08-03 05:42:03 +08:00
|
|
|
}
|
|
|
|
|
2014-02-08 04:53:46 +08:00
|
|
|
static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
|
|
|
|
struct scatterlist *sglist,
|
|
|
|
struct scatterlist **sg)
|
2008-01-29 21:04:06 +08:00
|
|
|
{
|
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-04 04:09:38 +08:00
|
|
|
struct bio_vec bvec, bvprv = { NULL };
|
2014-02-08 04:53:46 +08:00
|
|
|
struct bvec_iter iter;
|
2018-12-13 23:17:10 +08:00
|
|
|
int nsegs = 0;
|
2019-03-17 18:01:12 +08:00
|
|
|
bool new_bio = false;
|
2014-02-08 04:53:46 +08:00
|
|
|
|
2019-03-17 18:01:12 +08:00
|
|
|
for_each_bio(bio) {
|
|
|
|
bio_for_each_bvec(bvec, bio, iter) {
|
|
|
|
/*
|
|
|
|
* Only try to merge bvecs from two bios given we
|
|
|
|
* have done bio internal merge when adding pages
|
|
|
|
* to bio
|
|
|
|
*/
|
|
|
|
if (new_bio &&
|
|
|
|
__blk_segment_map_sg_merge(q, &bvec, &bvprv, sg))
|
|
|
|
goto next_bvec;
|
|
|
|
|
|
|
|
if (bvec.bv_offset + bvec.bv_len <= PAGE_SIZE)
|
|
|
|
nsegs += __blk_bvec_map_sg(bvec, sglist, sg);
|
|
|
|
else
|
|
|
|
nsegs += blk_bvec_map_sg(q, &bvec, sglist, sg);
|
|
|
|
next_bvec:
|
|
|
|
new_bio = false;
|
|
|
|
}
|
2019-04-02 10:26:44 +08:00
|
|
|
if (likely(bio->bi_iter.bi_size)) {
|
|
|
|
bvprv = bvec;
|
|
|
|
new_bio = true;
|
|
|
|
}
|
2019-03-17 18:01:12 +08:00
|
|
|
}
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2014-02-08 04:53:46 +08:00
|
|
|
return nsegs;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* map a request to scatterlist, return number of sg entries setup. Caller
|
|
|
|
* must make sure sg can hold rq->nr_phys_segments entries
|
|
|
|
*/
|
2020-04-14 15:42:22 +08:00
|
|
|
int __blk_rq_map_sg(struct request_queue *q, struct request *rq,
|
|
|
|
struct scatterlist *sglist, struct scatterlist **last_sg)
|
2014-02-08 04:53:46 +08:00
|
|
|
{
|
|
|
|
int nsegs = 0;
|
|
|
|
|
2016-12-09 06:20:32 +08:00
|
|
|
if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
|
2020-04-14 15:42:22 +08:00
|
|
|
nsegs = __blk_bvec_map_sg(rq->special_vec, sglist, last_sg);
|
2016-12-09 06:20:32 +08:00
|
|
|
else if (rq->bio)
|
2020-04-14 15:42:22 +08:00
|
|
|
nsegs = __blk_bios_map_sg(q, rq->bio, sglist, last_sg);
|
2008-04-11 18:56:52 +08:00
|
|
|
|
2020-04-14 15:42:22 +08:00
|
|
|
if (*last_sg)
|
|
|
|
sg_mark_end(*last_sg);
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2015-11-24 10:35:31 +08:00
|
|
|
/*
|
|
|
|
* Something must have gone wrong if the computed number of
|
|
|
|
* segments is bigger than the number of the request's physical segments
|
|
|
|
*/
|
2016-12-09 06:20:32 +08:00
|
|
|
WARN_ON(nsegs > blk_rq_nr_phys_segments(rq));
|
2015-11-24 10:35:31 +08:00
|
|
|
|
2008-01-29 21:04:06 +08:00
|
|
|
return nsegs;
|
|
|
|
}
|
2020-04-14 15:42:22 +08:00
|
|
|
EXPORT_SYMBOL(__blk_rq_map_sg);
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2021-09-20 20:33:26 +08:00
|
|
|
static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
|
|
|
|
sector_t offset)
|
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
2024-06-20 20:53:51 +08:00
|
|
|
struct queue_limits *lim = &q->limits;
|
|
|
|
unsigned int max_sectors, boundary_sectors;
|
block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary; an atomic write which straddles it is no
longer executed atomically by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic write limits are by default set to 0 to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
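The size bound described in this paragraph can be worked through in a short sketch. It follows the text literally: the usable segment count is the smaller of the queue's max segments and BIO_MAX_VECS, the first and last segments are assumed to carry only one logical block each, and every segment in between carries a full page. The function names and the explicit parameters are hypothetical; rounding down to a power of two reflects the stated requirement on the unit limits, not a confirmed implementation detail.

```c
#include <assert.h>

/* Largest BIO size (bytes) guaranteed not to need splitting, per the text. */
static unsigned int sk_max_guaranteed_bio(unsigned int max_segments,
					  unsigned int bio_max_vecs,
					  unsigned int page_size,
					  unsigned int lbs)
{
	unsigned int segs = max_segments < bio_max_vecs ?
			    max_segments : bio_max_vecs;
	unsigned int edge = segs < 2 ? segs : 2;	/* first + last segment */
	unsigned int length = edge * lbs;

	if (segs > 2)
		length += (segs - 2) * page_size;	/* full middle pages */
	return length;
}

/* Round x down to the nearest power of two (x must be non-zero). */
static unsigned int sk_rounddown_pow2(unsigned int x)
{
	unsigned int p = 1;

	assert(x > 0);
	while (p <= x / 2)
		p *= 2;
	return p;
}
```

With 128 segments, 4096-byte pages and 512-byte logical blocks, the bound is 2 * 512 + 126 * 4096 = 517120 bytes, which rounds down to 262144 bytes as a power-of-two unit limit.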
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
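The two merge conditions listed above can be expressed as one predicate. This is a sketch in byte units under simplifying assumptions: the two writes are contiguous and start at `start`, and a boundary of 0 means the device has no boundary. The function name is hypothetical; in the kernel the equivalent checks are spread over the request merge helpers rather than collected in one function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if two contiguous atomic writes of len_a and len_b bytes,
 * the first starting at byte offset start, may be merged.
 */
static bool sk_atomic_merge_ok(uint64_t start, uint64_t len_a, uint64_t len_b,
			       uint64_t atomic_write_max_bytes,
			       uint64_t boundary_bytes)
{
	uint64_t total = len_a + len_b;

	/* The commit text requires the boundary to be a power of two. */
	assert((boundary_bytes & (boundary_bytes - 1)) == 0);

	/* Condition 1: resulting length within atomic_write_max_bytes. */
	if (total == 0 || total > atomic_write_max_bytes)
		return false;

	/* Condition 2: first and last byte fall in the same boundary window. */
	if (boundary_bytes == 0)
		return true;
	return start / boundary_bytes == (start + total - 1) / boundary_bytes;
}
```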
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
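The partition rule in this paragraph amounts to two alignment checks. The sketch below is a userspace approximation with a hypothetical function name; it assumes all values are in sectors, that a unit minimum of 0 means atomic writes are unsupported (the default described earlier), and that a boundary of 0 means no boundary applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Can atomic writes be issued to a partition starting at part_start_sectors? */
static bool sk_partition_can_atomic_write(uint64_t part_start_sectors,
					  unsigned int unit_min_sectors,
					  unsigned int boundary_sectors)
{
	/* The unit minimum must be a power of two per the commit text. */
	assert((unit_min_sectors & (unit_min_sectors - 1)) == 0);

	if (unit_min_sectors == 0)
		return false;	/* limits default to 0: no atomic write support */
	if (part_start_sectors % unit_min_sectors)
		return false;	/* start not aligned to the minimum unit */
	if (boundary_sectors && part_start_sectors % boundary_sectors)
		return false;	/* start not aligned to the hardware boundary */
	return true;
}
```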
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 20:53:54 +08:00
|
|
|
bool is_atomic = rq->cmd_flags & REQ_ATOMIC;
|
2021-09-20 20:33:26 +08:00
|
|
|
|
|
|
|
if (blk_rq_is_passthrough(rq))
|
|
|
|
return q->limits.max_hw_sectors;
|
|
|
|
|
2024-06-20 20:53:54 +08:00
|
|
|
boundary_sectors = blk_boundary_sectors(lim, is_atomic);
|
2024-06-20 20:53:50 +08:00
|
|
|
max_sectors = blk_queue_get_max_sectors(rq);
|
|
|
|
|
2024-06-20 20:53:51 +08:00
|
|
|
if (!boundary_sectors ||
|
2021-09-20 20:33:26 +08:00
|
|
|
req_op(rq) == REQ_OP_DISCARD ||
|
|
|
|
req_op(rq) == REQ_OP_SECURE_ERASE)
|
2022-06-14 17:09:31 +08:00
|
|
|
return max_sectors;
|
|
|
|
return min(max_sectors,
|
2024-06-20 20:53:51 +08:00
|
|
|
blk_boundary_sectors_left(offset, boundary_sectors));
|
2021-09-20 20:33:26 +08:00
|
|
|
}
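The boundary clamp at the end of blk_rq_get_max_sectors() can be illustrated in userspace. The sketch below assumes sector units and uses a plain modulo so that it works for any non-zero boundary; the `sk_` names are hypothetical stand-ins for blk_boundary_sectors_left() and the surrounding logic, and the discard, secure erase, and passthrough special cases are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Sectors remaining from offset up to the next boundary. */
static unsigned int sk_boundary_sectors_left(uint64_t offset,
					     unsigned int boundary_sectors)
{
	assert(boundary_sectors > 0);
	return boundary_sectors - (unsigned int)(offset % boundary_sectors);
}

/* Largest request size allowed at offset, given a size cap and a boundary. */
static unsigned int sk_max_sectors_at(uint64_t offset, unsigned int max_sectors,
				      unsigned int boundary_sectors)
{
	unsigned int left;

	if (!boundary_sectors)
		return max_sectors;	/* no boundary: only the size cap applies */

	left = sk_boundary_sectors_left(offset, boundary_sectors);
	return left < max_sectors ? left : max_sectors;
}
```

For a request at sector 100 with a 128-sector boundary, only 28 sectors remain before the boundary, so a 256-sector cap is reduced to 28.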
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
|
|
|
|
unsigned int nr_phys_segs)
|
2008-01-29 21:04:06 +08:00
|
|
|
{
|
2022-03-15 08:30:11 +08:00
|
|
|
if (!blk_cgroup_mergeable(req, bio))
|
|
|
|
goto no_merge;
|
|
|
|
|
2021-06-28 10:33:12 +08:00
|
|
|
if (blk_integrity_merge_bio(req->q, req, bio) == false)
|
2010-09-11 02:50:10 +08:00
|
|
|
goto no_merge;
|
|
|
|
|
2021-06-28 10:33:12 +08:00
|
|
|
/* discard request merge won't add new segment */
|
|
|
|
if (req_op(req) == REQ_OP_DISCARD)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
|
2010-09-11 02:50:10 +08:00
|
|
|
goto no_merge;
|
2008-01-29 21:04:06 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This will form the start of a new hw segment. Bump both
|
|
|
|
* counters.
|
|
|
|
*/
|
|
|
|
req->nr_phys_segments += nr_phys_segs;
|
|
|
|
return 1;
|
2010-09-11 02:50:10 +08:00
|
|
|
|
|
|
|
no_merge:
|
2019-06-06 18:29:01 +08:00
|
|
|
req_set_nomerge(req->q, req);
|
2010-09-11 02:50:10 +08:00
|
|
|
return 0;
|
2008-01-29 21:04:06 +08:00
|
|
|
}
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
|
2008-01-29 21:04:06 +08:00
|
|
|
{
|
2015-09-04 00:28:20 +08:00
|
|
|
if (req_gap_back_merge(req, bio))
|
|
|
|
return 0;
|
2015-09-11 23:03:04 +08:00
|
|
|
if (blk_integrity_rq(req) &&
|
|
|
|
integrity_req_gap_back_merge(req, bio))
|
|
|
|
return 0;
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
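The contiguity requirement on data unit numbers (DUNs) can be sketched as a simple check. This is an illustration only: it models the DUN as a single 64-bit value, uses a hypothetical function name, and does not model the requirement that both sides share the same key and algorithm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A bio may be appended to a request only if its first DUN directly follows
 * the last data unit already covered by the request.
 */
static bool sk_dun_back_mergeable(uint64_t req_first_dun, uint64_t req_bytes,
				  unsigned int data_unit_size,
				  uint64_t bio_first_dun)
{
	assert(data_unit_size > 0);

	/* A request ending mid data unit cannot be continued contiguously. */
	if (req_bytes % data_unit_size)
		return false;

	return bio_first_dun == req_first_dun + req_bytes / data_unit_size;
}
```

Because merged bios always continue the DUN sequence, a driver only needs the DUN of the first bio to derive the DUN of every data unit in the request, which is the point the paragraph makes.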
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 08:37:18 +08:00
|
|
|
if (!bio_crypt_ctx_back_mergeable(req, bio))
|
|
|
|
return 0;
|
2012-09-19 00:19:26 +08:00
|
|
|
if (blk_rq_sectors(req) + bio_sectors(bio) >
|
2016-07-21 11:40:47 +08:00
|
|
|
blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
|
2019-06-06 18:29:01 +08:00
|
|
|
req_set_nomerge(req->q, req);
|
2008-01-29 21:04:06 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
return ll_new_hw_segment(req, bio, nr_segs);
|
2008-01-29 21:04:06 +08:00
|
|
|
}
|
|
|
|
|
2020-10-06 15:07:19 +08:00
|
|
|
static int ll_front_merge_fn(struct request *req, struct bio *bio,
|
|
|
|
unsigned int nr_segs)
|
2008-01-29 21:04:06 +08:00
|
|
|
{
|
2015-09-04 00:28:20 +08:00
|
|
|
if (req_gap_front_merge(req, bio))
|
|
|
|
return 0;
|
2015-09-11 23:03:04 +08:00
|
|
|
if (blk_integrity_rq(req) &&
|
|
|
|
integrity_req_gap_front_merge(req, bio))
|
|
|
|
return 0;
|
2020-05-14 08:37:18 +08:00
|
|
|
if (!bio_crypt_ctx_front_mergeable(req, bio))
|
|
|
|
return 0;
|
2012-09-19 00:19:26 +08:00
|
|
|
if (blk_rq_sectors(req) + bio_sectors(bio) >
|
2016-07-21 11:40:47 +08:00
|
|
|
blk_rq_get_max_sectors(req, bio->bi_iter.bi_sector)) {
|
2019-06-06 18:29:01 +08:00
|
|
|
req_set_nomerge(req->q, req);
|
2008-01-29 21:04:06 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
return ll_new_hw_segment(req, bio, nr_segs);
|
2008-01-29 21:04:06 +08:00
|
|
|
}
|
|
|
|
|
2018-02-02 05:01:02 +08:00
|
|
|
static bool req_attempt_discard_merge(struct request_queue *q, struct request *req,
|
|
|
|
struct request *next)
|
|
|
|
{
|
|
|
|
unsigned short segments = blk_rq_nr_discard_segments(req);
|
|
|
|
|
|
|
|
if (segments >= queue_max_discard_segments(q))
|
|
|
|
goto no_merge;
|
|
|
|
if (blk_rq_sectors(req) + bio_sectors(next->bio) >
|
|
|
|
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
|
|
|
|
goto no_merge;
|
|
|
|
|
|
|
|
req->nr_phys_segments = segments + blk_rq_nr_discard_segments(next);
|
|
|
|
return true;
|
|
|
|
no_merge:
|
|
|
|
req_set_nomerge(q, req);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2008-01-29 21:04:06 +08:00
|
|
|
static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
|
|
|
|
struct request *next)
|
|
|
|
{
|
|
|
|
int total_phys_segments;
|
|
|
|
|
2015-09-04 00:28:20 +08:00
|
|
|
if (req_gap_back_merge(req, next->bio))
|
2015-02-11 23:20:13 +08:00
|
|
|
return 0;
|
|
|
|
|
2008-01-29 21:04:06 +08:00
|
|
|
/*
|
|
|
|
* Will it become too large?
|
|
|
|
*/
|
2012-09-19 00:19:26 +08:00
|
|
|
if ((blk_rq_sectors(req) + blk_rq_sectors(next)) >
|
2016-07-21 11:40:47 +08:00
|
|
|
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
|
2008-01-29 21:04:06 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
|
2020-08-17 17:52:39 +08:00
|
|
|
if (total_phys_segments > blk_rq_get_max_segments(req))
|
2008-01-29 21:04:06 +08:00
|
|
|
return 0;
|
|
|
|
|
2022-03-15 08:30:11 +08:00
|
|
|
if (!blk_cgroup_mergeable(req, next->bio))
|
|
|
|
return 0;
|
|
|
|
|
2014-09-27 07:20:06 +08:00
|
|
|
if (blk_integrity_merge_rq(q, req, next) == false)
|
2010-09-11 02:50:10 +08:00
|
|
|
return 0;
|
|
|
|
|
2020-05-14 08:37:18 +08:00
|
|
|
if (!bio_crypt_ctx_merge_rq(req, next))
|
|
|
|
return 0;
|
|
|
|
|
2008-01-29 21:04:06 +08:00
|
|
|
/* Merge is OK... */
|
|
|
|
req->nr_phys_segments = total_phys_segments;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2009-07-03 16:48:17 +08:00
|
|
|
/**
|
|
|
|
* blk_rq_set_mixed_merge - mark a request as mixed merge
|
|
|
|
* @rq: request to mark as mixed merge
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* @rq is about to be mixed merged. Make sure the attributes
|
|
|
|
* which can be mixed are set in each bio and mark @rq as mixed
|
|
|
|
* merged.
|
|
|
|
*/
|
2024-03-25 16:35:01 +08:00
|
|
|
static void blk_rq_set_mixed_merge(struct request *rq)
|
2009-07-03 16:48:17 +08:00
|
|
|
{
|
2022-07-15 02:06:32 +08:00
|
|
|
blk_opf_t ff = rq->cmd_flags & REQ_FAILFAST_MASK;
|
2009-07-03 16:48:17 +08:00
|
|
|
struct bio *bio;
|
|
|
|
|
2016-10-20 21:12:13 +08:00
|
|
|
if (rq->rq_flags & RQF_MIXED_MERGE)
|
2009-07-03 16:48:17 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* @rq will no longer represent mixable attributes for all the
|
|
|
|
* contained bios. It will just track those of the first one.
|
|
|
|
* Distributes the attributes to each bio.
|
|
|
|
*/
|
|
|
|
for (bio = rq->bio; bio; bio = bio->bi_next) {
|
2016-08-06 05:35:16 +08:00
|
|
|
WARN_ON_ONCE((bio->bi_opf & REQ_FAILFAST_MASK) &&
|
|
|
|
(bio->bi_opf & REQ_FAILFAST_MASK) != ff);
|
|
|
|
bio->bi_opf |= ff;
|
2009-07-03 16:48:17 +08:00
|
|
|
}
|
2016-10-20 21:12:13 +08:00
|
|
|
rq->rq_flags |= RQF_MIXED_MERGE;
|
2009-07-03 16:48:17 +08:00
|
|
|
}
|
|
|
|
|
2023-02-17 10:39:15 +08:00
|
|
|
static inline blk_opf_t bio_failfast(const struct bio *bio)
|
block: sync mixed merged request's failfast with 1st bio's
We support mixed merge for requests/bios with different failfast
settings. When a request fails, each time we only handle the portion
with the same failfast setting, so bios with failfast can be failed
immediately, and bios without failfast can be retried.
The idea is pretty good, but the current implementation has several
defects:
1) initially an RA bio doesn't set failfast, however the bio merge code
doesn't consider this point, and just checks its failfast setting when
deciding if a mixed merge is required. Fix this issue by adding the
helper bio_failfast().
2) when merging a bio to the request front, if this request is mixed
merged, we have to sync the request's failfast setting with the 1st
bio's failfast. Fix it by calling blk_update_mixed_merge().
3) when merging a bio to the request back, if this request is mixed
merged, we have to mark the bio as failfast, because blk_update_request
simply updates the request's failfast with the 1st bio's failfast. Fix
it by calling blk_update_mixed_merge().
Fixes one normal EXT4 READ IO failure issue: it is observed
that the normal READ IO is merged with RA IO, and the mixed merged
request has a different failfast setting from the 1st bio's, so finally
the normal READ IO doesn't get retried.
Cc: Tejun Heo <tj@kernel.org>
Fixes: 80a761fd33cf ("block: implement mixed merge of different failfast requests")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230209125527.667004-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-09 20:55:27 +08:00
|
|
|
{
|
|
|
|
if (bio->bi_opf & REQ_RAHEAD)
|
|
|
|
return REQ_FAILFAST_MASK;
|
|
|
|
|
|
|
|
return bio->bi_opf & REQ_FAILFAST_MASK;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After we are marked as MIXED_MERGE, any new RA bio has to be updated
|
|
|
|
* as failfast, and request's failfast has to be updated in case of
|
|
|
|
* front merge.
|
|
|
|
*/
|
|
|
|
static inline void blk_update_mixed_merge(struct request *req,
|
|
|
|
struct bio *bio, bool front_merge)
|
|
|
|
{
|
|
|
|
if (req->rq_flags & RQF_MIXED_MERGE) {
|
|
|
|
if (bio->bi_opf & REQ_RAHEAD)
|
|
|
|
bio->bi_opf |= REQ_FAILFAST_MASK;
|
|
|
|
|
|
|
|
if (front_merge) {
|
|
|
|
req->cmd_flags &= ~REQ_FAILFAST_MASK;
|
|
|
|
req->cmd_flags |= bio->bi_opf & REQ_FAILFAST_MASK;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
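The two helpers above can be exercised outside the kernel with a small userspace model. This is only a sketch: the `ff_` types and the flag values are hypothetical stand-ins for `struct bio`, `struct request`, `REQ_FAILFAST_MASK`, `REQ_RAHEAD` and `RQF_MIXED_MERGE`, chosen for illustration rather than copied from kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch; the kernel defines its own. */
#define FF_FAILFAST_MASK 0x07u	/* stands in for REQ_FAILFAST_MASK */
#define FF_RAHEAD        0x08u	/* stands in for REQ_RAHEAD */
#define FF_MIXED_MERGE   0x01u	/* stands in for RQF_MIXED_MERGE */

struct ff_bio {
	uint32_t opf;		/* operation and flags */
};

struct ff_req {
	uint32_t cmd_flags;	/* operation and flags of the request */
	uint32_t rq_flags;	/* request state flags */
};

/* A read-ahead bio is treated as if every failfast bit were set. */
uint32_t ff_bio_failfast(const struct ff_bio *bio)
{
	assert(bio);
	if (bio->opf & FF_RAHEAD)
		return FF_FAILFAST_MASK;
	return bio->opf & FF_FAILFAST_MASK;
}

/*
 * Once a request is mixed merged, a newly merged read-ahead bio is
 * marked failfast, and on a front merge the request takes over the
 * failfast bits of the bio that is now first in the request.
 */
void ff_update_mixed_merge(struct ff_req *req, struct ff_bio *bio,
			   bool front_merge)
{
	assert(req && bio);
	if (!(req->rq_flags & FF_MIXED_MERGE))
		return;

	if (bio->opf & FF_RAHEAD)
		bio->opf |= FF_FAILFAST_MASK;

	if (front_merge) {
		req->cmd_flags &= ~FF_FAILFAST_MASK;
		req->cmd_flags |= bio->opf & FF_FAILFAST_MASK;
	}
}
```

In this model a request that is not marked mixed merge is left untouched, which matches the early `rq_flags` check in the listing.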
|
|
|
|
|
2020-05-27 13:24:15 +08:00
|
|
|
static void blk_account_io_merge_request(struct request *req)
|
2009-03-27 17:31:51 +08:00
|
|
|
{
|
|
|
|
if (blk_do_io_stat(req)) {
|
2018-12-07 00:41:18 +08:00
|
|
|
part_stat_lock();
|
2020-05-27 13:24:15 +08:00
|
|
|
part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
|
block: support to account io_ticks precisely
Currently, io_ticks is accounted based on sampling: specifically,
update_io_ticks() will always account io_ticks by 1 jiffy from
bdev_start_io_acct()/blk_account_io_start(), and the result can be
inaccurate, for example (HZ is 250):
Test script:
fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
Test result: util is about 90%, while the disk is really idle.
This behaviour was introduced by commit 5b18b5a73760 ("block: delete
part_round_stats and switch to less precise counting"). However, a key
point was missed: that commit also improved performance
a lot:
Before the commit:
part_round_stats:
if (part->stamp != now)
stats |= 1;
part_in_flight()
-> there can be lots of task here in 1 jiffies.
part_round_stats_single()
__part_stat_add()
part->stamp = now;
After the commit:
update_io_ticks:
stamp = part->bd_stamp;
if (time_after(now, stamp))
if (try_cmpxchg())
__part_stat_add()
-> only one task can reach here in 1 jiffies.
Hence in order to account io_ticks precisely, we only need to know if
there is IO inflight at most once per jiffy. Note that for
rq-based devices, iterating tags should not be used here because
'tags->lock' is grabbed in blk_mq_find_and_get_req(), hence
part_stat_local_inc/dec() and part_in_flight() are used to track inflight IO.
The additional overhead is quite small:
- per cpu add/dec for each IO for rq-based devices;
- per cpu sum for each jiffy;
And it's verified by null-blk that there is no performance degradation
under heavy IO pressure.
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 20:37:16 +08:00
|
|
|
part_stat_local_dec(req->part,
|
|
|
|
in_flight[op_is_write(req_op(req))]);
|
2009-03-27 17:31:51 +08:00
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
}
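The "only one task can reach here in 1 jiffies" step from the io_ticks commit message can be sketched with C11 atomics. This is a simplified userspace model, not the kernel's update_io_ticks(): `tk_part` and `tk_update_io_ticks` are hypothetical names, the tick counter is a single atomic rather than a per-cpu statistic, and the winner always adds exactly one tick.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-partition state used by the sketch. */
struct tk_part {
	_Atomic unsigned long stamp;	/* last jiffy that was accounted */
	_Atomic unsigned long io_ticks;	/* accumulated busy ticks */
};

/*
 * Returns true if the caller was the one task that accounted this jiffy.
 * Concurrent callers with the same 'now' race on the compare-exchange;
 * at most one of them wins and updates io_ticks.
 */
bool tk_update_io_ticks(struct tk_part *part, unsigned long now)
{
	unsigned long stamp;

	assert(part);
	stamp = atomic_load(&part->stamp);

	/* Wrap-safe check that 'now' is strictly after 'stamp'. */
	if ((long)(now - stamp) <= 0)
		return false;

	/* Only the task that moves the stamp forward does the accounting. */
	if (!atomic_compare_exchange_strong(&part->stamp, &stamp, now))
		return false;

	atomic_fetch_add(&part->io_ticks, 1UL);
	return true;
}
```

Calling the function twice with the same `now` accounts the jiffy only once, which is the property the commit message relies on to keep the extra in-flight check cheap.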
|
2020-05-27 13:24:15 +08:00
|
|
|
|
2018-11-15 09:19:46 +08:00
|
|
|
static enum elv_merge blk_try_req_merge(struct request *req,
|
|
|
|
struct request *next)
|
2018-10-27 19:52:14 +08:00
|
|
|
{
|
|
|
|
if (blk_discard_mergable(req))
|
|
|
|
return ELEVATOR_DISCARD_MERGE;
|
|
|
|
else if (blk_rq_pos(req) + blk_rq_sectors(req) == blk_rq_pos(next))
|
|
|
|
return ELEVATOR_BACK_MERGE;
|
|
|
|
|
|
|
|
return ELEVATOR_NO_MERGE;
|
|
|
|
}
|
2009-03-27 17:31:51 +08:00
|
|
|
|
block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary; an atomic write which straddles it is
no longer atomically executed by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic write limits are by default set to 0 to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 20:53:54 +08:00
|
|
|
static bool blk_atomic_write_mergeable_rq_bio(struct request *rq,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool blk_atomic_write_mergeable_rqs(struct request *rq,
|
|
|
|
struct request *next)
|
|
|
|
{
|
|
|
|
return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
|
|
|
|
}
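The two helpers above only compare the REQ_ATOMIC flag. The commit message for atomic write support also lists two size conditions for a merge: the resulting length must not exceed atomic_write_max_bytes, and the merged write must not straddle a boundary. A minimal sketch of those two conditions follows, assuming byte units throughout; `aw_merge_ok` and its parameters are hypothetical and do not correspond to a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * start:     byte offset of the first write
 * len_a/b:   byte lengths of the two contiguous writes being merged
 * max_bytes: limit on a merged atomic write (atomic_write_max_bytes)
 * boundary:  0 for "no boundary", otherwise a power-of-two byte size
 */
bool aw_merge_ok(uint64_t start, uint64_t len_a, uint64_t len_b,
		 uint64_t max_bytes, uint64_t boundary)
{
	uint64_t total = len_a + len_b;

	/* The commit message requires the boundary to be a power of two. */
	assert(boundary == 0 || (boundary & (boundary - 1)) == 0);

	/* An empty merge is rejected here so that 'total - 1' is safe. */
	if (total == 0 || total > max_bytes)
		return false;

	if (boundary == 0)
		return true;

	/* First and last byte must fall in the same boundary-sized window. */
	return (start / boundary) == ((start + total - 1) / boundary);
}
```

For example, two 4096-byte writes starting at offset 0 fit inside one 8192-byte window, while the same pair starting at offset 4096 would cross the 8192-byte boundary and is rejected.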
|
|
|
|
|
2008-01-29 21:04:06 +08:00
|
|
|
/*
|
2017-02-02 23:54:40 +08:00
|
|
|
* For non-mq, this has to be called with the request spinlock acquired.
|
|
|
|
* For mq with scheduling, the appropriate queue wide lock should be held.
|
2008-01-29 21:04:06 +08:00
|
|
|
*/
|
2017-02-02 23:54:40 +08:00
|
|
|
static struct request *attempt_merge(struct request_queue *q,
|
|
|
|
struct request *req, struct request *next)
|
2008-01-29 21:04:06 +08:00
|
|
|
{
|
|
|
|
if (!rq_mergeable(req) || !rq_mergeable(next))
|
2017-02-02 23:54:40 +08:00
|
|
|
return NULL;
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2016-06-09 22:00:36 +08:00
|
|
|
if (req_op(req) != req_op(next))
|
2017-02-02 23:54:40 +08:00
|
|
|
return NULL;
|
2012-09-19 00:19:26 +08:00
|
|
|
|
2021-11-26 20:17:59 +08:00
|
|
|
if (rq_data_dir(req) != rq_data_dir(next))
|
2017-02-02 23:54:40 +08:00
|
|
|
return NULL;
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2024-02-03 04:39:25 +08:00
|
|
|
/* Don't merge requests with different write hints. */
|
|
|
|
if (req->write_hint != next->write_hint)
|
|
|
|
return NULL;
|
|
|
|
|
2018-11-20 09:52:37 +08:00
|
|
|
if (req->ioprio != next->ioprio)
|
|
|
|
return NULL;
|
|
|
|
|
2024-06-20 20:53:54 +08:00
|
|
|
if (!blk_atomic_write_mergeable_rqs(req, next))
|
|
|
|
return NULL;
|
|
|
|
|
2008-01-29 21:04:06 +08:00
|
|
|
/*
|
|
|
|
* If we are allowed to merge, then append bio list
|
|
|
|
* from next to rq and release next. merge_requests_fn
|
|
|
|
* will have updated segment counts, update sector
|
2018-02-02 05:01:02 +08:00
|
|
|
* counts here. Handle DISCARDs separately, as they
|
|
|
|
* have separate settings.
|
2008-01-29 21:04:06 +08:00
|
|
|
*/
|
2018-10-27 19:52:14 +08:00
|
|
|
|
|
|
|
switch (blk_try_req_merge(req, next)) {
|
|
|
|
case ELEVATOR_DISCARD_MERGE:
|
2018-02-02 05:01:02 +08:00
|
|
|
if (!req_attempt_discard_merge(q, req, next))
|
|
|
|
return NULL;
|
2018-10-27 19:52:14 +08:00
|
|
|
break;
|
|
|
|
case ELEVATOR_BACK_MERGE:
|
|
|
|
if (!ll_merge_requests_fn(q, req, next))
|
|
|
|
return NULL;
|
|
|
|
break;
|
|
|
|
default:
|
2017-02-02 23:54:40 +08:00
|
|
|
return NULL;
|
2018-10-27 19:52:14 +08:00
|
|
|
}
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2009-07-03 16:48:17 +08:00
|
|
|
/*
|
|
|
|
* If failfast settings disagree or any of the two is already
|
|
|
|
* a mixed merge, mark both as mixed before proceeding. This
|
|
|
|
* makes sure that all involved bios have mixable attributes
|
|
|
|
* set properly.
|
|
|
|
*/
|
2016-10-20 21:12:13 +08:00
|
|
|
if (((req->rq_flags | next->rq_flags) & RQF_MIXED_MERGE) ||
|
2009-07-03 16:48:17 +08:00
|
|
|
(req->cmd_flags & REQ_FAILFAST_MASK) !=
|
|
|
|
(next->cmd_flags & REQ_FAILFAST_MASK)) {
|
|
|
|
blk_rq_set_mixed_merge(req);
|
|
|
|
blk_rq_set_mixed_merge(next);
|
|
|
|
}
|
|
|
|
|
2008-01-29 21:04:06 +08:00
|
|
|
/*
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
|
|
|
* At this point we have either done a back merge or front merge. We
|
|
|
|
* need the smaller start_time_ns of the merged requests to be the
|
|
|
|
* current request for accounting purposes.
|
2008-01-29 21:04:06 +08:00
|
|
|
*/
|
2018-05-09 17:08:53 +08:00
|
|
|
if (next->start_time_ns < req->start_time_ns)
|
|
|
|
req->start_time_ns = next->start_time_ns;
|
2008-01-29 21:04:06 +08:00
|
|
|
|
|
|
|
req->biotail->bi_next = next->bio;
|
|
|
|
req->biotail = next->biotail;
|
|
|
|
|
2009-05-07 21:24:44 +08:00
|
|
|
req->__data_len += blk_rq_bytes(next);
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2018-12-01 00:38:18 +08:00
|
|
|
if (!blk_discard_mergable(req))
|
2018-02-02 05:01:02 +08:00
|
|
|
elv_merge_requests(q, req, next);
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2023-03-16 02:39:02 +08:00
|
|
|
blk_crypto_rq_put_keyslot(next);
|
|
|
|
|
2009-04-22 20:01:49 +08:00
|
|
|
/*
|
|
|
|
* 'next' is going away, so update stats accordingly
|
|
|
|
*/
|
2020-05-27 13:24:15 +08:00
|
|
|
blk_account_io_merge_request(next);
|
2008-01-29 21:04:06 +08:00
|
|
|
|
2020-12-04 00:21:39 +08:00
|
|
|
trace_block_rq_merge(next);
|
2020-06-17 21:58:23 +08:00
|
|
|
|
2017-02-04 00:48:28 +08:00
|
|
|
/*
|
|
|
|
* ownership of bio passed from next to req, return 'next' for
|
|
|
|
* the caller to free
|
|
|
|
*/
|
2009-03-24 19:35:07 +08:00
|
|
|
next->bio = NULL;
|
2017-02-02 23:54:40 +08:00
|
|
|
return next;
|
2008-01-29 21:04:06 +08:00
|
|
|
}
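The tail of attempt_merge() is a list splice plus some bookkeeping: keep the earlier start time, append the bio chain of 'next' to 'req', add the byte counts, and clear next->bio so the caller can free 'next'. A userspace sketch of just that step is shown here; the `sp_` structures are hypothetical, reduced versions of `struct bio` and `struct request` that carry only the fields the splice touches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sp_bio {
	struct sp_bio *next;	/* next bio in the request's chain */
	unsigned int bytes;
};

struct sp_req {
	struct sp_bio *bio;	/* head of the bio chain */
	struct sp_bio *biotail;	/* last bio in the chain */
	unsigned int data_len;	/* total bytes in the request */
	uint64_t start_time_ns;
};

/* Append next's bios to req and hand ownership of them to req. */
void sp_splice(struct sp_req *req, struct sp_req *next)
{
	assert(req->biotail && next->bio && next->biotail);

	/* The merged request is accounted from the earlier start time. */
	if (next->start_time_ns < req->start_time_ns)
		req->start_time_ns = next->start_time_ns;

	req->biotail->next = next->bio;
	req->biotail = next->biotail;
	req->data_len += next->data_len;

	/* 'next' no longer owns any bios; the caller may free it. */
	next->bio = NULL;
}

/* Helper for inspection: number of bios chained on a request. */
unsigned int sp_count(const struct sp_req *req)
{
	unsigned int n = 0;

	for (const struct sp_bio *b = req->bio; b; b = b->next)
		n++;
	return n;
}
```

Keeping a tail pointer is what makes the append constant time, regardless of how many bios either request already holds.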
|
|
|
|
|
2020-10-06 15:07:19 +08:00
|
|
|
static struct request *attempt_back_merge(struct request_queue *q,
|
|
|
|
struct request *rq)
|
2008-01-29 21:04:06 +08:00
|
|
|
{
|
|
|
|
struct request *next = elv_latter_request(q, rq);
|
|
|
|
|
|
|
|
if (next)
|
|
|
|
return attempt_merge(q, rq, next);
|
|
|
|
|
2017-02-02 23:54:40 +08:00
|
|
|
return NULL;
|
2008-01-29 21:04:06 +08:00
|
|
|
}
|
|
|
|
|
2020-10-06 15:07:19 +08:00
|
|
|
static struct request *attempt_front_merge(struct request_queue *q,
|
|
|
|
struct request *rq)
|
2008-01-29 21:04:06 +08:00
|
|
|
{
|
|
|
|
struct request *prev = elv_former_request(q, rq);
|
|
|
|
|
|
|
|
if (prev)
|
|
|
|
return attempt_merge(q, prev, rq);
|
|
|
|
|
2017-02-02 23:54:40 +08:00
|
|
|
return NULL;
|
2008-01-29 21:04:06 +08:00
|
|
|
}
|
2011-03-21 17:14:27 +08:00
|
|
|
|
2021-06-23 17:36:34 +08:00
|
|
|
/*
|
|
|
|
* Try to merge 'next' into 'rq'. Return true if the merge happened, false
|
|
|
|
* otherwise. The caller is responsible for freeing 'next' if the merge
|
|
|
|
* happened.
|
|
|
|
*/
|
|
|
|
bool blk_attempt_req_merge(struct request_queue *q, struct request *rq,
|
|
|
|
struct request *next)
|
2011-03-21 17:14:27 +08:00
|
|
|
{
|
2021-06-23 17:36:34 +08:00
|
|
|
return attempt_merge(q, rq, next);
|
2011-03-21 17:14:27 +08:00
|
|
|
}
|
2012-02-08 16:19:38 +08:00
|
|
|
|
|
|
|
bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
|
|
|
|
{
|
2012-09-19 00:19:25 +08:00
|
|
|
if (!rq_mergeable(rq) || !bio_mergeable(bio))
|
2012-02-08 16:19:38 +08:00
|
|
|
return false;
|
|
|
|
|
2016-06-09 22:00:36 +08:00
|
|
|
if (req_op(rq) != bio_op(bio))
|
2012-09-19 00:19:26 +08:00
|
|
|
return false;
|
|
|
|
|
2012-02-08 16:19:38 +08:00
|
|
|
/* different data direction or already started, don't merge */
|
|
|
|
if (bio_data_dir(bio) != rq_data_dir(rq))
|
|
|
|
return false;
|
|
|
|
|
2022-03-15 08:30:11 +08:00
|
|
|
/* don't merge across cgroup boundaries */
|
|
|
|
if (!blk_cgroup_mergeable(rq, bio))
|
|
|
|
return false;
|
|
|
|
|
2012-02-08 16:19:38 +08:00
|
|
|
/* only merge integrity protected bio into ditto rq */
|
2014-09-27 07:20:06 +08:00
|
|
|
if (blk_integrity_merge_bio(rq->q, rq, bio) == false)
|
2012-02-08 16:19:38 +08:00
|
|
|
return false;
|
|
|
|
|
2020-05-14 08:37:18 +08:00
|
|
|
/* Only merge if the crypt contexts are compatible */
|
|
|
|
if (!bio_crypt_rq_ctx_compatible(rq, bio))
|
|
|
|
return false;
|
|
|
|
|
2024-02-03 04:39:25 +08:00
|
|
|
/* Don't merge requests with different write hints. */
|
|
|
|
if (rq->write_hint != bio->bi_write_hint)
|
|
|
|
return false;
|
|
|
|
|
2018-11-20 09:52:37 +08:00
|
|
|
if (rq->ioprio != bio_prio(bio))
|
|
|
|
return false;
|
|
|
|
|
2024-06-20 20:53:54 +08:00
	if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
		return false;

	return true;
}

enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
{
	if (blk_discard_mergable(rq))
		return ELEVATOR_DISCARD_MERGE;
	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
		return ELEVATOR_BACK_MERGE;
	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
		return ELEVATOR_FRONT_MERGE;
	return ELEVATOR_NO_MERGE;
}

static void blk_account_io_merge_bio(struct request *req)
{
	if (!blk_do_io_stat(req))
		return;

	part_stat_lock();
	part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
	part_stat_unlock();
}

enum bio_merge_status bio_attempt_back_merge(struct request *req,
		struct bio *bio, unsigned int nr_segs)
{
block: sync mixed merged request's failfast with 1st bio's
We support mixed merge for requests/bios with different failfast
settings. When a request fails, each time we only handle the portion
with the same failfast setting, so bios with failfast can be failed
immediately, and bios without failfast can be retried.
The idea is pretty good, but the current implementation has several
defects:
1) initially an RA bio doesn't set failfast, however the bio merge code
doesn't consider this point, and just checks its failfast setting when
deciding if mixed merge is required. Fix this issue by adding the helper
bio_failfast().
2) when merging a bio to the request front, if this request is mixed
merged, we have to sync the request's failfast setting with the 1st bio's
failfast. Fix it by calling blk_update_mixed_merge().
3) when merging a bio to the request back, if this request is mixed
merged, we have to mark the bio as failfast, because blk_update_request
simply updates the request failfast with the 1st bio's failfast. Fix
it by calling blk_update_mixed_merge().
Fixes one normal EXT4 READ IO failure issue, because it is observed
that the normal READ IO is merged with RA IO, and the mixed merged
request has a different failfast setting from the 1st bio's, so finally
the normal READ IO doesn't get retried.
Cc: Tejun Heo <tj@kernel.org>
Fixes: 80a761fd33cf ("block: implement mixed merge of different failfast requests")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230209125527.667004-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-09 20:55:27 +08:00
	const blk_opf_t ff = bio_failfast(bio);

	if (!ll_back_merge_fn(req, bio, nr_segs))
		return BIO_MERGE_FAILED;

	trace_block_bio_backmerge(bio);
	rq_qos_merge(req->q, req, bio);

	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
		blk_rq_set_mixed_merge(req);

	blk_update_mixed_merge(req, bio, false);

block: Introduce zone write plugging
Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most only one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.
Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.
This mechanism makes it possible to:
- Untangle zone write ordering from block IO schedulers. This allows
removing the restriction on using mq-deadline for writing to zoned
block devices. Any block IO scheduler, including "none" can be used.
- Zone write plugging operates on BIOs instead of requests. Plugged
BIOs waiting for execution thus do not hold scheduling tags and thus
are not preventing other BIOs from executing (reads or writes to
other zones). Depending on the workload, this can significantly
improve the device use (higher queue depth operation) and
performance.
- Both blk-mq (request based) zoned devices and BIO-based zoned devices
(e.g. device mapper) can use zone write plugging. It is mandatory
for the former but optional for the latter. BIO-based drivers can
use zone write plugging to implement write ordering guarantees, or
the drivers can implement their own if needed.
- The code is less invasive in the block layer and is mostly limited to
blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
bio.c.
Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plugs structures are
managed using a per-disk hash table.
Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio() which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugs. This change enables by default zone write plugging for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by explicitly calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and not
straddling zone boundaries.
Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of flagged BIOs and requests respectively triggers calls
to the functions blk_zone_write_bio_endio() and
blk_zone_write_complete_request(). The latter function is used to
trigger submission of the next plugged BIO using the zone plug work.
blk_zone_write_bio_endio() does the same for BIO-based devices.
This ensures that at any time, at most one request (blk-mq devices) or
one BIO (BIO-based devices) is being executed for any zone. The
handling of zone write plugs using a per-zone plug spinlock maximizes
parallelism and device usage by allowing multiple zones to be written
simultaneously without lock contention.
Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.
Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bio_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.
When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.
Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
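The write pointer offset tracking described above can be sketched as a tiny state model. All names below are hypothetical, and the model is deliberately simplified: it ignores plugged or in-flight BIOs, error handling and locking, all of which the real struct blk_zone_wplug has to deal with.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a zone write plug. */
struct sketch_zone_wplug {
	unsigned int zone_size;	/* zone size in sectors */
	unsigned int wp_offset;	/* write pointer offset from the zone start */
};

/* Write operations always increment the write pointer offset. */
static void sketch_zone_write(struct sketch_zone_wplug *zwplug,
			      unsigned int nr_sectors)
{
	assert(nr_sectors <= zwplug->zone_size - zwplug->wp_offset);
	zwplug->wp_offset += nr_sectors;
}

/* Zone reset operations set the offset to 0. */
static void sketch_zone_reset(struct sketch_zone_wplug *zwplug)
{
	zwplug->wp_offset = 0;
}

/* Zone finish operations set the offset to the zone size. */
static void sketch_zone_finish(struct sketch_zone_wplug *zwplug)
{
	zwplug->wp_offset = zwplug->zone_size;
}

/*
 * Simplification: the plug is considered freeable once the zone is empty
 * (reset) or full (fully written or finished).
 */
static bool sketch_zone_wplug_can_free(const struct sketch_zone_wplug *zwplug)
{
	return zwplug->wp_offset == 0 ||
	       zwplug->wp_offset == zwplug->zone_size;
}
```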
If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plugs work keeps running
until all zone errors are handled.
To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plugs resources when a gendisk
is allocated.
In order to guarantee that the user can simultaneously write up to a
number of zones equal to a device max active zone limit or max open zone
limit, zone write plugs are allocated using a mempool sized to the
maximum of these 2 device limits. For a device that does not have
active and open zone limits, 128 is used as the default mempool size.
If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.
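The mempool sizing rule above amounts to a small computation, sketched here in userspace C. The function and macro names are hypothetical, and representing "no limit" as 0 is an assumption of the sketch.

```c
#include <assert.h>

/* Default number of zone write plugs when the device reports no limits. */
#define SKETCH_DEFAULT_POOL_SIZE	128u

static unsigned int sketch_wplug_pool_size(unsigned int max_active_zones,
					   unsigned int max_open_zones)
{
	/* Size the pool to the maximum of the two device limits. */
	unsigned int size = max_active_zones > max_open_zones ?
			    max_active_zones : max_open_zones;

	/* Assumption: a limit of 0 means the device reports no such limit. */
	if (size == 0)
		size = SKETCH_DEFAULT_POOL_SIZE;

	assert(size > 0);
	return size;
}
```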
This commit contains contributions from Christoph Hellwig <hch@lst.de>.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-08 09:41:07 +08:00
	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
		blk_zone_write_plug_bio_merged(bio);

	req->biotail->bi_next = bio;
	req->biotail = bio;
	req->__data_len += bio->bi_iter.bi_size;

	bio_crypt_free_ctx(bio);

	blk_account_io_merge_bio(req);
	return BIO_MERGE_OK;
}

static enum bio_merge_status bio_attempt_front_merge(struct request *req,
		struct bio *bio, unsigned int nr_segs)
{
	const blk_opf_t ff = bio_failfast(bio);

	/*
	 * A front merge for writes to sequential zones of a zoned block device
	 * can happen only if the user submitted writes out of order. Do not
	 * merge such write to let it fail.
	 */
	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
		return BIO_MERGE_FAILED;

	if (!ll_front_merge_fn(req, bio, nr_segs))
		return BIO_MERGE_FAILED;

	trace_block_bio_frontmerge(bio);
	rq_qos_merge(req->q, req, bio);

	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
		blk_rq_set_mixed_merge(req);

	blk_update_mixed_merge(req, bio, true);

	bio->bi_next = req->bio;
	req->bio = bio;

	req->__sector = bio->bi_iter.bi_sector;
	req->__data_len += bio->bi_iter.bi_size;

	bio_crypt_do_front_merge(req, bio);

	blk_account_io_merge_bio(req);
	return BIO_MERGE_OK;
}

static enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
		struct request *req, struct bio *bio)
{
	unsigned short segments = blk_rq_nr_discard_segments(req);

	if (segments >= queue_max_discard_segments(q))
		goto no_merge;
	if (blk_rq_sectors(req) + bio_sectors(bio) >
	    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
		goto no_merge;

	rq_qos_merge(q, req, bio);

	req->biotail->bi_next = bio;
	req->biotail = bio;
	req->__data_len += bio->bi_iter.bi_size;
	req->nr_phys_segments = segments + 1;

	blk_account_io_merge_bio(req);
	return BIO_MERGE_OK;
no_merge:
	req_set_nomerge(q, req);
	return BIO_MERGE_FAILED;
}

static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
						   struct request *rq,
						   struct bio *bio,
						   unsigned int nr_segs,
						   bool sched_allow_merge)
{
	if (!blk_rq_merge_ok(rq, bio))
		return BIO_MERGE_NONE;

	switch (blk_try_merge(rq, bio)) {
	case ELEVATOR_BACK_MERGE:
		if (!sched_allow_merge || blk_mq_sched_allow_merge(q, rq, bio))
			return bio_attempt_back_merge(rq, bio, nr_segs);
		break;
	case ELEVATOR_FRONT_MERGE:
		if (!sched_allow_merge || blk_mq_sched_allow_merge(q, rq, bio))
			return bio_attempt_front_merge(rq, bio, nr_segs);
		break;
	case ELEVATOR_DISCARD_MERGE:
		return bio_attempt_discard_merge(q, rq, bio);
	default:
		return BIO_MERGE_NONE;
	}

	return BIO_MERGE_FAILED;
}

/**
 * blk_attempt_plug_merge - try to merge with %current's plugged list
 * @q: request_queue new bio is being queued at
 * @bio: new bio being queued
 * @nr_segs: number of segments in @bio
 * from the passed in @q already in the plug list
 *
block: only check previous entry for plug merge attempt
Currently we scan the entire plug list, which is potentially very
expensive. In an IOPS bound workload, we can drive about 5.6M IOPS with
merging enabled, and profiling shows that the plug merge check is the
(by far) most expensive thing we're doing:
Overhead Command Shared Object Symbol
+ 20.89% io_uring [kernel.vmlinux] [k] blk_attempt_plug_merge
+ 4.98% io_uring [kernel.vmlinux] [k] io_submit_sqes
+ 4.78% io_uring [kernel.vmlinux] [k] blkdev_direct_IO
+ 4.61% io_uring [kernel.vmlinux] [k] blk_mq_submit_bio
Instead of browsing the whole list, just check the previously inserted
entry. That is enough for a naive merge check and will catch most cases,
and for devices that need full merging, the IO scheduler attached to
such devices will do that anyway. The plug merge is meant to be an
inexpensive check to avoid getting a request, but if we repeatedly
scan the list for every single insert, it is very much not a cheap
check.
With this patch, the workload instead runs at ~7.0M IOPS, providing
a 25% improvement. Disabling merging entirely yields another 5%
improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
 * Determine whether @bio being queued on @q can be merged with the previous
 * request on %current's plugged list. Returns %true if the merge was
 * successful, otherwise %false.
 *
 * Plugging coalesces IOs from the same issuer for the same purpose without
 * going through @q->queue_lock. As such it's more of an issuing mechanism
 * than scheduling, and the request, while it may have elvpriv data, is not
 * added to the elevator at this point. In addition, we don't have
 * reliable access to the elevator outside queue lock. Only check basic
 * merging parameters without querying the elevator.
 *
 * Caller must ensure !blk_queue_nomerges(q) beforehand.
 */
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
		unsigned int nr_segs)
{
	struct blk_plug *plug = current->plug;
	struct request *rq;

	if (!plug || rq_list_empty(plug->mq_list))
		return false;

	rq_list_for_each(&plug->mq_list, rq) {
		if (rq->q == q) {
			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
			    BIO_MERGE_OK)
				return true;
			break;
		}

		/*
		 * Only keep iterating plug list for merges if we have multiple
		 * queues
		 */
		if (!plug->multiple_queues)
			break;
	}
	return false;
}
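
The plug-list check above can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel implementation: `toy_request`, `toy_plug`, `toy_try_back_merge` and `toy_plug_merge` are hypothetical names, the list is assumed to hold the newest request at its head, and only back merges of contiguous sectors are modelled. It shows the two early exits: the scan stops at the first request that belongs to the target queue, and it stops after the first entry when the plug only holds requests for a single queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct request / struct blk_plug. */
struct toy_request {
	int queue_id;			/* queue the request belongs to */
	unsigned long long sector;	/* first sector covered */
	unsigned int nr_sectors;	/* length in sectors */
	struct toy_request *next;	/* next (older) entry in the plug list */
};

struct toy_plug {
	struct toy_request *head;	/* most recently added request */
	bool multiple_queues;		/* list holds requests for more than one queue */
};

/* Back merge only: the new I/O must start exactly where rq ends. */
static bool toy_try_back_merge(struct toy_request *rq,
			       unsigned long long sector,
			       unsigned int nr_sectors)
{
	assert(nr_sectors > 0);
	if (rq->sector + rq->nr_sectors != sector)
		return false;
	rq->nr_sectors += nr_sectors;
	return true;
}

static bool toy_plug_merge(struct toy_plug *plug, int queue_id,
			   unsigned long long sector, unsigned int nr_sectors)
{
	struct toy_request *rq;

	if (!plug || !plug->head)
		return false;

	for (rq = plug->head; rq; rq = rq->next) {
		/* Only the newest request for this queue is ever tried. */
		if (rq->queue_id == queue_id)
			return toy_try_back_merge(rq, sector, nr_sectors);

		/* A single-queue plug never needs a second look. */
		if (!plug->multiple_queues)
			break;
	}
	return false;
}
```

In this sketch the cost per submission is bounded by the number of entries in front of the newest request for the same queue, which is one entry in the common single-queue case.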

/*
 * Iterate list of requests and see if we can merge this bio with any
 * of them.
 */
bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
			struct bio *bio, unsigned int nr_segs)
{
	struct request *rq;
	int checked = 8;

	list_for_each_entry_reverse(rq, list, queuelist) {
		if (!checked--)
			break;

		switch (blk_attempt_bio_merge(q, rq, bio, nr_segs, true)) {
		case BIO_MERGE_NONE:
			continue;
		case BIO_MERGE_OK:
			return true;
		case BIO_MERGE_FAILED:
			return false;
		}

	}

	return false;
}
EXPORT_SYMBOL_GPL(blk_bio_list_merge);
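
The bounded reverse scan in blk_bio_list_merge() can be sketched in user space as follows. This is an illustration under assumptions, not kernel code: `toy_extent`, `toy_try_merge` and `toy_list_merge` are hypothetical, an array stands in for the linked list, and the per-entry size limit `max_sectors` is invented to give the "merge candidate found but merge failed" outcome something to trigger on. The budget of 8 entries and the three-way result mirror the function above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical three-way result, modelled on BIO_MERGE_{NONE,OK,FAILED}. */
enum toy_merge_res { TOY_MERGE_NONE, TOY_MERGE_OK, TOY_MERGE_FAILED };

struct toy_extent {
	unsigned long long sector;	/* first sector covered */
	unsigned int nr_sectors;	/* current length in sectors */
	unsigned int max_sectors;	/* assumed per-entry size limit */
};

static enum toy_merge_res toy_try_merge(struct toy_extent *e,
					unsigned long long sector,
					unsigned int nr_sectors)
{
	if (e->sector + e->nr_sectors != sector)
		return TOY_MERGE_NONE;		/* not adjacent, keep looking */
	if (e->nr_sectors + nr_sectors > e->max_sectors)
		return TOY_MERGE_FAILED;	/* adjacent, but would grow too large */
	e->nr_sectors += nr_sectors;
	return TOY_MERGE_OK;
}

/* Scan at most 8 entries, newest (highest index) first. */
static bool toy_list_merge(struct toy_extent *list, size_t count,
			   unsigned long long sector, unsigned int nr_sectors)
{
	int checked = 8;
	size_t i;

	assert(list != NULL || count == 0);

	for (i = count; i-- > 0; ) {
		if (!checked--)
			break;

		switch (toy_try_merge(&list[i], sector, nr_sectors)) {
		case TOY_MERGE_NONE:
			continue;
		case TOY_MERGE_OK:
			return true;
		case TOY_MERGE_FAILED:
			return false;
		}
	}
	return false;
}
```

Stopping on a failed merge, rather than continuing, reflects the idea that once the adjacent entry has been found no other entry can be a better candidate.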

bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
		unsigned int nr_segs, struct request **merged_request)
{
	struct request *rq;

	switch (elv_merge(q, &rq, bio)) {
	case ELEVATOR_BACK_MERGE:
		if (!blk_mq_sched_allow_merge(q, rq, bio))
			return false;
		if (bio_attempt_back_merge(rq, bio, nr_segs) != BIO_MERGE_OK)
			return false;
		*merged_request = attempt_back_merge(q, rq);
		if (!*merged_request)
			elv_merged_request(q, rq, ELEVATOR_BACK_MERGE);
		return true;
	case ELEVATOR_FRONT_MERGE:
		if (!blk_mq_sched_allow_merge(q, rq, bio))
			return false;
		if (bio_attempt_front_merge(rq, bio, nr_segs) != BIO_MERGE_OK)
			return false;
		*merged_request = attempt_front_merge(q, rq);
		if (!*merged_request)
			elv_merged_request(q, rq, ELEVATOR_FRONT_MERGE);
		return true;
	case ELEVATOR_DISCARD_MERGE:
		return bio_attempt_discard_merge(q, rq, bio) == BIO_MERGE_OK;
	default:
		return false;
	}
}
EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
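
The back-merge branch above does two things: it grows a request by the new bio, and it then checks whether the grown request has become contiguous with its neighbour so that the two requests can be folded into one. The sketch below models only that idea in user space. `toy_rq` and `toy_back_merge` are hypothetical, requests are assumed to be kept in a sector-sorted singly linked list, and the absorbed request is handed back through `merged`, on the assumption that the caller is responsible for releasing it, much as `merged_request` is returned to the caller above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request in a sector-sorted singly linked list. */
struct toy_rq {
	unsigned long long sector;	/* first sector covered */
	unsigned int nr_sectors;	/* length in sectors */
	struct toy_rq *next;		/* next request by start sector */
};

/*
 * Append [sector, sector + nr_sectors) to rq if it is contiguous.
 * If rq then ends exactly where rq->next begins, absorb that request
 * too and return it through *merged so the caller can release it.
 */
static bool toy_back_merge(struct toy_rq *rq, unsigned long long sector,
			   unsigned int nr_sectors, struct toy_rq **merged)
{
	assert(rq != NULL && merged != NULL);

	*merged = NULL;
	if (rq->sector + rq->nr_sectors != sector)
		return false;

	rq->nr_sectors += nr_sectors;

	if (rq->next && rq->sector + rq->nr_sectors == rq->next->sector) {
		struct toy_rq *victim = rq->next;

		rq->nr_sectors += victim->nr_sectors;
		rq->next = victim->next;
		victim->next = NULL;
		*merged = victim;
	}
	return true;
}
```

When `*merged` is left NULL the request only grew in place, which corresponds to the case where the kernel code notifies the elevator via elv_merged_request() instead of returning a request to free.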