We only need the number of segments in the blk-mq submission path.
Remove the field from struct bio, and return it from a variant of
blk_queue_split instead of that it can passed as an argument to
those functions that need the value.
This also means we stop recounting segments except for cloning
and partial segments.
To keep the number of arguments in this how path down remove
pointless struct request_queue arguments from any of the functions
that had it and grew a nr_segs argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stopping external metadata arrays during resync/recovery causes
retries, loop of interrupting and starting reconstruction, until it
hit at good moment to stop completely. While these retries
curr_mark_cnt can be small- especially on HDD drives, so subtraction
result can be smaller than 0. However it is casted to uint without
checking. As a result of it the status bar in /proc/mdstat while stopping
is strange (it jumps between 0% and 99%).
The real problem occurs here after commit 72deb455b5 ("block: remove
CONFIG_LBDAF"). Sector_div() macro has been changed, now the
divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.
Check if db value can be really counted and replace these macro by
div64_u64() inline.
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Andy reported that raid10 array with SSD disks has poor
read performance. Compared with raid1, RAID-1 can be 3x
faster than RAID-10 sometimes [1].
The thing is that raid10 chooses the low distance disk
for read request, however, the approach doesn't work
well for SSD device since it doesn't have spindle like
HDD, we should just read from the SSD which has less
pending IO like commit 9dedf60313 ("md/raid1: read
balance chooses idlest disk for SSD").
So this commit selects the idlest SSD disk for read if
array has none rotational disk, otherwise, read_balance
uses the previous distance priority algorithm. With the
change, the performance of raid10 gets increased largely
per Andy's test [2].
[1]. https://marc.info/?l=linux-raid&m=155915890004761&w=2
[2]. https://marc.info/?l=linux-raid&m=155990654223786&w=2
Tested-by: Andy Smith <andy@strugglers.net>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoiding duplicated code, since they just execute a kfree.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
instance = kmalloc(size, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch get rid of extra blank line and space, and
add necessary space for code.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch fix a spelling typo and add necessary space for code.
In addition, the patch get rid of the unnecessary 'if'.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit c42d324099
("md: return -ENODEV if rdev has no mddev assigned") changed
rdev_attr_store to return -ENODEV when rdev->mddev is NULL, now do the
same to rdev_attr_show.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit d5d885fd51 ("md: introduce new personality funciton start()")
splits the init job to two parts. The first part run() does the jobs that
do not require the md threads. The second part start() does the jobs that
require the md threads.
Now it just does run() in adding new journal device. It needs to do the
second part start() too.
Fixes: d5d885fd51 ("md: introduce new personality funciton start()")
Cc: stable@vger.kernel.org #v4.9+
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These definitions are being moved to raid1-10.c.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The conversion is actually:
- add blank lines and indentation in order to identify paragraphs;
- fix tables markups;
- add some lists markups;
- mark literal blocks;
- adjust title markups.
At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
When people set a writeback percent via sysfs file,
/sys/block/bcache<N>/bcache/writeback_percent
current code directly sets BCACHE_DEV_WB_RUNNING to dc->disk.flags
and schedules kworker dc->writeback_rate_update.
If there is no cache set attached to, the writeback kernel thread is
not running indeed, running dc->writeback_rate_update does not make
sense and may cause NULL pointer deference when reference cache set
pointer inside update_writeback_rate().
This patch checks whether the cache set point (dc->disk.c) is NULL in
sysfs interface handler, and only set BCACHE_DEV_WB_RUNNING and
schedule dc->writeback_rate_update when dc->disk.c is not NULL (it
means the cache device is attached to a cache set).
This problem might be introduced from initial bcache commit, but
commit 3fd47bfe55 ("bcache: stop dc->writeback_rate_update properly")
changes part of the original code piece, so I add 'Fixes: 3fd47bfe55b0'
to indicate from which commit this patch can be applied.
Fixes: 3fd47bfe55 ("bcache: stop dc->writeback_rate_update properly")
Reported-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Recently people report bcache code compiled with gcc9 is broken, one of
the buggy behavior I observe is that two adjacent 4KB I/Os should merge
into one but they don't. Finally it turns out to be a stack corruption
caused by macro PRECEDING_KEY().
See how PRECEDING_KEY() is defined in bset.h,
437 #define PRECEDING_KEY(_k) \
438 ({ \
439 struct bkey *_ret = NULL; \
440 \
441 if (KEY_INODE(_k) || KEY_OFFSET(_k)) { \
442 _ret = &KEY(KEY_INODE(_k), KEY_OFFSET(_k), 0); \
443 \
444 if (!_ret->low) \
445 _ret->high--; \
446 _ret->low--; \
447 } \
448 \
449 _ret; \
450 })
At line 442, _ret points to address of a on-stack variable combined by
KEY(), the life range of this on-stack variable is in line 442-446,
once _ret is returned to bch_btree_insert_key(), the returned address
points to an invalid stack address and this address is overwritten in
the following called bch_btree_iter_init(). Then argument 'search' of
bch_btree_iter_init() points to some address inside stackframe of
bch_btree_iter_init(), exact address depends on how the compiler
allocates stack space. Now the stack is corrupted.
Fixes: 0eacac2203 ("bcache: PRECEDING_KEY()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reviewed-by: Pierre JUHEN <pierre.juhen@orange.fr>
Tested-by: Shenghui Wang <shhuiw@foxmail.com>
Tested-by: Pierre JUHEN <pierre.juhen@orange.fr>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Based on 1 normalized pattern(s):
this file is released under the gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 68 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms and conditions of the gnu general public license
version 2 as published by the free software foundation this program
is distributed in the hope it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 263 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
59 temple place suite 330 boston ma 02111 1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 1334 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Here is another set of reviewed patches that adds SPDX tags to different
kernel files, based on a set of rules that are being used to parse the
comments to try to determine that the license of the file is
"GPL-2.0-or-later". Only the "obvious" versions of these matches are
included here, a number of "non-obvious" variants of text have been
found but those have been postponed for later review and analysis.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on the
patches are reviewers.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOgmlw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk4rACfRqxGOGVLR/t6E9dDzOZRAdEz/mYAoJLZmziY
0YlSSSPtP5HI6JDh65Ng
=HXQb
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pule more SPDX updates from Greg KH:
"Here is another set of reviewed patches that adds SPDX tags to
different kernel files, based on a set of rules that are being used to
parse the comments to try to determine that the license of the file is
"GPL-2.0-or-later".
Only the "obvious" versions of these matches are included here, a
number of "non-obvious" variants of text have been found but those
have been postponed for later review and analysis.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on
the patches are reviewers"
* tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
...
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version you should have received a copy of the gnu general
public license for example usr src linux copying if not write to the
free software foundation inc 675 mass ave cambridge ma 02139 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 20 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 11 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.370933192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
doesn't properly trim special IOs (e.g. discards) relative to
corresponding target's max_io_len_target_boundary().
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAlzkicQTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWuRcCADKc4zew1EAhcHjwanfWoQ2d+5ONdxv
Ir1yiQX3hqK96G1Vu4IROsxWPDgNBI+60HwWI0z3/nMdv4F9a9Tfowl9fcKgZNey
jGrzP7x1+9GPtJY0BIpFUQ895qC75wXdd0c6HjmM5IaN0DRrv783ZMMBNhS6yx84
2vDThiEpGiDKUXHTP9b1khUbZzvkpTmlk293UlkFDftgejfW5mq1FqbjLfACMNVI
Nh7D6A7MjFrEKHjvNbiGgeFn93iW1+XqGbsFobuCV8Z4A0ImD85H78Lesri2qrRG
nxGfTZMyq++SGSOV/JTzZ4k/qkZfkrDyljYiPaTgZpCi1mZYQ6wH8+7u
=oKZ5
-----END PGP SIGNATURE-----
Merge tag 'for-5.2/dm-fix-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fix from Mike Snitzer:
"Fix a particularly glaring oversight in a DM core commit from 5.1 that
doesn't properly trim special IOs (e.g. discards) relative to
corresponding target's max_io_len_target_boundary()"
* tag 'for-5.2/dm-fix-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: make sure to obey max_io_len_target_boundary
Commit 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM
target interface") incorrectly removed code from
__send_changing_extent_only() that is required to impose a per-target IO
boundary on IO that exceeds max_io_len_target_boundary(). Otherwise
"special" IO (e.g. DISCARD, WRITE SAME, WRITE ZEROES) can write beyond
where allowed.
Fix this by restoring the max_io_len_target_boundary() limit in
__send_changing_extent_only()
Fixes: 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Michael Lass <bevan@bi-co.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have MODULE_LICENCE("GPL*") inside which was used in the initial
scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pankaj reports that starting with commit ad428cdb52 "dax: Check the
end of the block-device capacity with dax_direct_access()" device-mapper
no longer allows dax operation. This results from the stricter checks in
__bdev_dax_supported() that validate that the start and end of a
block-device map to the same 'pagemap' instance.
Teach the dax-core and device-mapper to validate the 'pagemap' on a
per-target basis. This is accomplished by refactoring the
bdev_dax_supported() internals into generic_fsdax_supported() which
takes a sector range to validate. Consequently generic_fsdax_supported()
is suitable to be used in a device-mapper ->iterate_devices() callback.
A new ->dax_supported() operation is added to allow composite devices to
split and route upper-level bdev_dax_supported() requests.
Fixes: ad428cdb52 ("dax: Check the end of the block-device...")
Cc: <stable@vger.kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
locking. Requires some list_bl interface improvements.
- Add ability for DM integrity to use a bitmap mode, that tracks regions
where data and metadata are out of sync, instead of using a journal.
- Improve DM thin provisioning target to not write metadata changes to
disk if the thin-pool and associated thin devices are merely
activated but not used. This avoids metadata corruption due to
concurrent activation of thin devices across different OS instances
(e.g. split brain scenarios, which ultimately would be avoided if
proper device filters were used -- but not having proper filtering has
proven a very common configuration mistake)
- Fix missing call to path selector type->end_io in DM multipath. This
fixes reported performance problems due to inaccurate path selector IO
accounting causing an imbalance of IO (e.g. avoiding issuing IO to
particular path due to it seemingly being heavily used).
- Fix bug in DM cache metadata's loading of its discard bitset that
could lead to all cache blocks being discarded if the very first cache
block was discarded (thankfully in practice the first cache block is
generally in use; be it FS superblock, partition table, disk label,
etc).
- Add testing-only DM dust target which simulates a device that has
failing sectors and/or read failures.
- Fix a DM init error path reference count hang that caused boot hangs
if user supplied malformed input on kernel commandline.
- Fix a couple issues with DM crypt target's logging being overly
verbose or lacking context.
- Various other small fixes to DM init, DM multipath, DM zoned, and DM
crypt.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAlzdcCgTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWsxZB/9idHl8LmwwL1JzBfi/XX7bWxwqDQLo
j1b3ycQ14AKVau4VCkmgDuRIfMDuU6PIAVvsMeVbF3aCE0fZ7zbEV1qHefbtJuCL
MMm//KbrhIT8oMKYUWtlOj7XI9MT6ErFzfActBZ6UF6r21m1N3bohhVGN7kvCnJm
wgmSlnz/m2GLKK8gQx+OisnAh0nlje3PIdIYPu7uWN6t0FF2XRz3UwWTuyw7lYhC
Rx2J+sOIL02CtadhHKLMCG8OutRXWP01cBSohUVJIMGihWfbe6aqvhG5afbqb4bG
UQrXl477ry5zyQ4fAU2JKZ+8qFvc1FoLLknKrZQu+uYPRokUPw/AwiL7
=mOH3
-----END PGP SIGNATURE-----
Merge tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Improve DM snapshot target's scalability by using finer grained
locking. Requires some list_bl interface improvements.
- Add ability for DM integrity to use a bitmap mode, that tracks
regions where data and metadata are out of sync, instead of using a
journal.
- Improve DM thin provisioning target to not write metadata changes to
disk if the thin-pool and associated thin devices are merely
activated but not used. This avoids metadata corruption due to
concurrent activation of thin devices across different OS instances
(e.g. split brain scenarios, which ultimately would be avoided if
proper device filters were used -- but not having proper filtering
has proven a very common configuration mistake)
- Fix missing call to path selector type->end_io in DM multipath. This
fixes reported performance problems due to inaccurate path selector
IO accounting causing an imbalance of IO (e.g. avoiding issuing IO to
particular path due to it seemingly being heavily used).
- Fix bug in DM cache metadata's loading of its discard bitset that
could lead to all cache blocks being discarded if the very first
cache block was discarded (thankfully in practice the first cache
block is generally in use; be it FS superblock, partition table, disk
label, etc).
- Add testing-only DM dust target which simulates a device that has
failing sectors and/or read failures.
- Fix a DM init error path reference count hang that caused boot hangs
if user supplied malformed input on kernel commandline.
- Fix a couple issues with DM crypt target's logging being overly
verbose or lacking context.
- Various other small fixes to DM init, DM multipath, DM zoned, and DM
crypt.
* tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (42 commits)
dm: fix a couple brace coding style issues
dm crypt: print device name in integrity error message
dm crypt: move detailed message into debug level
dm ioctl: fix hang in early create error condition
dm integrity: whitespace, coding style and dead code cleanup
dm integrity: implement synchronous mode for reboot handling
dm integrity: handle machine reboot in bitmap mode
dm integrity: add a bitmap mode
dm integrity: introduce a function add_new_range_and_wait()
dm integrity: allow large ranges to be described
dm ingerity: pass size to dm_integrity_alloc_page_list()
dm integrity: introduce rw_journal_sectors()
dm integrity: update documentation
dm integrity: don't report unused options
dm integrity: don't check null pointer before kvfree and vfree
dm integrity: correctly calculate the size of metadata area
dm dust: Make dm_dust_init and dm_dust_exit static
dm dust: remove redundant unsigned comparison to less than zero
dm mpath: always free attached_handler_name in parse_path()
dm init: fix max devices/targets checks
...
This message should better identify the DM device with the integrity
failure.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The information about tag size should not be printed without debug info
set. Also print device major:minor in the error message to identify the
device instance.
Also use rate limiting and debug level for info about used crypto API
implementaton. This is important because during online reencryption
the existing message saturates syslog (because we are moving hotzone
across the whole device).
Cc: stable@vger.kernel.org
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The dm_early_create() function (which deals with "dm-mod.create=" kernel
command line option) calls dm_hash_insert() who gets an extra reference
to the md object.
In case of failure, this reference wasn't being released, causing
dm_destroy() to hang, thus hanging the whole boot process.
Fix this by calling __hash_remove() in the error path.
Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Just some things that stood out like a sore thumb.
Also, converted some printk(KERN_CRIT, ...) to DMCRIT(...)
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Percpu reference counters should now be initialized with the
PERCPU_REF_ALLOW_REINIT in order to allow switching them to the
percpu mode from the atomic mode.
To make percpu_ref_switch_to_percpu() call in set_in_sync()
succeed,let's initialize percpu refcounters with the
PERCU_REF_ALLOW_REINIT flag.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Unfortunatelly, there may be bios coming even after the reboot notifier
was called. We don't want these bios to make the bitmap dirty again.
To address this, implement a synchronous mode - when a bio is about to
be terminated, we clean the bitmap and terminate the bio after the clean
operation succeeds. This obviously slows down bio processing, but it
makes sure that when all bios are finished, the bitmap will be clean.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When in bitmap mode the bitmap must be cleared when rebooting. This
commit adds the reboot hook.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce an alternate mode of operation where dm-integrity uses a
bitmap instead of a journal. If a bit in the bitmap is 1, the
corresponding region's data and integrity tags are not synchronized - if
the machine crashes, the unsynchronized regions will be recalculated.
The bitmap mode is faster than the journal mode, because we don't have
to write the data twice, but it is also less reliable, because if data
corruption happens when the machine crashes, it may not be detected.
Benchmark results for an SSD connected to a SATA300 port, when doing
large linear writes with dd:
buffered I/O:
raw device throughput - 245MB/s
dm-integrity with journaling - 120MB/s
dm-integrity with bitmap - 238MB/s
direct I/O with 1MB block size:
raw device throughput - 248MB/s
dm-integrity with journaling - 123MB/s
dm-integrity with bitmap - 223MB/s
For more info see dm-integrity in Documentation/device-mapper/
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce a function add_new_range_and_wait() in order to avoid
repetitive code. It will be used in the following commit.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ
UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q
hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj
x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA
VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb
qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP
SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b
TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO
GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s
Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni
RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX
RcCuPeLAVg==
=Ot0g
-----END PGP SIGNATURE-----
Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"Nothing major in this series, just fixes and improvements all over the
map. This contains:
- Series of fixes for sed-opal (David, Jonas)
- Fixes and performance tweaks for BFQ (via Paolo)
- Set of fixes for bcache (via Coly)
- Set of fixes for md (via Song)
- Enabling multi-page for passthrough requests (Ming)
- Queue release fix series (Ming)
- Device notification improvements (Martin)
- Propagate underlying device rotational status in loop (Holger)
- Removal of mtip32xx trim support, which has been disabled for years
(Christoph)
- Improvement and cleanup of nvme command handling (Christoph)
- Add block SPDX tags (Christoph)
- Cleanup/hardening of bio/bvec iteration (Christoph)
- A few NVMe pull requests (Christoph)
- Removal of CONFIG_LBDAF (Christoph)
- Various little fixes here and there"
* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
block: fix mismerge in bvec_advance
block: don't drain in-progress dispatch in blk_cleanup_queue()
blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
blk-mq: always free hctx after request queue is freed
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
blk-mq: free hw queue's resource in hctx's release handler
blk-mq: move cancel of requeue_work into blk_mq_release
blk-mq: grab .q_usage_counter when queuing request from plug code path
block: fix function name in comment
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
...
Change n_sectors data type from unsigned to sector_t. Following commits
will need to lock large ranges.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pass size to dm_integrity_alloc_page_list(). This is needed so
following commits can pass a size that is different from
ic->journal_pages.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce a function rw_journal_sectors() that takes sector and length
as its arguments instead of a section and the number of sections.
This functions will be used in further patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Update documentation with the "meta_device" parameter and flags.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If we are not journaling, don't report journaling options in the table
status.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The functions kfree, vfree and kvfree do nothing if we pass a NULL
pointer to them. So we don't need to test the pointer for NULL.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When we use separate devices for data and metadata, dm-integrity would
incorrectly calculate the size of the metadata device as if it had
512-byte block size - and it would refuse activation with larger block
size and smaller metadata device.
Fix this so that it takes actual block size into account, which fixes
the following reported issue:
https://gitlab.com/cryptsetup/cryptsetup/issues/450
Fixes: 356d9d52e1 ("dm integrity: allow separate metadata device")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix sparse warnings:
drivers/md/dm-dust.c:495:12: warning: symbol 'dm_dust_init' was not declared. Should it be static?
drivers/md/dm-dust.c:505:13: warning: symbol 'dm_dust_exit' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Variable block is an unsigned long long hence the less than zero
comparison is always false, hence it is redundant and can be removed.
Addresses-Coverity: ("Unsigned compared against 0")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull crypto update from Herbert Xu:
"API:
- Add support for AEAD in simd
- Add fuzz testing to testmgr
- Add panic_on_fail module parameter to testmgr
- Use per-CPU struct instead multiple variables in scompress
- Change verify API for akcipher
Algorithms:
- Convert x86 AEAD algorithms over to simd
- Forbid 2-key 3DES in FIPS mode
- Add EC-RDSA (GOST 34.10) algorithm
Drivers:
- Set output IV with ctr-aes in crypto4xx
- Set output IV in rockchip
- Fix potential length overflow with hashing in sun4i-ss
- Fix computation error with ctr in vmx
- Add SM4 protected keys support in ccree
- Remove long-broken mxc-scc driver
- Add rfc4106(gcm(aes)) cipher support in cavium/nitrox"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits)
crypto: ccree - use a proper le32 type for le32 val
crypto: ccree - remove set but not used variable 'du_size'
crypto: ccree - Make cc_sec_disable static
crypto: ccree - fix spelling mistake "protedcted" -> "protected"
crypto: caam/qi2 - generate hash keys in-place
crypto: caam/qi2 - fix DMA mapping of stack memory
crypto: caam/qi2 - fix zero-length buffer DMA mapping
crypto: stm32/cryp - update to return iv_out
crypto: stm32/cryp - remove request mutex protection
crypto: stm32/cryp - add weak key check for DES
crypto: atmel - remove set but not used variable 'alg_name'
crypto: picoxcell - Use dev_get_drvdata()
crypto: crypto4xx - get rid of redundant using_sd variable
crypto: crypto4xx - use sync skcipher for fallback
crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
crypto: crypto4xx - fix ctr-aes missing output IV
crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA
crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o
crypto: ccree - handle tee fips error during power management resume
crypto: ccree - add function to handle cryptocell tee fips error
...
Commit b592211c33 ("dm mpath: fix attached_handler_name leak and
dangling hw_handler_name pointer") fixed a memory leak for the case
where setup_scsi_dh() returns failure. But setup_scsi_dh may return
success and not "use" attached_handler_name if the
retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
properly "steals" the pointer by nullifying it, freeing it
unconditionally in parse_path() is safe.
Fixes: b592211c33 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
Cc: stable@vger.kernel.org
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets,
and not DM_MAX_{DEVICES,TARGETS} - 1.
Fix the checks and also fix the error message when the number of devices
is surpassed.
Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add the dm-dust target, which simulates the behavior of bad sectors
at arbitrary locations, and the ability to enable the emulation of
the read failures at an arbitrary time.
This target behaves similarly to a linear target. At a given time,
the user can send a message to the target to start failing read
requests on specific blocks. When the failure behavior is enabled,
reads of blocks configured "bad" will fail with EIO.
Writes of blocks configured "bad" will result in the following:
1. Remove the block from the "bad block list".
2. Successfully complete the write.
After this point, the block will successfully contain the written
data, and will service reads and writes normally. This emulates the
behavior of a "remapped sector" on a hard disk drive.
dm-dust provides logging of which blocks have been added or removed
to the "bad block list", as well as logging when a block has been
removed from the bad block list. These messages can be used
alongside the messages from the driver using a dm-dust device to
analyze the driver's behavior when a read fails at a given time.
(This logging can be reduced via a "quiet" mode, if desired.)
NOTE: If the block size is larger than 512 bytes, only the first sector
of each "dust block" is detected. Placing a limiting layer above a dust
target, to limit the minimum I/O size to the dust block size, will
ensure proper emulation of the given large block size.
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Co-developed-by: Joe Shimkus <jshimkus@redhat.com>
Co-developed-by: John Dorminy <jdorminy@redhat.com>
Co-developed-by: John Pittman <jpittman@redhat.com>
Co-developed-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a variable containing the buffer address instead of the to be
removed integer iterator from bio_for_each_segment_all.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 95f18c9d13 ("bcache: avoid potential memleak of list of
journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgets
to remove the original define of LIST_HEAD(journal), which makes
the change no take effect. This patch removes redundant variable
LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix
working.
Fixes: 95f18c9d13 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set")
Reported-by: Juha Aatrokoski <juha.aatrokoski@aalto.fi>
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface. This results in less storage space and
indirection.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.533968922@linutronix.de
This is a small optimization in writecache_find_entry().
If we go past the condition "if (unlikely(!node))", we can be certain that
there is no entry in the tree that has the block equal to the "block"
variable.
Consequently, we can return the next entry directly, we don't need to go
to the second part of the function that finds the entry with lowest or
highest seq number that matches the "block" variable.
Also, add some whitespace and cleanup needless braces.
Suggested-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The stucture member page_offset in writeback_struct never has been
used actually. Remove it.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When the target line contains an invalid device, delay_ctr() will call
delay_dtr() with NULL workqueue. Attempting to destroy the NULL
workqueue causes a crash.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
md->dax_dev defaults to NULL and there is no need to initialize it
if CONFIG_DAX_DRIVER is disabled.
Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
After commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via
blk_insert_cloned_request feedback"), map_request() will requeue the tio
when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
Thus, if device driver status is error, a tio may be requeued multiple
times until the return value is not DM_MAPIO_REQUEUE. That means
type->start_io may be called multiple times, while type->end_io is only
called when IO complete.
In fact, even without commit 396eaf21ee, setup_clone() failure can
also cause tio requeue and associated missed call to type->end_io.
The service-time path selector selects path based on in_flight_size,
which is increased by st_start_io() and decreased by st_end_io().
Missed calls to st_end_io() can lead to in_flight_size count error and
will cause the selector to make the wrong choice. In addition,
queue-length path selector will also be affected.
To fix the problem, call type->end_io in ->release_clone_rq before tio
requeue. map_info is passed to ->release_clone_rq() for map_request()
error path that result in requeue.
Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Cc: stable@vger.kernl.org
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The flags field in 'struct shash_desc' never actually does anything.
The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP.
However, no shash algorithm ever sleeps, making this flag a no-op.
With this being the case, inevitably some users who can't sleep wrongly
pass MAY_SLEEP. These would all need to be fixed if any shash algorithm
actually started sleeping. For example, the shash_ahash_*() functions,
which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP
from the ahash API to the shash API. However, the shash functions are
called under kmap_atomic(), so actually they're assumed to never sleep.
Even if it turns out that some users do need preemption points while
hashing large buffers, we could easily provide a helper function
crypto_shash_update_large() which divides the data into smaller chunks
and calls crypto_shash_update() and cond_resched() for each chunk. It's
not necessary to have a flag in 'struct shash_desc', nor is it necessary
to make individual shash algorithms aware of this at all.
Therefore, remove shash_desc::flags, and document that the
crypto_shash_*() functions can be called from any context.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
to collect journal_replay(s) and filled by bch_journal_read().
If all goes well, bch_journal_replay() will release the list of
jounal_replay(s) at the end of the branch.
If something goes wrong, code flow will jump to the label "err:" and leave
the list unreleased.
This patch will release the list of journal_replay(s) in the case of
error detected.
v1 -> v2:
* Move the release code to the location after label 'err:' to
simply the change.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Elements of keylist should be accessed before the list is freed.
Move bch_keylist_free() calling after the while loop to avoid wrong
content accessed.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
journal replay failed with messages:
Sep 10 19:10:43 ceph kernel: bcache: error on
bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
2057493-2057567 missing! (replaying 2057493-2076601), disabling
caching
The reason is in journal_reclaim(), when discard is enabled, we send
discard command and reclaim those journal buckets whose seq is old
than the last_seq_now, but before we write a journal with last_seq_now,
the machine is restarted, so the journal with the last_seq_now is not
written to the journal bucket, and the last_seq_wrote in the newest
journal is old than last_seq_now which we expect to be, so when we doing
replay, journals from last_seq_wrote to last_seq_now are missing.
It's hard to write a journal immediately after journal_reclaim(),
and it harmless if those missed journal are caused by discarding
since those journals are already wrote to btree node. So, if miss
seqs are started from the beginning journal, we treat it as normal,
and only print a message to show the miss journal, and point out
it maybe caused by discarding.
Patch v2 add a judgement condition to ignore the missed journal
only when discard enabled as Coly suggested.
(Coly Li: rebase the patch with other changes in bch_journal_replay())
Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Tested-by: Dennis Schridde <devurandom@gmx.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch tries to release mutex bch_register_lock early, to give
chance to stop cache set and bcache device early.
This patch also expends time out of stopping all bcache device from
2 seconds to 10 seconds, because stopping writeback rate update worker
may delay for 5 seconds, 2 seconds is not enough.
After this patch applied, stopping bcache devices during system reboot
or shutdown is very hard to be observed any more.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add code comments to explain which call back function might be called
for the closure_queue(). This is an effort to make code to be more
understandable for readers.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add comments to explain why in register_bcache() blkdev_put() won't
be called in two location. Add comments to explain why blkdev_put()
must be called in register_cache() when cache_alloc() failed.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds return value to register_bdev(). Then if failure happens
inside register_bdev(), its caller register_bcache() may detect and
handle the failure more properly.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When failure happens inside bch_journal_replay(), calling
cache_set_err_on() and handling the failure in async way is not a good
idea. Because after bch_journal_replay() returns, registering code will
continue to execute following steps, and unregistering code triggered
by cache_set_err_on() is running in same time. First it is unnecessary
to handle failure and unregister cache set in an async way, second there
might be potential race condition to run register and unregister code
for same cache set.
So in this patch, if failure happens in bch_journal_replay(), we don't
call cache_set_err_on(), and just print out the same error message to
kernel message buffer, then return -EIO immediately caller. Then caller
can detect such failure and handle it in synchrnozied way.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bcache has several routines to release resources in implicit way, they
are called when the associated kobj released. This patch adds code
comments to notice when and which release callback will be called,
- When dc->disk.kobj released:
void bch_cached_dev_release(struct kobject *kobj)
- When d->kobj released:
void bch_flash_dev_release(struct kobject *kobj)
- When c->kobj released:
void bch_cache_set_release(struct kobject *kobj)
- When ca->kobj released
void bch_cache_release(struct kobject *kobj)
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently run_cache_set() has no return value, if there is failure in
bch_journal_replay(), the caller of run_cache_set() has no idea about
such failure and just continue to execute following code after
run_cache_set(). The internal failure is triggered inside
bch_journal_replay() and being handled in async way. This behavior is
inefficient, while failure handling inside bch_journal_replay(), cache
register code is still running to start the cache set. Registering and
unregistering code running as same time may introduce some rare race
condition, and make the code to be more hard to be understood.
This patch adds return value to run_cache_set(), and returns -EIO if
bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
such failure and stop registering code flow immedidately inside
register_cache_set().
If journal replay fails, run_cache_set() can report error immediately
to register_cache_set(). This patch makes the failure handling for
bch_journal_replay() be in synchronized way, easier to understand and
debug, and avoid poetential race condition for register-and-unregister
in same time.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In journal_reclaim() ja->cur_idx of each cache will be update to
reclaim available journal buckets. Variable 'int n' is used to count how
many cache is successfully reclaimed, then n is set to c->journal.key
by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
loop will write the jset data onto each cache.
The problem is, if all jouranl buckets on each cache is full, the
following code in journal_reclaim(),
529 for_each_cache(ca, c, iter) {
530 struct journal_device *ja = &ca->journal;
531 unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533 /* No space available on this device */
534 if (next == ja->discard_idx)
535 continue;
536
537 ja->cur_idx = next;
538 k->ptr[n++] = MAKE_PTR(0,
539 bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540 ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);
If there is no available bucket to reclaim, the if() condition at line
534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
will set KEY_PTRS field of c->journal.key to 0.
Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
journal_write_unlocked() the journal data is written in following loop,
649 for (i = 0; i < KEY_PTRS(k); i++) {
650-671 submit journal data to cache device
672 }
If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
won't be written to cache device here. If system crahed or rebooted
before bkeys of the lost journal entries written into btree nodes, data
corruption will be reported during bcache reload after rebooting the
system.
Indeed there is only one cache in a cache set, there is no need to set
KEY_PTRS field in journal_reclaim() at all. But in order to keep the
for_each_cache() logic consistent for now, this patch fixes the above
problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
available to reclaim.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'int ret' is defined as a local variable inside macro read_bucket().
Since this macro is called multiple times, and following patches will
use a 'int ret' variable in bch_journal_read(), this patch moves
definition of 'int ret' from macro read_bucket() to range of function
bch_journal_read().
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a few nits in this function. They could in theory all
be separate patches, but that's probably taking small commits
too far.
1) I added a brief comment saying what it does.
2) I like to declare pointer parameters "const" where possible
for documentation reasons.
3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
weight of a 32-bit random number (giving a random integer with
mean 16 and variance 8). Passing by reference in a 64-bit variable
is silly; just use hweight32().
4) Its helper function fract_exp_two is unnecessarily tangled.
Gcc can optimize the multiply by (1 << x) to a shift, but it can
be written in a much more straightforward way at the cost of one
more bit of internal precision. Some analysis reveals that this
bit is always available.
This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:
0000000000000000 <foo1>:
0: 89 f9 mov %edi,%ecx
2: c1 e9 06 shr $0x6,%ecx
5: b8 01 00 00 00 mov $0x1,%eax
a: d3 e0 shl %cl,%eax
c: 83 e7 3f and $0x3f,%edi
f: d3 e7 shl %cl,%edi
11: c1 ef 06 shr $0x6,%edi
14: 01 f8 add %edi,%eax
16: c3 retq
To 19:
0000000000000017 <foo2>:
17: 89 f8 mov %edi,%eax
19: 83 e0 3f and $0x3f,%eax
1c: 83 c0 40 add $0x40,%eax
1f: 89 f9 mov %edi,%ecx
21: c1 e9 06 shr $0x6,%ecx
24: d3 e0 shl %cl,%eax
26: c1 e8 06 shr $0x6,%eax
29: c3 retq
(Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
both versions produce the same output.)
5) And finally, the call to bch_get_congested() in check_should_bypass()
is separated from the use of the value by multiple tests which
could moot the need to compute it. Move the computation down to
where it's needed. This also saves a local register to hold the
computed value.
Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch uses kmemdup_nul to create a NUL-terminated string from
dc->sb.label. This is better than open coding it.
With this, we can move env[2] initialization into env[] array to make
code more elegant.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
clang has identified a code path in which it thinks a
variable may be unused:
drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false
[-Werror,-Wsometimes-uninitialized]
fifo_pop(&ca->free_inc, bucket);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
#define fifo_pop(fifo, i) fifo_pop_front(fifo, (i))
^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front'
if (_r) { \
^~
drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here
allocator_wait(ca, bch_allocator_push(ca, bucket));
^~~~~~
drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait'
if (cond) \
^~~~
drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true
fifo_pop(&ca->free_inc, bucket);
^
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
#define fifo_pop(fifo, i) fifo_pop_front(fifo, (i))
^
drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front'
if (_r) { \
^
drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning
long bucket;
^
This cannot happen in practice because we only enter the loop
if there is at least one element in the list.
Slightly rearranging the code makes this clearer to both the
reader and the compiler, which avoids the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To get the amount of unused buckets in sysfs_priority_stats, the code
count the buckets which GC_SECTORS_USED is zero. It's correct and should
not be overwritten by the count of buckets which prio is zero.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio from upper layer is considered completed when bio_complete()
returns. In most scenarios bio_complete() is called in search_free(),
but when read miss happens, the bio_compete() is called when backing
device reading completed, while the struct search is still in use until
cache inserting finished.
If someone stops the bcache device just then, the device may be closed
and released, but after cache inserting finished the struct search will
access a freed struct cached_dev.
This patch add the reference of bcache device before bio_complete() when
read miss happens, and put it after the search is not used.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Otherwise, just activating a thin-pool and thin device and then
deactivating them will cause the thin-pool metadata to be changed
(e.g. superblock written) -- even without any metadata being changed.
Add 'in_service' flag to struct dm_pool_metadata and set it in
pmd_write_lock() because all on-disk metadata changes must take a write
lock of pmd->root_lock. Once 'in_service' is set it is never cleared.
__commit_transaction() will return 0 if 'in_service' is not set.
dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that
it isn't the sole reason for putting a thin-pool in service.
Also fix dm_pool_commit_metadata() to open the next transaction if the
return from __commit_transaction() is 0. Not seeing why the early
return ever made since for a return of 0 given that dm-io's async_io(),
as used by bufio, always returns 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
No functional change, but this prepares to hook off of pmd_write_lock()
with additional functionality (as provided in next commit).
Suggested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Otherwise, memory that is allocated (and potentially not previously
zeroed) will get written to disk as part of the space maps.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In functions writecache_discard() and writecache_find_entry() there is a
high probablity that the pointer of structure rb_node won't equal NULL.
Add unlikely for the pointer node NULL.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
bio is already available so there is no need to access it in terms of
the wb pointer.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Substitute the global locking scheme with a fine grained one, employing
the read-write semaphore and the scalable exception tables with
per-bucket locks introduced by the previous two commits.
Summarizing, we now use a read-write semaphore to protect the mostly
read fields of the snapshot structure, e.g., valid, active, etc., and
per-bucket bit spinlocks to protect accesses to the complete and pending
exception tables.
Finally, we use an extra spinlock (pe_allocation_lock) to serialize the
allocation of new exceptions by the exception store. This allocation is
really fast, so the extra spinlock doesn't hurt the performance.
This scheme allows dm-snapshot to scale better, resulting in increased
IOPS and reduced latency.
Following are some benchmark results using the null_blk device:
modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1
* Benchmark fio_origin_randwrite_throughput_N, from the device mapper
test suite [1] (direct IO, random 4K writes to origin device, IO
engine libaio):
+--------------+-------------+------------+
| # of workers | IOPS Before | IOPS After |
+--------------+-------------+------------+
| 1 | 57708 | 66421 |
| 2 | 63415 | 77589 |
| 4 | 67276 | 98839 |
| 8 | 60564 | 109258 |
+--------------+-------------+------------+
* Benchmark fio_origin_randwrite_latency_N, from the device mapper test
suite [1] (direct IO, random 4K writes to origin device, IO engine
psync):
+--------------+-----------------------+----------------------+
| # of workers | Latency (usec) Before | Latency (usec) After |
+--------------+-----------------------+----------------------+
| 1 | 16.25 | 13.27 |
| 2 | 31.65 | 25.08 |
| 4 | 55.28 | 41.08 |
| 8 | 121.47 | 74.44 |
+--------------+-----------------------+----------------------+
* Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
test suite [1] (direct IO, random 4K writes to snapshot device, IO
engine libaio):
+--------------+-------------+------------+
| # of workers | IOPS Before | IOPS After |
+--------------+-------------+------------+
| 1 | 72593 | 84938 |
| 2 | 97379 | 134973 |
| 4 | 90610 | 143077 |
| 8 | 90537 | 180085 |
+--------------+-------------+------------+
* Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
test suite [1] (direct IO, random 4K writes to snapshot device, IO
engine psync):
+--------------+-----------------------+----------------------+
| # of workers | Latency (usec) Before | Latency (usec) After |
+--------------+-----------------------+----------------------+
| 1 | 12.53 | 10.6 |
| 2 | 19.78 | 14.89 |
| 4 | 40.37 | 23.47 |
| 8 | 89.32 | 48.48 |
+--------------+-----------------------+----------------------+
[1] https://github.com/jthornber/device-mapper-test-suite
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use list_bl to implement the exception hash tables' buckets. This change
permits concurrent access, to distinct buckets, by multiple threads.
Also, implement helper functions to lock and unlock the exception tables
based on the chunk number of the exception at hand.
We retain the global locking, by means of down_write(), which is
replaced by the next commit.
Still, we must acquire the per-bucket spinlocks when accessing the hash
tables, since list_bl does not allow modification on unlocked lists.
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-snapshot uses a single mutex to serialize every access to the
snapshot state. This includes all accesses to the complete and pending
exception tables, which occur at every origin write, every snapshot
read/write and every exception completion.
The lock statistics indicate that this mutex is a bottleneck (average
wait time ~480 usecs for 8 processes doing random 4K writes to the
origin device) preventing dm-snapshot to scale as the number of threads
doing IO increases.
The major contention points are __origin_write()/snapshot_map() and
pending_complete(), i.e., the submission and completion of pending
exceptions.
Replace this mutex with a rw semaphore.
We essentially revert commit ae1093be5a ("dm snapshot: use mutex
instead of rw_semaphore") and together with the next two patches we
substitute the single mutex with a fine-grained locking scheme, where we
use a read-write semaphore to protect the mostly read fields of the
snapshot structure, e.g., valid, active, etc., and per-bucket bit
spinlocks to protect accesses to the complete and pending exception
tables.
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When completing a pending exception, pending_complete() waits for all
conflicting reads to drain, before inserting the final, completed
exception. Conflicting reads are snapshot reads redirected to the
origin, because the relevant chunk is not remapped to the COW device the
moment we receive the read.
The completed exception must be inserted into the exception table after
all conflicting reads drain to ensure snapshot reads don't return
corrupted data. This is required because inserting the completed
exception into the exception table signals that the relevant chunk is
remapped and both origin writes and snapshot merging will now overwrite
the chunk in origin.
This wait is done holding the snapshot lock to ensure that
pending_complete() doesn't starve if new snapshot reads keep coming for
this chunk.
In preparation for the next commit, where we use a spinlock instead of a
mutex to protect the exception tables, we remove the need for holding
the lock while waiting for conflicting reads to drain.
We achieve this in two steps:
1. pending_complete() inserts the completed exception before waiting for
conflicting reads to drain and removes the pending exception after
all conflicting reads drain.
This ensures that new snapshot reads will be redirected to the COW
device, instead of the origin, and thus pending_complete() will not
starve. Moreover, we use the existence of both a completed and
a pending exception to signify that the COW is done but there are
conflicting reads in flight.
2. In __origin_write() we check first if there is a pending exception
and then if there is a completed exception. If there is a pending
exception any submitted BIO is delayed on the pe->origin_bios list and
DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the
origin nor snapshot merging can overwrite the origin chunk, until all
conflicting reads drain, and thus snapshot reads will not return
corrupted data.
Summarizing, we now have the following possible combinations of pending
and completed exceptions for a chunk, along with their meaning:
A. No exceptions exist: The chunk has not been remapped yet.
B. Only a pending exception exists: The chunk is currently being copied
to the COW device.
C. Both a pending and a completed exception exist: COW for this chunk
has completed but there are snapshot reads in flight which had been
redirected to the origin before the chunk was remapped.
D. Only the completed exception exists: COW has been completed and there
are no conflicting reads in flight.
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add missing dm_bitset_cursor_next() to properly advance the bitset
cursor.
Otherwise, the discarded state of all blocks is set according to the
discarded state of the first block.
Fixes: ae4a46a1f6 ("dm cache metadata: use bitset cursor api to load discard bitset")
Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The function blkdev_report_zones() returns success even if no zone
information is reported (empty report). Empty zone reports can only
happen if the report start sector passed exceeds the device capacity.
The conditions for this to happen are either a bug in the caller code,
or, a change in the device that forced the low level driver to change
the device capacity to a value that is lower than the report start
sector. This situation includes a failed disk revalidation resulting in
the disk capacity being changed to 0.
If this change happens while dm-zoned is in its initialization phase
executing dmz_init_zones(), this function may enter an infinite loop
and hang the system. To avoid this, add a check to disallow empty zone
reports and bail out early. Also fix the function dmz_update_zone() to
make sure that the report for the requested zone was correctly obtained.
Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Shaun Tancheff <shaun@tancheff.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
My static checker complains about this line from dmz_get_zoned_device()
aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1);
The problem is that "aligned_capacity" and "dev->capacity" are sector_t
type (which is a u64 under most configs) but blk_queue_zone_sectors(q)
returns a u32 so the higher 32 bits in aligned_capacity are cleared to
zero. This patch adds a cast to address the issue.
Fixes: 114e025968 ("dm zoned: ignore last smaller runt zone")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The sector used here is a little endian value, so use the right
type for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The problem is that any 'uptodate' vs 'disks' check is not precise
in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
device that might try to kick off writes and then skip the action.
Better to prevent the raid driver from taking unexpected action *and* keep
the system alive vs killing the machine with BUG_ON.
Note: fixed warning reported by kbuild test robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
This reverts commit 4f4fd7c579.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Nigel Croxon <ncroxon@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Mdadm expects that setting drive as faulty will fail with -EBUSY only if
this operation will cause RAID to be failed. If this happens, it will
try to stop the array. Currently -EBUSY might also be returned if rdev
is in the middle of the removal process - for example there is a race
with mdmon that already requested the drive to be failed/removed.
If rdev does not contain mddev, return -ENODEV instead, so the caller
can distinguish between those two cases and behave accordingly.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlyzsYgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGMw0H/ir42KJiABBKSETD
0d38qXVclAI/123zl8EkSfDrBKOsuIpXUDxzKeoDMhMkiurMpK6bbEOTPJAQMZJe
nEYpq/bZQi+vO8Q/pMMpaC3ExlIRosd0JAR7TyDUh5ZAeeMuDNzmvMk/DPxXPbNt
0P1FWePDa7908ajCOW1T8ZrB9Ak8boo7TKkF3LBb00ks1mEkyp/l74MKOHdu+HYn
XIwncX/Jotl4BrKdNC2f/NXYLYk6MrJDGug8TxuHgIqiMWhhrcSqbxU1ri7iqFXB
cBYdFo6ZJ8CWHux8/5LY5CMjSqEtzKha2Ohuhy3MMu1RsICyFLQtHnxHJ1ytLSBt
DOPcDQ0=
=CEUD
-----END PGP SIGNATURE-----
Merge tag 'v5.1-rc5' into for-5.2/block
Pull in v5.1-rc5 to resolve two conflicts. One is in BFQ, in just
a comment, and is trivial. The other one is a conflict due to a
later fix in the bio multi-page work, and needs a bit more care.
* tag 'v5.1-rc5': (476 commits)
Linux 5.1-rc5
fs: prevent page refcount overflow in pipe_buf_get
mm: prevent get_user_pages() from overflowing page refcount
mm: add 'try_get_page()' helper function
mm: make page ref count overflow check tighter and more explicit
clk: imx: Fix PLL_1416X not rounding rates
clk: mediatek: fix clk-gate flag setting
arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
iommu/amd: Set exclusion range correctly
clang-format: Update with the latest for_each macro list
perf/core: Fix perf_event_disable_inatomic() race
block: fix the return errno for direct IO
Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
NFSv4.1 fix incorrect return value in copy_file_range
xprtrdma: Fix helper that drains the transport
NFS: Fix handling of reply page vector
NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
dma-debug: only skip one stackframe entry
platform/x86: pmc_atom: Drop __initconst on dmi table
nvmet: fix discover log page when offsets are used
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This tells sparse that we release and reacquire the device_lock and
avoids a warning.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
This tells sparse that we acquire/release the two stripe locks and
avoids a warning.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Sparse complains that it has no external declaration, and it turns out
that it is never even used outside of md.c. So just mark it static
and drop the export.
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
The on-disk value is little endian and we need to convert it to
native endian before storing the value in the in-core structure.
Fixes: 7564beda19 ("md-cluster/raid10: support add disk under grow mode")
Cc: <stable@vger.kernel.org> # 4.20+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
which can avoid potential NULL pointer derefence in fallowing
add_bound_rdev().
Fixes: a6da4ef85c ("md: re-add a failed disk")
Cc: Xiao Ni <xni@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures. These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time. Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.
Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
dm-integrity will deadlock if overlapping I/O is issued to it, the bug
was introduced by commit 724376a04d ("dm integrity: implement fair
range locks"). Users rarely use overlapping I/O so this bug went
undetected until now.
Fix this bug by correcting, likely cut-n-paste, typos in
ranges_overlap() and also remove a flawed ranges_overlap() check in
remove_range_unlocked(). This condition could leave unprocessed bios
hanging on wait_list forever.
Cc: stable@vger.kernel.org # v4.19+
Fixes: 724376a04d ("dm integrity: implement fair range locks")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Storage devices which report supporting discard commands like
WRITE_SAME_16 with unmap, but reject discard commands sent to the
storage device. This is a clear storage firmware bug but it doesn't
change the fact that should a program cause discards to be sent to a
multipath device layered on this buggy storage, all paths can end up
failed at the same time from the discards, causing possible I/O loss.
The first discard to a path will fail with Illegal Request, Invalid
field in cdb, e.g.:
kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
kernel: blk_update_request: critical target error, dev sdfn, sector 10487808
The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
device disables its support for discard on this path, and because of the
BLK_STS_TARGET error multipath fails the discard without failing any
path or retrying down a different path. But subsequent discards can
cause path failures. Any discards sent to the path which already failed
a discard ends up failing with EIO from blk_cloned_rq_check_limits with
an "over max size limit" error since the discard limit was set to 0 by
the sd driver for the path. As the error is EIO, this now fails the
path and multipath tries to send the discard down the next path. This
cycle continues as discards are sent until all paths fail.
Fix this by training DM core to disable DISCARD if the underlying
storage already did so.
Also, fix branching in dm_done() and clone_endio() to reflect the
mutually exclussive nature of the IO operations in question.
Cc: stable@vger.kernel.org
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Some devices don't use blk_integrity but still want stable pages
because they do their own checksumming. Examples include rbd and iSCSI
when data digests are negotiated. Stacking DM (and thus LVM) on top of
these devices results in sporadic checksum errors.
Set BDI_CAP_STABLE_WRITES if any underlying device has it set.
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The limit was already incorporated to dm-crypt with commit 4e870e948f
("dm crypt: fix error with too large bios"), so we don't need to apply
it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
wrong anyway because the variable ti->max_io_len it is supposed to be in
the units of 512-byte sectors not in bytes.
Reduction of the limit to 1048576 sectors could even cause data
corruption in rare cases - suppose that we have a dm-striped device with
stripe size 768MiB. The target will call dm_set_target_max_io_len with
the value 1572864. The buggy code would reduce it to 1048576. Now, the
dm-core will errorneously split the bios on 1048576-sector boundary
insetad of 1572864-sector boundary and pass these stripe-crossing bios
to the striped target.
Cc: stable@vger.kernel.org # v4.16+
Fixes: 8f50e35815 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A non const pointer to const cannot be marked initconst.
Mark the array actually const.
Fixes: 6bbc923dfc dm: add support to directly boot to a mapped device
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix sparse warnings:
drivers/md/dm-integrity.c:3619:12: warning:
symbol 'dm_integrity_init' was not declared. Should it be static?
drivers/md/dm-integrity.c:3638:6: warning:
symbol 'dm_integrity_exit' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the string opt_string is small, the function memcmp can access bytes
that are beyond the terminating nul character. In theory, it could cause
segfault, if opt_string were located just below some unmapped memory.
Change from memcmp to strncmp so that we don't read bytes beyond the end
of the string.
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit 5a409b4f56.
This patch has two problems.
1/ it make multiple calls to submit_bio() from inside a make_request_fn.
The bios thus submitted will be queued on current->bio_list and not
submitted immediately. As the bios are allocated from a mempool,
this can theoretically result in a deadlock - all the pool of requests
could be in various ->bio_list queues and a subsequent mempool_alloc
could block waiting for one of them to be released.
2/ It aims to handle a case when there are many concurrent flush requests.
It handles this by submitting many requests in parallel - all of which
are identical and so most of which do nothing useful.
It would be more efficient to just send one lower-level request, but
allow that to satisfy multiple upper-level requests.
Fixes: 5a409b4f56 ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Changing state from check_state_check_result to
check_state_compute_result not only is unsafe but also doesn't
appear to serve a valid purpose. A raid6 check should only be
pushing out extra writes if doing repair and a mis-match occurs.
The stripe dev management will already try and do repair writes
for failing sectors.
This patch makes the raid6 check_state_check_result handling
work more like raid5's. If somehow too many failures for a
check, just quit the check operation for the stripe. When any
checks pass, don't try and use check_state_compute_result for
a purpose it isn't needed for and is unsafe for. Just mark the
stripe as in sync for passing its parity checks and let the
stripe dev read/write code and the bad blocks list do their
job handling I/O errors.
Repro steps from Xiao:
These are the steps to reproduce this problem:
1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c
2. insmod scsi_debug.ko dev_size_mb=11000 max_luns=1 num_tgts=1
3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
sde is the disk created by scsi_debug
4. echo "2" >/sys/module/scsi_debug/parameters/opts
5. raid-check
It panic:
[ 4854.730899] md: data-check of RAID array md127
[ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current]
[ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error
[ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00
[ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0
[ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current]
[ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error
[ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00
[ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000
[ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current]
[ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error
[ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00
[ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000
[ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current]
[ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error
[ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00
[ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0
[ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1).
[ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1).
[ 4854.906201] ------------[ cut here ]------------
[ 4854.907341] kernel BUG at drivers/md/raid5.c:4190!
raid5.c:4190 above is this BUG_ON:
handle_parity_checks6()
...
BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
Cc: <stable@vger.kernel.org> # v3.16+
OriginalAuthor: David Jeffery <djeffery@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Tested-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: David Jeffy <djeffery@redhat.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyL124QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptsxD/42slmoE5TC3vwXcgMBEilrjIHCns6O4Leo
0r8Awdwil8QkVDphfAWsgkTBjRPUNKv4cCg2kG4VEzAy62YSutUWPeqJZwLOpGDI
kji9XI6WLqwQ/VhDFwEln9G+xWDUQxds5PZDomlzLpjiNqkFArwwsPFnJbshH4fB
U6kZrhVSLfvJHIJmC9H4RIWuTEwUH1yFSvzzMqDOOyvRon2g/A2YlHb2KhSCaJPq
1b0jbhyR0GVP0EH1FdeKvNYFZfvXXSPAbxDN1CEtW/Lq8WxXeoaCj390tC+gL7yQ
WWHntvUoVU/weWudbT3tVsYgpI91KfPM5OuWTDGod6lFwHrI5X91Pao3KYUGPb9d
cwvNBOlkNqR1ENZOGTgxLeKwiwV7G1DIjvsaijRQJhGy4Uw4RkM/YEct9JHxWBIF
x4ZuSVUVZ5Y3zNPC945iJ6Z5feOz/UO9bQL00oimu0c0JhAp++3pHWAFJEMQ8q1a
0IRifkeUyhf0p9CIVPDnUzmNgSBglFkAVTPVAWySBVDU+v0/GoNcYwTzPq4cgPrF
UJEIlx+RdDpKKmCqBvKjtx4w7BC1lCebL/1ZJrbARNO42djt8xeuyvKw0t+MYVTZ
UsvLX72tXwUIbj0IZZGuz+8uSGD4ddDs8+x486FN4oaCPf36FUnnkOZZkhjV/KQA
vsZNrNNZpw==
=qBae
-----END PGP SIGNATURE-----
Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block
Pull more block layer changes from Jens Axboe:
"This is a collection of both stragglers, and fixes that came in after
I finalized the initial pull. This contains:
- An MD pull request from Song, with a few minor fixes
- Set of NVMe patches via Christoph
- Pull request from Konrad, with a few fixes for xen/blkback
- pblk fix IO calculation fix (Javier)
- Segment calculation fix for pass-through (Ming)
- Fallthrough annotation for blkcg (Mathieu)"
* tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits)
blkcg: annotate implicit fall through
nvme-tcp: support C2HData with SUCCESS flag
nvmet: ignore EOPNOTSUPP for discard
nvme: add proper write zeroes setup for the multipath device
nvme: add proper discard setup for the multipath device
nvme: remove nvme_ns_config_oncs
nvme: disable Write Zeroes for qemu controllers
nvmet-fc: bring Disconnect into compliance with FC-NVME spec
nvmet-fc: fix issues with targetport assoc_list list walking
nvme-fc: reject reconnect if io queue count is reduced to zero
nvme-fc: fix numa_node when dev is null
nvme-fc: use nr_phys_segments to determine existence of sgl
nvme-loop: init nvmet_ctrl fatal_err_work when allocate
nvme: update comment to make the code easier to read
nvme: put ns_head ref if namespace fails allocation
nvme-trace: fix cdw10 buffer overrun
nvme: don't warn on block content change effects
nvme: add get-feature to admin cmds tracer
md: Fix failed allocation of md_register_thread
It's wrong to add len to sector_nr in raid10 reshape twice
...
mddev->sync_thread can be set to NULL on kzalloc failure downstream.
The patch checks for such a scenario and frees allocated resources.
Committer node:
Added similar fix to raid5.c, as suggested by Guoqing.
Cc: stable@vger.kernel.org # v3.16+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: Song Liu <songliubraving@fb.com>
In reshape_request it already adds len to sector_nr already. It's wrong to add len to
sector_nr again after adding pages to bio. If there is bad block it can't copy one chunk
at a time, it needs to goto read_more. Now the sector_nr is wrong. It can cause data
corruption.
Cc: stable@vger.kernel.org # v3.16+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
When the Partial Parity Log is enabled, circular buffer is used to store
PPL data. Each write to RAID device causes overwrite of data in this buffer
so some write_hint can be set to those request to help drives handle
garbage collection. This patch adds new sysfs attribute which can be used
to specify which write_hint should be assigned to PPL.
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
The code really just wants a big flat buffer, so just do that.
Link: http://lkml.kernel.org/r/20181217131929.11727-3-kent.overstreet@gmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DM targets to properly advertise discard limits that blk_queue_split()
looks at when dtermining to split discard. Whereby allowing DM core's
own 'split_discard_bios' to be removed.
- Improve DM cache target to provide support for discard passdown to the
origin device.
- Introduce support to directly boot to a DM mapped device from init by
using dm-mod.create= module param. This eliminates the need for an
elaborate initramfs that is otherwise needed to create DM devices.
This feature's implementation has been worked on for quite some time
(got up to v12) and is of particular interest to Android and other
more embedded platforms (e.g. ARM).
- Rate limit errors from the DM integrity target that were identified as
the cause for recent NMI hangs due to console limitations.
- Add sanity checks for user input to thin-pool and external snapshot
creation.
- Remove some unused leftover kmem caches from when old .request_fn
request-based support was removed.
- Various small cleanups and fixes to targets (e.g. typos, needless
unlikely() annotations, use struct_size(), remove needless
.direct_access method from dm-snapshot)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJcgT+7AAoJEMUj8QotnQNaAsUIAIxsO5y6+7UruZzZxpyYBA34
yBLnZ9SICxESteu4R9lWT4LnFbrdwDDSSCeQ1dFt5/vx54T4qISN/O3lv9e//BeJ
BxFXtu7wB485l28uojBZeb+9APTaoihfEokcfDqZnaf26XtY0t/M+yRP7U86eGcC
zsX9fOEmJ3cpWtpai07tbHNDjIrr1kIWcFuU2+xGO/wn+Up8uLd85exi7e3cqDs6
VC+YJ/10/2keqFQvse3w3TBMjduwpb7SlDa2z/SorYaStVHzgwRSSjWYkSM/eDRA
OkSeRQ3Rnwc+Vad2R8J7unnZlMd4kALjGuzbyafWnitE+C+n0aJFDKqjIwNbKcw=
=GKp5
-----END PGP SIGNATURE-----
Merge tag 'for-5.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Update bio-based DM core to always call blk_queue_split() and update
DM targets to properly advertise discard limits that
blk_queue_split() looks at when dtermining to split discard. Whereby
allowing DM core's own 'split_discard_bios' to be removed.
- Improve DM cache target to provide support for discard passdown to
the origin device.
- Introduce support to directly boot to a DM mapped device from init by
using dm-mod.create= module param. This eliminates the need for an
elaborate initramfs that is otherwise needed to create DM devices.
This feature's implementation has been worked on for quite some time
(got up to v12) and is of particular interest to Android and other
more embedded platforms (e.g. ARM).
- Rate limit errors from the DM integrity target that were identified
as the cause for recent NMI hangs due to console limitations.
- Add sanity checks for user input to thin-pool and external snapshot
creation.
- Remove some unused leftover kmem caches from when old .request_fn
request-based support was removed.
- Various small cleanups and fixes to targets (e.g. typos, needless
unlikely() annotations, use struct_size(), remove needless
.direct_access method from dm-snapshot)
* tag 'for-5.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm integrity: limit the rate of error messages
dm snapshot: don't define direct_access if we don't support it
dm cache: add support for discard passdown to the origin device
dm writecache: fix typo in name for writeback_wq
dm: add support to directly boot to a mapped device
dm thin: add sanity checks to thin-pool and external snapshot creation
dm block manager: remove redundant unlikely annotation
dm verity fec: remove redundant unlikely annotation
dm integrity: remove redundant unlikely annotation
dm: always call blk_queue_split() in dm_process_bio()
dm: fix to_sector() for 32bit
dm switch: use struct_size() in kzalloc()
dm: remove unused _rq_tio_cache and _rq_cache
dm: eliminate 'split_discard_bios' flag from DM target interface
dm: update dm_process_bio() to split bio if in ->make_request_fn()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t
yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR
HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/
ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl
3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q
8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn
QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP
HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh
fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c
fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq
qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE
DRVVhYpocw==
=VBaU
-----END PGP SIGNATURE-----
Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"Not a huge amount of changes in this round, the biggest one is that we
finally have Mings multi-page bvec support merged. Apart from that,
this pull request contains:
- Small series that avoids quiescing the queue for sysfs changes that
match what we currently have (Aleksei)
- Series of bcache fixes (via Coly)
- Series of lightnvm fixes (via Mathias)
- NVMe pull request from Christoph. Nothing major, just SPDX/license
cleanups, RR mp policy (Hannes), and little fixes (Bart,
Chaitanya).
- BFQ series (Paolo)
- Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
for the fast path (Jianchao)
- fops->iopoll() added for async IO polling, this is a feature that
the upcoming io_uring interface will use (Christoph, me)
- Partition scan loop fixes (Dongli)
- mtip32xx conversion from managed resource API (Christoph)
- cdrom registration race fix (Guenter)
- MD pull from Song, two minor fixes.
- Various documentation fixes (Marcos)
- Multi-page bvec feature. This brings a lot of nice improvements
with it, like more efficient splitting, larger IOs can be supported
without growing the bvec table size, and so on. (Ming)
- Various little fixes to core and drivers"
* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
block: fix updating bio's front segment size
block: Replace function name in string with __func__
nbd: propagate genlmsg_reply return code
floppy: remove set but not used variable 'q'
null_blk: fix checking for REQ_FUA
block: fix NULL pointer dereference in register_disk
fs: fix guard_bio_eod to check for real EOD errors
blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
block: optimize bvec iteration in bvec_iter_advance
block: introduce mp_bvec_for_each_page() for iterating over page
block: optimize blk_bio_segment_split for single-page bvec
block: optimize __blk_segment_map_sg() for single-page bvec
block: introduce bvec_nth_page()
iomap: wire up the iopoll method
block: add bio_set_polled() helper
block: wire up block device iopoll method
fs: add an iopoll method to struct file_operations
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
loop: do not print warn message if partition scan is successful
block: bounce: make sure that bvec table is updated
...
When using dm-integrity underneath md-raid, some tests with raid
auto-correction trigger large amounts of integrity failures - and all
these failures print an error message. These messages can bring the
system to a halt if the system is using serial console.
Fix this by limiting the rate of error messages - it improves the speed
of raid recovery and avoids the hang.
Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Don't define a direct_access function that fails, dm_dax_direct_access
already fails with -EIO if the pointer is zero;
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
DM cache now defaults to passing discards down to the origin device.
User may disable this using the "no_discard_passdown" feature when
creating the cache device.
If the cache's underlying origin device doesn't support discards then
passdown is disabled (with warning). Similarly, if the underlying
origin device's max_discard_sectors is less than a cache block discard
passdown will be disabled (this is required because sizing of the cache
internal discard bitset depends on it).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The workqueue's name should be "writecache-writeback" instead of
"writecache-writeabck".
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add a "create" module parameter, which allows device-mapper targets to
be configured at boot time. This enables early use of DM targets in the
boot process (as the root device or otherwise) without the need of an
initramfs.
The syntax used in the boot param is based on the concise format from
the dmsetup tool to follow the rule of least surprise:
dmsetup table --concise /dev/mapper/lroot
Which is:
dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
Where,
<name> ::= The device name.
<uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
<minor> ::= The device minor number | ""
<flags> ::= "ro" | "rw"
<table> ::= <start_sector> <num_sectors> <target_type> <target_args>
<target_type> ::= "verity" | "linear" | ...
For example, the following could be added in the boot parameters:
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
Only the targets that were tested are allowed and the ones that don't
change any block device when the device is create as read-only. For
example, mirror and cache targets are not allowed. The rationale behind
this is that if the user makes a mistake, choosing the wrong device to
be the mirror or the cache can corrupt data.
The only targets initially allowed are:
* crypt
* delay
* linear
* snapshot-origin
* striped
* verity
Co-developed-by: Will Drewry <wad@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Invoking dm_get_device() twice on the same device path with different
modes is dangerous. Because in that case, upgrade_mode() will alloc a
new 'dm_dev' and free the old one, which may be referenced by a previous
caller. Dereferencing the dangling pointer will trigger kernel NULL
pointer dereference.
The following two cases can reproduce this issue. Actually, they are
invalid setups that must be disallowed, e.g.:
1. Creating a thin-pool with read_only mode, and the same device as
both metadata and data.
dmsetup create thinp --table \
"0 41943040 thin-pool /dev/vdb /dev/vdb 128 0 1 read_only"
BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
...
Call Trace:
new_read+0xfb/0x110 [dm_bufio]
dm_bm_read_lock+0x43/0x190 [dm_persistent_data]
? kmem_cache_alloc_trace+0x15c/0x1e0
__create_persistent_data_objects+0x65/0x3e0 [dm_thin_pool]
dm_pool_metadata_open+0x8c/0xf0 [dm_thin_pool]
pool_ctr.cold.79+0x213/0x913 [dm_thin_pool]
? realloc_argv+0x50/0x70 [dm_mod]
dm_table_add_target+0x14e/0x330 [dm_mod]
table_load+0x122/0x2e0 [dm_mod]
? dev_status+0x40/0x40 [dm_mod]
ctl_ioctl+0x1aa/0x3e0 [dm_mod]
dm_ctl_ioctl+0xa/0x10 [dm_mod]
do_vfs_ioctl+0xa2/0x600
? handle_mm_fault+0xda/0x200
? __do_page_fault+0x26c/0x4f0
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x55/0x150
entry_SYSCALL_64_after_hwframe+0x44/0xa9
2. Creating a external snapshot using the same thin-pool device.
dmsetup create thinp --table \
"0 41943040 thin-pool /dev/vdc /dev/vdb 128 0 2 ignore_discard"
dmsetup message /dev/mapper/thinp 0 "create_thin 0"
dmsetup create snap --table \
"0 204800 thin /dev/mapper/thinp 0 /dev/mapper/thinp"
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
Call Trace:
? __alloc_pages_nodemask+0x13c/0x2e0
retrieve_status+0xa5/0x1f0 [dm_mod]
? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
table_status+0x61/0xa0 [dm_mod]
ctl_ioctl+0x1aa/0x3e0 [dm_mod]
dm_ctl_ioctl+0xa/0x10 [dm_mod]
do_vfs_ioctl+0xa2/0x600
ksys_ioctl+0x60/0x90
? ksys_write+0x4f/0xb0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x55/0x150
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Do not just call blk_queue_split() if the bio is_abnormal_io().
Fixes: 568c73a355 ("dm: update dm_process_bio() to split bio if in ->make_request_fn()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is no need to have DM core split discards on behalf of a DM target
now that blk_queue_split() handles splitting discards based on the
queue_limits. A DM target just needs to set max_discard_sectors,
discard_granularity, etc, in queue_limits.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Must call blk_queue_split() otherwise queue_limits for abnormal requests
(e.g. discard, writesame, etc) won't be imposed.
In addition, add dm_queue_split() to simplify DM specific splitting that
is needed for targets that impose ti->max_io_len.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxm7pAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpl6JEACM5qHp7HEf7muuLKDUoX16G2eDOjacVxbL
q1kqyHNvrYD/aGo+8vcshCef6xno9fL1akIxTyaTcMwYJUk9JSMicsVimxC1OvI6
a5ZiWItX2L8Nh/heJe+FtutWbrT+Nd+3Q8DqI+U0YkRnjnXaRVgLFtBmjLOxBrqJ
Ps/VepB4GaxA0oWdPbhos/N3wa42uFy3ixdv3Kv6WmHdqraB9uagt8PwwUti3WzQ
uxWL6J+JOBSDha8l3fp68Okib1bm/6Nmmc9l8Yz1eFwf+Y+gVgw7wPQxkUD/XaFW
bDJGwp3NawK07EanIAIzfXUEGfLvgeRJBEP3OGwV/TAiHX5q9zQo/tbM6x8j4aT9
zGlwU/EnwFixgbRW/hOT5Ox4usBlfB1j0ZiNmgUm8QphHrELFnc35Kd+PR/KONNX
sI6ZiifEAMR+4S99kTZ5YjHUqcUVm9ndd8iQGW9mvM6vt3o1L6QKeOeEKBMlhMcx
V+JtViC50ojidYc82kEtQFY9OKRkc5x3k1wBsH49LGMT+fvEwETallOXHTarQKrv
QAZNN1NINkMmrL5bgBXFqf0qpOy4xHnhis5AilUHNZwa4G8iAe8oqz/2eUCydiV1
Ogx20a8T1ifeSkI2NXrwnBjVzqnfiO9wOb9py98BiLR6k59x3GYtbCdGtpIXfSFv
hG79KKoz3Q==
=8mjO
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190215' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Ensure we insert into the hctx dispatch list, if a request is marked
as DONTPREP (Jianchao)
- NVMe pull request, single missing unlock on error fix (Keith)
- MD pull request, single fix for a potentially data corrupting issue
(Nate)
- Floppy check_events regression fix (Yufen)
* tag 'for-linus-20190215' of git://git.kernel.dk/linux-block:
md/raid1: don't clear bitmap bits on interrupted recovery.
floppy: check_events callback should not return a negative number
nvme-pci: add missing unlock for reset error
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
-----BEGIN PGP SIGNATURE-----
iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
A6Ktjw==
=scY4
-----END PGP SIGNATURE-----
Merge tag 'v5.0-rc6' into for-5.1/block
Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.
* tag 'v5.0-rc6': (525 commits)
Linux 5.0-rc6
x86/mm: Make set_pmd_at() paravirt aware
MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
blk-mq: remove duplicated definition of blk_mq_freeze_queue
Blk-iolatency: warn on negative inflight IO counter
blk-iolatency: fix IO hang due to negative inflight counter
MAINTAINERS: unify reference to xen-devel list
x86/mm/cpa: Fix set_mce_nospec()
futex: Handle early deadlock return correctly
futex: Fix barrier comment
net: dsa: b53: Fix for failure when irq is not defined in dt
blktrace: Show requests without sector
mips: cm: reprime error cause
mips: loongson64: remove unreachable(), fix loongson_poweroff().
sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
geneve: should not call rt6_lookup() when ipv6 was disabled
KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
signal: Better detection of synchronous signals
...
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since bdced438ac ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.
Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.
Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().
For another user of blk_recalc_rq_segments():
- run in partial completion branch of blk_update_request, which is an unusual case
- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
since dm-rq is the only user.
Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the
current setup of the I/O path, as it isn't going to save you a significant amount
of cycles.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bch_bio_alloc_pages() is always called on one new bio, so it is safe
to access the bvec table directly. Given it is the only kind of this
case, open code the bvec table access since bio_for_each_segment_all()
will be changed to support for iterating over multipage bvec.
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When provisioning a new data block for a virtual block, either because
the block was previously unallocated or because we are breaking sharing,
if the whole block of data is being overwritten the bio that triggered
the provisioning is issued immediately, skipping copying or zeroing of
the data block.
When this bio completes the new mapping is inserted in to the pool's
metadata by process_prepared_mapping(), where the bio completion is
signaled to the upper layers.
This completion is signaled without first committing the metadata. If
the bio in question has the REQ_FUA flag set and the system crashes
right after its completion and before the next metadata commit, then the
write is lost despite the REQ_FUA flag requiring that I/O completion for
this request must only be signaled after the data has been committed to
non-volatile storage.
Fix this by deferring the completion of overwrite bios, with the REQ_FUA
flag set, until after the metadata has been committed.
Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
sync_request_write no longer submits writes to a Faulty device. This has
the unfortunate side effect that bitmap bits can be incorrectly cleared
if a recovery is interrupted (previously, end_sync_write would have
prevented this). This means the next recovery may not copy everything
it should, potentially corrupting data.
Add a function for doing the proper md_bitmap_end_sync, called from
end_sync_write and the Faulty case in sync_request_write.
backport note to 4.14: s/md_bitmap_end_sync/bitmap_end_sync
Cc: stable@vger.kernel.org 4.14+
Fixes: 0c9d5b127f ("md/raid1: avoid reusing a resync bio after error handling.")
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
bio_sectors() returns the value in the units of 512-byte sectors (no
matter what the real sector size of the device). dm-crypt multiplies
bio_sectors() by on_disk_tag_size to calculate the space allocated for
integrity tags. If dm-crypt is running with sector size larger than
512b, it allocates more data than is needed.
Device Mapper trims the extra space when passing the bio to
dm-integrity, so this bug didn't result in any visible misbehavior.
But it must be fixed to avoid wasteful memory allocation for the block
integrity payload.
Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # 4.12+
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In 'commit 752f66a75a ("bcache: use REQ_PRIO to indicate bio for
metadata")' REQ_META is replaced by REQ_PRIO to indicate metadata bio.
This assumption is not always correct, e.g. XFS uses REQ_META to mark
metadata bio other than REQ_PRIO. This is why Nix noticed that bcache
does not cache metadata for XFS after the above commit.
Thanks to Dave Chinner, he explains the difference between REQ_META and
REQ_PRIO from view of file system developer. Here I quote part of his
explanation from mailing list,
REQ_META is used for metadata. REQ_PRIO is used to communicate to
the lower layers that the submitter considers this IO to be more
important that non REQ_PRIO IO and so dispatch should be expedited.
IOWs, if the filesystem considers metadata IO to be more important
that user data IO, then it will use REQ_PRIO | REQ_META rather than
just REQ_META.
Then it seems bios with REQ_META or REQ_PRIO should both be cached for
performance optimation, because they are all probably low I/O latency
demand by upper layer (e.g. file system).
So in this patch, when we want to decide whether to bypass the cache,
REQ_META and REQ_PRIO are both checked. Then both metadata and
high priority I/O requests will be handled properly.
Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Andre Noll <maan@tuebingen.mpg.de>
Tested-by: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>