Add the dm-dust target, which simulates the behavior of bad sectors
at arbitrary locations, and the ability to enable the emulation of
the read failures at an arbitrary time.
This target behaves similarly to a linear target. At a given time,
the user can send a message to the target to start failing read
requests on specific blocks. When the failure behavior is enabled,
reads of blocks configured "bad" will fail with EIO.
Writes of blocks configured "bad" will result in the following:
1. Remove the block from the "bad block list".
2. Successfully complete the write.
After this point, the block will successfully contain the written
data, and will service reads and writes normally. This emulates the
behavior of a "remapped sector" on a hard disk drive.
dm-dust provides logging of which blocks have been added or removed
to the "bad block list", as well as logging when a block has been
removed from the bad block list. These messages can be used
alongside the messages from the driver using a dm-dust device to
analyze the driver's behavior when a read fails at a given time.
(This logging can be reduced via a "quiet" mode, if desired.)
NOTE: If the block size is larger than 512 bytes, only the first sector
of each "dust block" is detected. Placing a limiting layer above a dust
target, to limit the minimum I/O size to the dust block size, will
ensure proper emulation of the given large block size.
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Co-developed-by: Joe Shimkus <jshimkus@redhat.com>
Co-developed-by: John Dorminy <jdorminy@redhat.com>
Co-developed-by: John Pittman <jpittman@redhat.com>
Co-developed-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add a "create" module parameter, which allows device-mapper targets to
be configured at boot time. This enables early use of DM targets in the
boot process (as the root device or otherwise) without the need of an
initramfs.
The syntax used in the boot param is based on the concise format from
the dmsetup tool to follow the rule of least surprise:
dmsetup table --concise /dev/mapper/lroot
Which is:
dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
Where,
<name> ::= The device name.
<uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
<minor> ::= The device minor number | ""
<flags> ::= "ro" | "rw"
<table> ::= <start_sector> <num_sectors> <target_type> <target_args>
<target_type> ::= "verity" | "linear" | ...
For example, the following could be added in the boot parameters:
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
Only the targets that were tested are allowed and the ones that don't
change any block device when the device is create as read-only. For
example, mirror and cache targets are not allowed. The rationale behind
this is that if the user makes a mistake, choosing the wrong device to
be the mirror or the cache can corrupt data.
The only targets initially allowed are:
* crypt
* delay
* linear
* snapshot-origin
* striped
* verity
Co-developed-by: Will Drewry <wad@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The writecache target caches writes on persistent memory or SSD.
It is intended for databases or other programs that need extremely low
commit latency.
The writecache target doesn't cache reads because reads are supposed to
be cached in page cache in normal RAM.
If persistent memory isn't available this target can still be used in
SSD mode.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # fix missing goto
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> # fix compilation issue with !DAX
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # use msecs_to_jiffies
Acked-by: Dan Williams <dan.j.williams@intel.com> # reworks to unify ARM and x86 flushing
Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
This device mapper "unstriped" target remaps and unstripes I/O so it
is issued solely on a single drive in a HW RAID0 or dm-striped target.
In a 4 drive HW RAID0 the striped target exposes 1/4th of the LBA range
as a virtual drive. Each I/O to that virtual drive will only be issued
to the 1 drive that was selected of the 4 drives in the HW RAID0.
This unstriped target is most useful for Intel NVMe drives that have
multiple cores but that do not have firmware control to pin separate LBA
ranges to each discrete cpu core.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull MD update from Shaohua Li:
"This update mostly includes bug fixes:
- md-cluster now supports raid10 from Guoqing
- raid5 PPL fixes from Artur
- badblock regression fix from Bo
- suspend hang related fixes from Neil
- raid5 reshape fixes from Neil
- raid1 freeze deadlock fix from Nate
- memleak fixes from Zdenek
- bitmap related fixes from Me and Tao
- other fixes and cleanups"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (33 commits)
md: free unused memory after bitmap resize
md: release allocated bitset sync_set
md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb
md: be cautious about using ->curr_resync_completed for ->recovery_offset
badblocks: fix wrong return value in badblocks_set if badblocks are disabled
md: don't check MD_SB_CHANGE_CLEAN in md_allow_write
md-cluster: update document for raid10
md: remove redundant variable q
raid1: remove obsolete code in raid1_write_request
md-cluster: Use a small window for raid10 resync
md-cluster: Suspend writes in RAID10 if within range
md-cluster/raid10: set "do_balance = 0" if area is resyncing
md: use lockdep_assert_held
raid1: prevent freeze_array/wait_all_barriers deadlock
md: use TASK_IDLE instead of blocking signals
md: remove special meaning of ->quiesce(.., 2)
md: allow metadata update while suspending.
md: use mddev_suspend/resume instead of ->quiesce()
md: move suspend_hi/lo handling into core md code
md: don't call bitmap_create() while array is quiesced.
...
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Motivated by the desire to illiminate the imprecise nature of
DM-specific patches being unnecessarily sent to both the MD maintainer
and mailing-list. Which is born out of the fact that DM files also
reside in drivers/md/
Now all MD-specific files in drivers/md/ start with either "raid" or
"md-" and the MAINTAINERS file has been updated accordingly.
Shaohua: don't change module name
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The dm-zoned device mapper target provides transparent write access
to zoned block devices (ZBC and ZAC compliant block devices).
dm-zoned hides to the device user (a file system or an application
doing raw block device accesses) any constraint imposed on write
requests by the device, equivalent to a drive-managed zoned block
device model.
Write requests are processed using a combination of on-disk buffering
using the device conventional zones and direct in-place processing for
requests aligned to a zone sequential write pointer position.
A background reclaim process implemented using dm_kcopyd_copy ensures
that conventional zones are always available for executing unaligned
write requests. The reclaim process overhead is minimized by managing
buffer zones in a least-recently-written order and first targeting the
oldest buffer zones. Doing so, blocks under regular write access (such
as metadata blocks of a file system) remain stored in conventional
zones, resulting in no apparent overhead.
dm-zoned implementation focus on simplicity and on minimizing overhead
(CPU, memory and storage overhead). For a 14TB host-managed disk with
256 MB zones, dm-zoned memory usage per disk instance is at most about
3 MB and as little as 5 zones will be used internally for storing metadata
and performing buffer zone reclaim operations. This is achieved using
zone level indirection rather than a full block indirection system for
managing block movement between zones.
dm-zoned primary target is host-managed zoned block devices but it can
also be used with host-aware device models to mitigate potential
device-side performance degradation due to excessive random writing.
Zoned block devices can be formatted and checked for use with the dm-zoned
target using the dmzadm utility available at:
https://github.com/hgst/dm-zoned-tools
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer partly refactored Damien's original work to cleanup the code]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
whether blocks should migrate to/from the cache. The bio-prison-v2
interface supports this improvement by enabling direct dispatch of
work to workqueues rather than having to delay the actual work
dispatch to the DM cache core. So the dm-cache policies are much more
nimble by being able to drive IO as they see fit. One immediate
benefit from the improved latency is a cache that should be much more
adaptive to changing workloads.
- Add a new DM integrity target that emulates a block device that has
additional per-sector tags that can be used for storing integrity
information.
- Add a new authenticated encryption feature to the DM crypt target that
builds on the capabilities provided by the DM integrity target.
- Add MD interface for switching the raid4/5/6 journal mode and update
the DM raid target to use it to enable aid4/5/6 journal write-back
support.
- Switch the DM verity target over to using the asynchronous hash crypto
API (this helps work better with architectures that have access to
off-CPU algorithm providers, which should reduce CPU utilization).
- Various request-based DM and DM multipath fixes and improvements from
Bart and Christoph.
- A DM thinp target fix for a bio structure leak that occurs for each
discard IFF discard passdown is enabled.
- A fix for a possible deadlock in DM bufio and a fix to re-check the
new buffer allocation watermark in the face of competing admin changes
to the 'max_cache_size_bytes' tunable.
- A couple DM core cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZB6vtAAoJEMUj8QotnQNaoicIALuZTLElgAzxzA28cfk1+1Ea
Gd09CfJ3M6cvk/YGUU7WwiSYIwu16yOJALG4sLcYnEmUCzvKfFPcl/RpeSJHPpYM
0aVXa6NIJw7K2r3C17toiK2DRMHYw6QU843WeWI93vBW13lDJklNJL9fM7GBEOLH
NMSNw2mAq9ajtLlnJhM3ZfhloA7/u/jektvlBO1AA3RQ5Kx1cXVXFPqN7FdRfcqp
4RuEMe9faAadlXLsj3bia5IBmF/W0Qza6JilP+NLKLWB4fm7LZDjN/k+TsHWMa9e
cGR73TgUGLMBJX+sDJy8R3oeBG9JZkFVkD7I30eCjzyhSOs/54XNYQ23EkqHJU0=
=9Ryi
-----END PGP SIGNATURE-----
Merge tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- A major update for DM cache that reduces the latency for deciding
whether blocks should migrate to/from the cache. The bio-prison-v2
interface supports this improvement by enabling direct dispatch of
work to workqueues rather than having to delay the actual work
dispatch to the DM cache core. So the dm-cache policies are much more
nimble by being able to drive IO as they see fit. One immediate
benefit from the improved latency is a cache that should be much more
adaptive to changing workloads.
- Add a new DM integrity target that emulates a block device that has
additional per-sector tags that can be used for storing integrity
information.
- Add a new authenticated encryption feature to the DM crypt target
that builds on the capabilities provided by the DM integrity target.
- Add MD interface for switching the raid4/5/6 journal mode and update
the DM raid target to use it to enable aid4/5/6 journal write-back
support.
- Switch the DM verity target over to using the asynchronous hash
crypto API (this helps work better with architectures that have
access to off-CPU algorithm providers, which should reduce CPU
utilization).
- Various request-based DM and DM multipath fixes and improvements from
Bart and Christoph.
- A DM thinp target fix for a bio structure leak that occurs for each
discard IFF discard passdown is enabled.
- A fix for a possible deadlock in DM bufio and a fix to re-check the
new buffer allocation watermark in the face of competing admin
changes to the 'max_cache_size_bytes' tunable.
- A couple DM core cleanups.
* tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (50 commits)
dm bufio: check new buffer allocation watermark every 30 seconds
dm bufio: avoid a possible ABBA deadlock
dm mpath: make it easier to detect unintended I/O request flushes
dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit()
dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH
dm: introduce enum dm_queue_mode to cleanup related code
dm mpath: verify __pg_init_all_paths locking assumptions at runtime
dm: verify suspend_locking assumptions at runtime
dm block manager: remove an unused argument from dm_block_manager_create()
dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue()
dm mpath: delay requeuing while path initialization is in progress
dm mpath: avoid that path removal can trigger an infinite loop
dm mpath: split and rename activate_path() to prepare for its expanded use
dm ioctl: prevent stack leak in dm ioctl call
dm integrity: use previously calculated log2 of sectors_per_block
dm integrity: use hex2bin instead of open-coded variant
dm crypt: replace custom implementation of hex2bin()
dm crypt: remove obsolete references to per-CPU state
dm verity: switch to using asynchronous hash crypto API
dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
...
The dm-integrity target emulates a block device that has additional
per-sector tags that can be used for storing integrity information.
A general problem with storing integrity tags with every sector is that
writing the sector and the integrity tag must be atomic - i.e. in case of
crash, either both sector and integrity tag or none of them is written.
To guarantee write atomicity the dm-integrity target uses a journal. It
writes sector data and integrity tags into a journal, commits the journal
and then copies the data and integrity tags to their respective location.
The dm-integrity target can be used with the dm-crypt target - in this
situation the dm-crypt target creates the integrity data and passes them
to the dm-integrity target via bio_integrity_payload attached to the bio.
In this mode, the dm-crypt and dm-integrity targets provide authenticated
disk encryption - if the attacker modifies the encrypted device, an I/O
error is returned instead of random data.
The dm-integrity target can also be used as a standalone target, in this
mode it calculates and verifies the integrity tag internally. In this
mode, the dm-integrity target can be used to detect silent data
corruption on the disk or in the I/O path.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Implement the calculation of partial parity for a stripe and PPL write
logging functionality. The description of PPL is added to the
documentation. More details can be found in the comments in raid5-ppl.c.
Attach a page for holding the partial parity data to stripe_head.
Allocate it only if mddev has the MD_HAS_PPL flag set.
Partial parity is the xor of not modified data chunks of a stripe and is
calculated as follows:
- reconstruct-write case:
xor data from all not updated disks in a stripe
- read-modify-write case:
xor old data and parity from all updated disks in a stripe
Implement it using the async_tx API and integrate into raid_run_ops().
It must be called when we still have access to old data, so do it when
STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
stored into sh->ppl_page.
Partial parity is not meaningful for full stripe write and is not stored
in the log or used for recovery, so don't attempt to calculate it when
stripe has STRIPE_FULL_WRITE.
Put the PPL metadata structures to md_p.h because userspace tools
(mdadm) will also need to read/write PPL.
Warn about using PPL with enabled disk volatile write-back cache for
now. It can be removed once disk cache flushing before writing PPL is
implemented.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The cache policy interfaces have been updated to work well with the new
bio-prison v2 interface's ability to queue work immediately (promotion,
demotion, etc) -- overriding benefit being reduced latency on processing
IO through the cache. Previously such work would be left for the DM
cache core to queue on various lists and then process in batches later
-- this caused a serious delay in latency for IO driven by the cache.
The background tracker code was factored out so that all cache policies
can make use of it.
Also, the "cleaner" policy has been removed and is now a variant of the
smq policy that simply disallows migrations.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The deferred set is gone and all methods have _v2 appended to the end of
their names to allow for continued use of the original bio prison in DM
thin-provisioning.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add some seperation between bio-based and request-based DM core code.
'struct mapped_device' and other DM core only structures and functions
have been moved to dm-core.h and all relevant DM core .c files have been
updated to include dm-core.h rather than dm.h
DM targets should _never_ include dm-core.h!
[block core merge conflict resolution from Stephen Rothwell]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
smq seems to be performing better than the old mq policy in all
situations, as well as using a quarter of the memory.
Make 'mq' an alias for 'smq' when choosing a cache policy. The tunables
that were present for the old mq are faked, and have no effect. mq
should be considered deprecated now.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add support for correcting corrupted blocks using Reed-Solomon.
This code uses RS(255, N) interleaved across data and hash
blocks. Each error-correcting block covers N bytes evenly
distributed across the combined total data, so that each byte is a
maximum distance away from the others. This makes it possible to
recover from several consecutive corrupted blocks with relatively
small space overhead.
In addition, using verity hashes to locate erasures nearly doubles
the effectiveness of error correction. Being able to detect
corrupted blocks also improves performance, because only corrupted
blocks need to corrected.
For a 2 GiB partition, RS(255, 253) (two parity bytes for each
253-byte block) can correct up to 16 MiB of consecutive corrupted
blocks if erasures can be located, and 8 MiB if they cannot, with
16 MiB space overhead.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Prepare for extending dm-verity with an optional object. Follows the
naming convention used by other DM targets (e.g. dm-cache and dm-era).
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.
The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.
For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.
For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.
flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
The stochastic-multi-queue (smq) policy addresses some of the problems
with the current multiqueue (mq) policy.
Memory usage
------------
The mq policy uses a lot of memory; 88 bytes per cache block on a 64
bit machine.
SMQ uses 28bit indexes to implement it's data structures rather than
pointers. It avoids storing an explicit hit count for each block. It
has a 'hotspot' queue rather than a pre cache which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).
All these mean smq uses ~25bytes per cache block. Still a lot of
memory, but a substantial improvement nontheless.
Level balancing
---------------
MQ places entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)). This means the bottom
levels generally have the most entries, and the top ones have very
few. Having unbalanced levels like this reduces the efficacy of the
multiqueue.
SMQ does not maintain a hit count, instead it swaps hit entries with
the least recently used entry from the level above. The over all
ordering being a side effect of this stochastic process. With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
Adaptability
------------
The MQ policy maintains a hit count for each cache block. For a
different block to get promoted to the cache it's hit count has to
exceed the lowest currently in the cache. This means it can take a
long time for the cache to adapt between varying IO patterns.
Periodically degrading the hit counts could help with this, but I
haven't found a nice general solution.
SMQ doesn't maintain hit counts, so a lot of this problem just goes
away. In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote. If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels. This lets it adapt to new IO patterns very quickly.
Performance
-----------
In my tests SMQ shows substantially better performance than MQ. Once
this matures a bit more I'm sure it'll become the default policy.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Highlights:
- "experimental" code for managing md/raid1 across a cluster using
DLM. Code is not ready for general use and triggers a WARNING if used.
However it is looking good and mostly done and having in mainline
will help co-ordinate development.
- RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
handle a full (chunk wide) stripe as a single unit.
- RAID6 can now perform read-modify-write cycles which should
help performance on larger arrays: 6 or more devices.
- RAID5/6 stripe cache now grows and shrinks dynamically. The value
set is used as a minimum.
- Resync is now allowed to go a little faster than the 'mininum' when
there is competing IO. How much faster depends on the speed of the
devices, so the effective minimum should scale with device speed to
some extent.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVTbIxDnsnt1WYoG5AQIgAA//Z9FlEpkHcJJ75WGjrXgGJPyNTfEZOkoz
jnD8PpBY2Afp341vMatd0XKErdEAhuPQmJMAa+tbxht6pk77X3fSzXghGA8FEafg
yazn5pfBt6NXmepV4vhl/+LNYuWRxxLSA9EDm7wEg+tiO0UEts+a0w2TSXzcT1w+
30Yi1EjAlTaJ5yjlBHUOtTXWE43D6RKnUr6FMy2dRlnzlFyDlezRMChWo9v6OkQF
YqJ20FcvmJdLHY/6Yif3jvpm7eQecMqdCZENTvW/mJI86zqf6E+ToCYS1VNjfDnK
ud61iU9eshu4WtNWBG6KLuHBD0grO1NaEL7/S16w1KdNJMhYgiK8WussvIAJEesA
5SlETM7Y/1XFq8puwlAq2/tuPfhZ+TFxnAwce/C3hMTDcYAACnS/R6INFQXqGvy3
nX1NLogrCycX8oqxv3jTFKLVqIVwlkSlHcUGzIWjcfCF37StcXFKI5q862agyg2+
NNocFMuXhPPM1YcB9JJSo2nCsor4e9tTdVEZlFm2B3cc8LJ9BLWUMSoi1h7VK/1g
P7psnPIjz7/cdI2TZTFjGTZ0Kvhx/NTYp41AZealDNxeGWUNM+5xGZnUF8QRBc/E
0dGHtEAah834BDQFvNnJtuuh/s+KwbvswjNP+njoBsHjIQIvngDABpOwpIkdqF6r
diQ2gUPnHN0=
=OHG6
-----END PGP SIGNATURE-----
Merge tag 'md/4.1' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"More updates that usual this time. A few have performance impacts
which hould mostly be positive, but RAID5 (in particular) can be very
work-load ensitive... We'll have to wait and see.
Highlights:
- "experimental" code for managing md/raid1 across a cluster using
DLM. Code is not ready for general use and triggers a WARNING if
used. However it is looking good and mostly done and having in
mainline will help co-ordinate development.
- RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
handle a full (chunk wide) stripe as a single unit.
- RAID6 can now perform read-modify-write cycles which should help
performance on larger arrays: 6 or more devices.
- RAID5/6 stripe cache now grows and shrinks dynamically. The value
set is used as a minimum.
- Resync is now allowed to go a little faster than the 'mininum' when
there is competing IO. How much faster depends on the speed of the
devices, so the effective minimum should scale with device speed to
some extent"
* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
md/raid5: don't do chunk aligned read on degraded array.
md/raid5: allow the stripe_cache to grow and shrink.
md/raid5: change ->inactive_blocked to a bit-flag.
md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
md/raid5: pass gfp_t arg to grow_one_stripe()
md/raid5: introduce configuration option rmw_level
md/raid5: activate raid6 rmw feature
md/raid6 algorithms: xor_syndrome() for SSE2
md/raid6 algorithms: xor_syndrome() for generic int
md/raid6 algorithms: improve test program
md/raid6 algorithms: delta syndrome functions
raid5: handle expansion/resync case with stripe batching
raid5: handle io error of batch list
RAID5: batch adjacent full stripe write
raid5: track overwrite disk count
raid5: add a new flag to track if a stripe can be batched
raid5: use flex_array for scribble data
md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
md: allow resync to go faster when there is competing IO.
md: remove 'go_faster' option from ->sync_request()
...
Introduce a new target that is meant for file system developers to test file
system integrity at particular points in the life of a file system. We capture
all write requests and associated data and log them to a separate device
for later replay. There is a userspace utility to do this replay. The
idea behind this is to give file system developers a tool to verify that
the file system is always consistent.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-era is a target that behaves similar to the linear target. In
addition it keeps track of which blocks were written within a user
defined period of time called an 'era'. Each era target instance
maintains the current era as a monotonically increasing 32-bit
counter.
Use cases include tracking changed blocks for backup software, and
partially invalidating the contents of a cache to restore cache
coherency after rolling back a vendor snapshot.
dm-era is primarily expected to be paired with the dm-cache target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.
The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).
To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.
The patch introduces a new dm_kobject_holder structure, its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Support the collection of I/O statistics on user-defined regions of
a DM device. If no regions are defined no statistics are collected so
there isn't any performance impact. Only bio-based DM devices are
currently supported.
Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.
The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats but extra
counters (12 and 13) are provided: total time spent reading and
writing in milliseconds. All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.
The creation of DM statistics will allocate memory via kmalloc or
fallback to using vmalloc space. At most, 1/4 of the overall system
memory may be allocated by DM statistics. The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
See Documentation/device-mapper/statistics.txt for more details.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-switch is a new target that maps IO to underlying block devices
efficiently when there is a large number of fixed-sized address regions
but there is no simple pattern to allow for a compact mapping
representation such as dm-stripe.
Though we have developed this target for a specific storage device, Dell
EqualLogic, we have made an effort to keep it as general purpose as
possible in the hope that others may benefit.
Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.
Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.
See the wiki at http://bcache.evilpiepirate.org for more.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
A simple cache policy that writes back all data to the origin.
This is used to decommission a dm cache by emptying it.
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
A cache policy that uses a multiqueue ordered by recent hit
count to select which blocks should be promoted and demoted.
This is meant to be a general purpose policy. It prioritises
reads over writes.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a target that allows a fast device such as an SSD to be used as a
cache for a slower device such as a disk.
A plug-in architecture was chosen so that the decisions about which data
to migrate and when are delegated to interchangeable tunable policy
modules. The first general purpose module we have developed, called
"mq" (multiqueue), follows in the next patch. Other modules are
under development.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The bio prison code will be useful to other future DM targets so
move it to a separate module.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This device-mapper target creates a read-only device that transparently
validates the data on one underlying device against a pre-generated tree
of cryptographic checksums stored on a second device.
Two checksum device formats are supported: version 0 which is already
shipping in Chromium OS and version 1 which incorporates some
improvements.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Elly Jones <ellyjones@chromium.org>
Cc: Milan Broz <mbroz@redhat.com>
Cc: Olof Johansson <olofj@chromium.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Initial EXPERIMENTAL implementation of device-mapper thin provisioning
with snapshot support. The 'thin' target is used to create instances of
the virtual devices that are hosted in the 'thin-pool' target. The
thin-pool target provides data sharing among devices. This sharing is
made possible using the persistent-data library in the previous patch.
The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume, simplifying administration and
allowing sharing of data between volumes (thus reducing disk usage).
Another big feature is support for arbitrary depth of recursive
snapshots (snapshots of snapshots of snapshots ...). The previous
implementation of snapshots did this by chaining together lookup tables,
and so performance was O(depth). This new implementation uses a single
data structure so we don't get this degradation with depth.
For further information and examples of how to use this, please read
Documentation/device-mapper/thin-provisioning.txt
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The dm-bufio interface allows you to do cached I/O on devices,
holding recently-read blocks in memory and performing delayed writes.
We don't use buffer cache or page cache already present in the kernel, because:
* we need to handle block sizes larger than a page
* we can't allocate memory to perform reads or we'd have deadlocks
Currently, when a cache is required, we limit its size to a fraction of
available memory. Usage can be viewed and changed in
/sys/module/dm_bufio/parameters/ .
The first user is thin provisioning, but more dm users are planned.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This target is the same as the linear target except that it returns I/O
errors periodically. It's been found useful in simulating failing
devices for testing purposes.
I needed a dm target to do some failure testing on btrfs's raid code, and
Mike pointed me at this.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch is the skeleton for the DM target that will be
the bridge from DM to MD (initially RAID456 and later RAID1). It
provides a way to use device-mapper interfaces to the MD RAID456
drivers.
As with all device-mapper targets, the nominal public interfaces are the
constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO
and STATUSTYPE_TABLE). The CTR table looks like the following:
1: <s> <l> raid \
2: <raid_type> <#raid_params> <raid_params> \
3: <#raid_devs> <meta_dev1> <dev1> .. <meta_devN> <devN>
Line 1 contains the standard first three arguments to any device-mapper
target - the start, length, and target type fields. The target type in
this case is "raid".
Line 2 contains the arguments that define the particular raid
type/personality/level, the required arguments for that raid type, and
any optional arguments. Possible raid types include: raid4, raid5_la,
raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc. (again, raid1 is
planned for the future.) The list of required and optional parameters
is the same for all the current raid types. The required parameters are
positional, while the optional parameters are given as key/value pairs.
The possible parameters are as follows:
<chunk_size> Chunk size in sectors.
[[no]sync] Force/Prevent RAID initialization
[rebuild <idx>] Rebuild the drive indicated by the index
[daemon_sleep <ms>] Time between bitmap daemon work to clear bits
[min_recovery_rate <kB/sec/disk>] Throttle RAID initialization
[max_recovery_rate <kB/sec/disk>] Throttle RAID initialization
[max_write_behind <value>] See '-write-behind=' (man mdadm)
[stripe_cache <sectors>] Stripe cache size for higher RAIDs
Line 3 contains the list of devices that compose the array in
metadata/data device pairs. If the metadata is stored separately, a '-'
is given for the metadata device position. If a drive has failed or is
missing at creation time, a '-' can be given for both the metadata and
data drives for a given position.
Examples:
# RAID4 - 4 data drives, 1 parity
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)
0 1960893648 raid \
raid4 1 2048 \
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# Chunk size of 1MiB, force RAID initialization,
# min recovery rate at 20 kiB/sec/disk
0 1960893648 raid \
raid4 4 2048 min_recovery_rate 20 sync\
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
Performing a 'dmsetup table' should display the CTR table used to
construct the mapping (with possible reordering of optional
parameters).
Performing a 'dmsetup status' will yield information on the state and
health of the array. The output is as follows:
1: <s> <l> raid \
2: <raid_type> <#devices> <1 health char for each dev> <resync_ratio>
Line 1 is standard DM output. Line 2 is best shown by example:
0 1960893648 raid raid4 5 AAAAA 2/490221568
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with recovery.
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
drivers/md/unroll.pl replaced by awk script to drop build-time
dependency on perl
Signed-off-by: Vladimir Dronnikov <dronnikov@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This patch contains a device-mapper mirror log module that forwards
requests to userspace for processing.
The structures used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h. Due to the frequency,
diversity, and 2-way communication nature of the exchanges between
kernel and userspace, 'connector' was chosen as the interface for
communication.
The first log implementations written in userspace - "clustered-disk"
and "clustered-core" - support clustered shared storage. A userspace
daemon (in the LVM2 source code repository) uses openAIS/corosync to
process requests in an ordered fashion with the rest of the nodes in the
cluster so as to prevent log state corruption. Other implementations
with no association to LVM or openAIS/corosync, are certainly possible.
(Imagine if two machines are writing to the same region of a mirror.
They would both mark the region dirty, but you need a cluster-aware
entity that can handle properly marking the region clean when they are
done. Otherwise, you might clear the region when the first machine is
done, not the second.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds a service time oriented dynamic load balancer,
dm-service-time, which selects the path with the shortest estimated
service time for the incoming I/O.
The service time is estimated by dividing the in-flight I/O size
by a performance value of each path.
The performance value can be given as a table argument at the table
loading time. If no performance value is given, all paths are
considered equal.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds a dynamic load balancer, dm-queue-length, which
balances the number of in-flight I/Os across the paths.
The code is based on the patch posted by Stefan Bader:
https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the raid6 data processing routines into a standalone module
(raid6_pq) to prepare them to be called from async_tx wrappers and other
non-md drivers/modules. This precludes a circular dependency of raid456
needing the async modules for data processing while those modules in
turn depend on raid456 for the base level synchronous raid6 routines.
To support this move:
1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
2/ The raid6_call, recovery calls, and table symbols are exported
3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
compile
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Use the -y variables instead of the old -objs so we can easily add
conditional objects to the modules. Also always use += to add
subobjects to avoid problems when placing additional objects in
some place in the file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Move the existing snapshot exception store implementations out into
separate files. Later patches will place these behind a new
interface in preparation for alternative implementations.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Implement simple read-only sysfs entry for device-mapper block device.
This patch adds a simple sysfs directory named "dm" under block device
properties and implements
- name attribute (string containing mapped device name)
- uuid attribute (string containing UUID, or empty string if not set)
The kobject is embedded in mapped_device struct, so no additional
memory allocation is needed for initializing sysfs entry.
During the processing of sysfs attribute we need to lock mapped device
which is done by a new function dm_get_from_kobj, which returns the md
associated with kobject and increases the usage count.
Each 'show attribute' function is responsible for its own locking.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Separate the region hash code from raid1 so it can be shared by forthcoming
targets. Use BUG_ON() for failed async dm_io() calls.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch just removes infrastructure that provided support for hardware
handlers in the dm layer as it is not needed anymore.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This patch removes the 3 hardware handlers that currently exist
under dm as the functionality is moved to SCSI layer in the earlier
patches.
[jejb: removed more makefile hunks and rejection fixes]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>