2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-23 20:53:53 +08:00
Commit Graph

634391 Commits

Author SHA1 Message Date
JackieLiu
9b69173e5c md/raid5-cache: remove unnecessary function parameters
The function parameter 'recovery_list' is not used in
body, we can delete it

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:21 -08:00
Zhengyuan Liu
462eb7d872 raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache
r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as
the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine
would release the stripe later and add it into neither r5c_cached_full_stripes
list or r5c_cached_partial_stripes list and set correct flag.

Reviewed-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:45:14 -08:00
Zhengyuan Liu
f7b7bee75e raid5-cache: add another check conditon before replaying one stripe
New stripe that was just allocated has no STRIPE_R5C_CACHING state too,
add this check condition could avoid unnecessary replaying for empty stripe.

r5l_recovery_replay_one_stripe would reset stripe for any case, delete it
to make code more clean.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 11:56:20 -08:00
Dan Carpenter
d3014e21e1 md/r5cache: enable IRQs on error path
We need to re-enable the IRQs here before returning.

Fixes: a39f7afde3 ("md/r5cache: write-out phase and reclaim support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27 21:38:08 -08:00
Song Liu
d7bd398e97 md/r5cache: handle alloc_page failure
RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.

To handle alloc_page() failures, this patch adds an extra page to
disk_info. When alloc_page fails, handle_stripe() trys to use these
pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
the stripe is added to delayed_list.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27 21:35:38 -08:00
Shaohua Li
034e33f5ed md: stop write should stop journal reclaim
__md_stop_writes currently doesn't stop raid5-cache reclaim thread. It's
possible the reclaim thread is still running and doing write, which
doesn't match what __md_stop_writes should do. The extra ->quiesce()
call should not harm any raid types. For raid5-cache, this will
guarantee we reclaim all caches before we update superblock.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>
2016-11-23 19:30:25 -08:00
Shaohua Li
ce1ccd079f raid5-cache: suspend reclaim thread instead of shutdown
There is mechanism to suspend a kernel thread. Use it instead of playing
create/destroy game.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>
2016-11-23 19:30:06 -08:00
NeilBrown
1919cbb23b md/raid10: add failfast handling for writes.
When writing to a fastfail device, we use MD_FASTFAIL unless
it is the only device being written to.  For
resync/recovery, assume there was a working device to read
from so always use MD_FASTFAIL.

If a write for resync/recovery fails, we just fail the
device - there is not much else to do.

If a normal write fails, but the device cannot be marked
Faulty (must be only one left), we queue for write error
handling which calls narrow_write_error() to write the block
synchronously without any failfast flags.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22 09:14:42 -08:00
NeilBrown
8d3ca83dcf md/raid10: add failfast handling for reads.
If a device is marked FailFast, and it is not the only
device we can read from, we mark the bio as MD_FAILFAST.

If this does fail-fast, we don't try read repair but just
allow failure.

If it was the last device, it doesn't get marked Faulty so
the retry happens on the same device - this time without
FAILFAST.  A subsequent failure will not retry but will just
pass up the error.

During resync we may use FAILFAST requests, and on a failure
we will simply use the other device(s).

During recovery we will only use FAILFAST in the unusual
case were there are multiple places to read from - i.e. if
there are > 2 devices.  If we get a failure we will fail the
device and complete the resync/recovery with remaining
devices.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22 09:14:28 -08:00
NeilBrown
212e7eb7a3 md/raid1: add failfast handling for writes.
When writing to a fastfail device we use MD_FASTFAIL unless
it is the only device being written to.

For resync/recovery, assume there was a working device to
read from so always use REQ_FASTFAIL_DEV.

If a write for resync/recovery fails, we just fail the
device - there is not much else to do.

If a normal failfast write fails, but the device cannot be
failed (must be only one left), we queue for write error
handling.  This will call narrow_write_error() to retry the
write synchronously and without any FAILFAST flags.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22 09:14:10 -08:00
NeilBrown
2e52d449bc md/raid1: add failfast handling for reads.
If a device is marked FailFast and it is not the only device
we can read from, we mark the bio with REQ_FAILFAST_* flags.

If this does fail, we don't try read repair but just allow
failure.  If it was the last device it doesn't fail of
course, so the retry happens on the same device - this time
without FAILFAST.  A subsequent failure will not retry but
will just pass up the error.

During resync we may use FAILFAST requests and on a failure
we will simply use the other device(s).

During recovery we will only use FAILFAST in the unusual
case were there are multiple places to read from - i.e. if
there are > 2 devices.  If we get a failure we will fail the
device and complete the resync/recovery with remaining
devices.

The new R1BIO_FailFast flag is set on read reqest to suggest
the a FAILFAST request might be acceptable.  The rdev needs
to have FailFast set as well for the read to actually use
REQ_FAILFAST_*.

We need to know there are at least two working devices
before we can set R1BIO_FailFast, so we mustn't stop looking
at the first device we find.  So the "min_pending == 0"
handling to not exit early, but too always choose the
best_pending_disk if min_pending == 0.

The spinlocked region in raid1_error() in enlarged to ensure
that if two bios, reading from two different devices, fail
at the same time, then there is no risk that both devices
will be marked faulty, leaving zero "In_sync" devices.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22 09:13:18 -08:00
NeilBrown
46533ff7fe md: Use REQ_FAILFAST_* on metadata writes where appropriate
This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state.  i.e. if marking a device Faulty would cause some
data to be inaccessible, the device is status is left as
non-Faulty.  This is true for RAID1 and RAID10.

If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve chance of success.  We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22 09:11:33 -08:00
NeilBrown
688834e6ae md/failfast: add failfast flag for md to be used by some personalities.
This patch just adds a 'failfast' per-device flag which can be stored
in v0.90 or v1.x metadata.
The flag is not used yet but the intent is that it can be used for
mirrored (raid1/raid10) arrays where low latency is more important
than keeping all devices on-line.

Setting the flag for a device effectively gives permission for that
device to be marked as Faulty and excluded from the array on the first
error.  The underlying driver will be directed not to retry requests
that result in failures.  There is a proviso that the device must not
be marked faulty if that would cause the array as a whole to fail, it
may only be marked Faulty if the array remains functional, but is
degraded.

Failures on read requests will cause the device to be marked
as Faulty immediately so that further reads will avoid that
device.  No attempt will be made to correct read errors by
over-writing with the correct data.

It is expected that if transient errors, such as cable unplug, are
possible, then something in user-space will revalidate failed
devices and re-add them when they appear to be working again.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22 08:58:17 -08:00
Song Liu
3bddb7f8f2 md/r5cache: handle FLUSH and FUA
With raid5 cache, we committing data from journal device. When
there is flush request, we need to flush journal device's cache.
This was not needed in raid5 journal, because we will flush the
journal before committing data to raid disks.

This is similar to FUA, except that we also need flush journal for
FUA. Otherwise, corruptions in earlier meta data will stop recovery
from reaching FUA data.

slightly changed the code by Shaohua

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 17:13:49 -08:00
Song Liu
5aabf7c49d md/r5cache: r5cache recovery: part 2
1. In previous patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In this patchpatch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:28:28 -08:00
Song Liu
b4c625c673 md/r5cache: r5cache recovery: part 1
Recovery of write-back cache has different logic to write-through only
cache. Specifically, for write-back cache, the recovery need to scan
through all active journal entries before flushing data out. Therefore,
large portion of the recovery logic is rewritten here.

To make the diffs cleaner, we split the rewrite as follows:

1. In this patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In next patch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions

With cache feature, there are 2 different scenarios of recovery:
1. Data-Parity stripe: a stripe with complete parity in journal.
2. Data-Only stripe: a stripe with only data in journal (or partial
   parity).

The code differentiate Data-Parity stripe from Data-Only stripe with
flag STRIPE_R5C_CACHING.

For Data-Parity stripes, we use the same procedure as raid5 journal,
where all the data and parity are replayed to the RAID devices.

For Data-Only strips, we need to finish complete calculate parity and
finish the full reconstruct write or RMW write. For simplicity, in
the recovery, we load the stripe to stripe cache. Once the array is
started, the stripe cache state machine will handle these stripes
through normal write path.

r5c_recovery_flush_log contains the main procedure of recovery. The
recovery code first scans through the journal and loads data to
stripe cache. The code keeps tracks of all these stripes in a list
(use sh->lru and ctx->cached_list), stripes in the list are
organized in the order of its first appearance on the journal.
During the scan, the recovery code assesses each stripe as
Data-Parity or Data-Only.

During scan, the array may run out of stripe cache. In these cases,
the recovery code will also call raid5_set_cache_size to increase
stripe cache size. If the array still runs out of stripe cache
because there isn't enough memory, the array will not assemble.

At the end of scan, the recovery code replays all Data-Parity
stripes, and sets proper states for Data-Only stripes. The recovery
code also increases seq number by 10 and rewrites all Data-Only
stripes to journal. This is to avoid confusion after repeated
crashes. More details is explained in raid5-cache.c before
r5c_recovery_rewrite_data_only_stripes().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:28:14 -08:00
Song Liu
9ed988f5dc md/r5cache: refactoring journal recovery code
1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block();
2. pull the code that initialize r5l_meta_block from
   r5l_log_write_empty_meta_block() to a separate function
   r5l_recovery_create_empty_meta_block(), so that we can reuse this
   piece of code.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:27:45 -08:00
Song Liu
2c7da14b90 md/r5cache: sysfs entry journal_mode
With write cache, journal_mode is the knob to switch between
write-back and write-through.

Below is an example:

root@virt-test:~/# cat /sys/block/md0/md/journal_mode
[write-through] write-back
root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
root@virt-test:~/# cat /sys/block/md0/md/journal_mode
write-through [write-back]

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:27:24 -08:00
Song Liu
a39f7afde3 md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.

In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
   if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
   or cached stripes is enough for a full stripe (chunk size / 4k)
   (r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)

r5c_do_reclaim() contains new logic of reclaim.

For stripe cache:

When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.

For log space:

To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.

Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.

r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.

When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).

This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:48 -08:00
Song Liu
1e6d690b93 md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
   (r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
   (r5c_handle_data_cached, r5c_return_dev_pending_writes).

Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks

This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.

Writing-out phase of the cache is implemented in the next patch.

With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.

This patch adds 2 flags to stripe_head.state:
 - STRIPE_R5C_PARTIAL_STRIPE,
 - STRIPE_R5C_FULL_STRIPE,

Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".

For RMW, the code allocates an extra page for each data block
being updated.  This is stored in r5dev->orig_page and the old data
is read into it.  Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.

r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.

There are some known limitations of the cache implementation:

1. Write cache only covers full page writes (R5_OVERWRITE). Writes
   of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
   writes for the same stripe have to wait. This can be improved by
   moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
   is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
   the log, so recovery code has to replay more than necessary data
   (sometimes all the log from last_checkpoint). This reduces
   availability of the array.

This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:30 -08:00
Song Liu
2ded370373 md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
  - write-back (R5C_MODE_WRITE_BACK)
  - write-through (R5C_MODE_WRITE_THROUGH)

Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.

With write-back cache, every stripe could operate in two different
phases:
  - caching
  - writing-out

In caching phase, the stripe handles writes as:
  - write to journal
  - return IO

In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.

STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.

Please note: this is a "no-op" patch for raid5-cache write-through
mode.

The following detailed explanation is copied from the raid5-cache.c:

/*
 * raid5 cache state machine
 *
 * With rhe RAID cache, each stripe works in two phases:
 *      - caching phase
 *      - writing-out phase
 *
 * These two phases are controlled by bit STRIPE_R5C_CACHING:
 *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
 *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
 *
 * When there is no journal, or the journal is in write-through mode,
 * the stripe is always in writing-out phase.
 *
 * For write-back journal, the stripe is sent to caching phase on write
 * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
 * the write-out phase by clearing STRIPE_R5C_CACHING.
 *
 * Stripes in caching phase do not write the raid disks. Instead, all
 * writes are committed from the log device. Therefore, a stripe in
 * caching phase handles writes as:
 *      - write to log device
 *      - return IO
 *
 * Stripes in writing-out phase handle writes as:
 *      - calculate parity
 *      - write pending data and parity to journal
 *      - write data and parity to raid disks
 *      - return IO for pending writes
 */

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:07 -08:00
Song Liu
937621c36e md/r5cache: move some code to raid5.h
Move some define and inline functions to raid5.h, so they can be
used in raid5-cache.c

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:25:40 -08:00
Song Liu
c757ec95c2 md/r5cache: Check array size in r5l_init_log
Currently, r5l_write_stripe checks meta size for each stripe write,
which is not necessary.

With this patch, r5l_init_log checks maximal meta size of the array,
which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
If this is too big to fit in one page, r5l_init_log aborts.

With current meta data, r5l_log support raid_disks up to 203.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:24:46 -08:00
Shaohua Li
504634f60f md: add blktrace event for writes to superblock
superblock write is an expensive operation. With raid5-cache, it can be called
regularly. Tracing to help performance debug.

Signed-off-by: Shaohua Li <shli@fb.com>
Cc: NeilBrown <neilb@suse.com>
2016-11-18 09:47:57 -08:00
NeilBrown
578b54ade8 md/raid1, raid10: add blktrace records when IO is delayed
Both raid1 and raid10 will sometimes delay handling an IO request,
such as when resync is happening or there are too many requests queued.

Add some blktrace messsages so we can see when that is happening when
looking for performance artefacts.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 09:35:37 -08:00
NeilBrown
581dbd94da md/bitmap: add blktrace event for writes to the bitmap
We trace wheneven bitmap_unplug() finds that it needs to write
to the bitmap, or when bitmap_daemon_work() find there is work
to do.

This makes it easier to correlate bitmap updates with data writes.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 09:34:45 -08:00
NeilBrown
109e376530 md: add block tracing for bio_remapping
The block tracing infrastructure (accessed with blktrace/blkparse)
supports the tracing of mapping bios from one device to another.
This is currently used when a bio in a partition is mapped to the
whole device, when bios are mapped by dm, and for mapping in md/raid5.
Other md personalities do not include this tracing yet, so add it.

When a read-error is detected we redirect the request to a different device.
This could justifiably be seen as a new mapping for the originial bio,
or a secondary mapping for the bio that errors.  This patch uses
the second option.

When md is used under dm-raid, the mappings are not traced as we do
not have access to the block device number of the parent.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 09:32:50 -08:00
Shaohua Li
354b445b5f raid5-cache: fix lockdep warning
lockdep reports warning of the rcu_dereference usage. Using normal rdev
access pattern to avoid the warning.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 11:30:27 -08:00
NeilBrown
6119e6792b md: remove md_super_wait() call after bitmap_flush()
bitmap_flush() finishes with bitmap_update_sb(), and that finishes
with write_page(..., 1), so write_page() will wait for all writes
to complete.  So there is no point calling md_super_wait()
immediately afterwards.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-09 17:14:28 -08:00
NeilBrown
be306c2989 md: define mddev flags, recovery flags and r1bio state bits using enums
This is less error prone than using individual #defines.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-09 12:53:52 -08:00
NeilBrown
f2c771a655 md/raid1: fix: IO can block resync indefinitely
While performing a resync/recovery, raid1 divides the
array space into three regions:
 - before the resync
 - at or shortly after the resync point
 - much further ahead of the resync point.

Write requests to the first or third do not need to wait.  Write
requests to the middle region do need to wait if resync requests are
pending.

If there are any active write requests in the middle region, resync
will wait for them.

Due to an accounting error, there is a small range of addresses,
between conf->next_resync and conf->start_next_window, where write
requests will *not* be blocked, but *will* be counted in the middle
region.  This can effectively block resync indefinitely if filesystem
writes happen repeatedly to this region.

As ->next_window_requests is incremented when the sector is after
  conf->start_next_window + NEXT_NORMALIO_DISTANCE
the same boundary should be used for determining when write requests
should wait.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-09 12:53:24 -08:00
NeilBrown
85c9ccd4f0 md/bitmap: Don't write bitmap while earlier writes might be in-flight
As we don't wait for writes to complete in bitmap_daemon_work, they
could still be in-flight when bitmap_unplug writes again.  Or when
bitmap_daemon_work tries to write again.
This can be confusing and could risk the wrong data being written last.

So make sure we wait for old writes to complete before new writes start.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:23 -08:00
NeilBrown
a9ae93c8cc md/raid10: abort delayed writes when device fails.
When writing to an array with a bitmap enabled, the writes are grouped
in batches which are preceded by an update to the bitmap.

It is quite likely if that a drive develops a problem which is not
media related, that the bitmap write will be the first to report an
error and cause the device to be marked faulty (as the bitmap write is
at the start of a batch).

In this case, there is point submiting the subsequent writes to the
failed device - that just wastes times.

So re-check the Faulty state of a device before submitting a
delayed write.

This requires that we keep the 'rdev', rather than the 'bdev' in the
bio, then swap in the bdev just before final submission.

Reported-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:23 -08:00
NeilBrown
5e2c7a3611 md/raid1: abort delayed writes when device fails.
When writing to an array with a bitmap enabled, the writes are grouped
in batches which are preceded by an update to the bitmap.

It is quite likely if that a drive develops a problem which is not
media related, that the bitmap write will be the first to report an
error and cause the device to be marked faulty (as the bitmap write is
at the start of a batch).

In this case, there is point submiting the subsequent writes to the
failed device - that just wastes times.

So re-check the Faulty state of a device before submitting a
delayed write.

This requires that we keep the 'rdev', rather than the 'bdev' in the
bio, then swap in the bdev just before final submission.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:23 -08:00
NeilBrown
060b0689f5 md: perform async updates for metadata where possible.
When adding devices to, or removing device from, an array we need to
update the metadata.  However we don't need to do it synchronously as
data integrity doesn't depend on these changes being recorded
instantly.  So avoid the synchronous call to md_update_sb and just set
a flag so that the thread will do it.

This can reduce the number of updates performed when lots of devices
are being added or removed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:23 -08:00
JackieLiu
3fd880af41 raid5-cache: restrict the use area of the log_offset variable
We can calculate this offset by using ctx->meta_total_blocks,
without passing in from the function

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
NeilBrown
cc6167b4f3 md/raid5: change printk() to pr_*()
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
NeilBrown
08464e0926 md/raid10: change printk() to pr_*()
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
NeilBrown
1d41c216fe md/raid1: change printk() to pr_*()
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
NeilBrown
766038846e md/raid0: replace printk() with pr_*()
This makes md/raid0 much less verbose as the messages about
the array geometry are now pr_debug()

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
NeilBrown
7279694da4 md/multipath: replace printk() with pr_*()
Also remove all messages about memory allocation failure.
page_alloc() reports those.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
NeilBrown
a2e202afa6 md/linear: replace printk() with pr_*()
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
NeilBrown
ec0cc22685 md/bitmap: change all printk() to pr_*()
Follow err/warn distinction introduced in md.c
Join multi-part strings into single string.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
NeilBrown
9d48739ef1 md: change all printk() to pr_err() or pr_warn() etc.
1/ using pr_debug() for a number of messages reduces the noise of
   md, but still allows them to be enabled when needed.
2/ try to be consistent in the usage of pr_err() and pr_warn(), and
   document the intention
3/ When strings have been split onto multiple lines, rejoin into
   a single string.
   The cost of having lines > 80 chars is less than the cost of not
   being able to easily search for a particular message.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
NeilBrown
7f0f0d87fa md: fix some issues with alloc_disk_sb()
1/ don't print a warning if allocation fails.
 page_alloc() does that already.
2/ always check return status for error.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
Guoqing Jiang
cbb3873236 md/bitmap: call bitmap_file_unmap once bitmap_storage_alloc returns -ENOMEM
It is possible that bitmap_storage_alloc could return -ENOMEM,
and some member inside store could be allocated such as filemap.

To avoid memory leak, we need to call bitmap_file_unmap to free
those members in the bitmap_resize.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
Tomasz Majchrzak
7adb072ca8 raid5: revert commit 11367799f3
Revert commit 11367799f3 ("md: Prevent IO hold during accessing to faulty
raid5 array") as it doesn't comply with commit c3cce6cda1 ("md/raid5:
ensure device failure recorded before write request returns."). That change
is not required anymore as the problem is resolved by commit 16f889499a
("md: report 'write_pending' state when array in sync") - read request is
stuck as array state is not reported correctly via sysfs attribute.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
Tomasz Majchrzak
91a6c4aded md: wake up personality thread after array state update
When raid1/raid10 array fails to write to one of the drives, the request
is added to bio_end_io_list and finished by personality thread. The
thread doesn't handle it as long as MD_CHANGE_PENDING flag is set. In
case of external metadata this flag is cleared, however the thread is
not woken up. It causes request to be blocked for few seconds (until
another action on the array wakes up the thread) or to get stuck
indefinitely.

Wake up personality thread once MD_CHANGE_PENDING has been cleared.
Moving 'restart_array' call after the flag is cleared it not a solution
because in read-write mode the call doesn't wake up the thread.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:21 -08:00
Tomasz Majchrzak
dcbcb48650 md: don't fail an array if there are unacknowledged bad blocks
If external metadata handler supports bad blocks and unacknowledged bad
blocks are present, don't report disk via sysfs as faulty. Such
situation can be still handled so disk just has to be blocked for a
moment. It makes it consistent with kernel state as corresponding rdev
flag is also not set.

When the disk in being unblocked there are few cases:
1. Disk has been in blocked and faulty state, it is being unblocked but
it still remains in faulty state. Metadata handler will remove it from
array in the next call.
2. There is no bad block support in external metadata handler and bad
blocks are present - put the disk in blocked and faulty state (see
case 1).
3. There is bad block support in external metadata handler and all bad
blocks are acknowledged - clear all flags, continue.
4. There is bad block support in external metadata handler but there are
still unacknowledged bad blocks - clear all flags, continue. It is fine
to clear Blocked flag because it was probably not set anyway (if it was
it is case 1). BlockedBadBlocks flag can also be cleared because the
request waiting for it will set it again when it finds out that some bad
block is still not acknowledged. Recovery is not necessary but there are
no problems if the flag is set. Sysfs rdev state is still reported as
blocked (due to unacknowledged bad blocks) so metadata handler will
process remaining bad blocks and unblock disk again.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:20 -08:00
Tomasz Majchrzak
35b785f769 md: add bad block support for external metadata
Add new rdev flag which external metadata handler can use to switch
on/off bad block support. If new bad block is encountered, notify it via
rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
cleared, notify update to rdev 'bad_blocks' sysfs file.

When bad blocks support is being removed, just clear rdev flag. It is
not necessary to reset badblocks->shift field. If there are bad blocks
cleared or added at the same time, it is ok for those changes to be
applied to the structure. The array is in blocked state and the drive
which cannot handle bad blocks any more will be removed from the array
before it is unlocked.

Simplify state_show function by adding a separator at the end of each
string and overwrite last separator with new line.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:20 -08:00