Commit Graph

116 Commits

Author SHA1 Message Date
JackieLiu
dbd22c8d7f md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_log
The next_cp_seq field is useless, remove it.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:22 -08:00
JackieLiu
bc8f167f9c md/raid5-cache: release the stripe_head at the appropriate location
If we released the 'stripe_head' in r5c_recovery_flush_log,
ctx->cached_list will both release the data-parity stripes and
data-only stripes, which will become empty.
And we also need to use the data-only stripes in
r5c_recovery_rewrite_data_only_stripes, so we should wait util rewrite
data-only stripes is done before releasing them.

Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:22 -08:00
JackieLiu
fc833c2a2f md/raid5-cache: use ring add to prevent overflow
'write_pos' must be protected with 'r5l_ring_add', or it may overflow

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:21 -08:00
JackieLiu
9b69173e5c md/raid5-cache: remove unnecessary function parameters
The function parameter 'recovery_list' is not used in
body, we can delete it

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:46:21 -08:00
Zhengyuan Liu
462eb7d872 raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache
r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as
the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine
would release the stripe later and add it into neither r5c_cached_full_stripes
list or r5c_cached_partial_stripes list and set correct flag.

Reviewed-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 14:45:14 -08:00
Zhengyuan Liu
f7b7bee75e raid5-cache: add another check conditon before replaying one stripe
New stripe that was just allocated has no STRIPE_R5C_CACHING state too,
add this check condition could avoid unnecessary replaying for empty stripe.

r5l_recovery_replay_one_stripe would reset stripe for any case, delete it
to make code more clean.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29 11:56:20 -08:00
Dan Carpenter
d3014e21e1 md/r5cache: enable IRQs on error path
We need to re-enable the IRQs here before returning.

Fixes: a39f7afde3 ("md/r5cache: write-out phase and reclaim support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27 21:38:08 -08:00
Song Liu
d7bd398e97 md/r5cache: handle alloc_page failure
RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.

To handle alloc_page() failures, this patch adds an extra page to
disk_info. When alloc_page fails, handle_stripe() trys to use these
pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
the stripe is added to delayed_list.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27 21:35:38 -08:00
Shaohua Li
ce1ccd079f raid5-cache: suspend reclaim thread instead of shutdown
There is mechanism to suspend a kernel thread. Use it instead of playing
create/destroy game.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>
2016-11-23 19:30:06 -08:00
Ming Lei
3a83f46775 block: bio: pass bvec table to bio_init()
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:21 -07:00
Song Liu
3bddb7f8f2 md/r5cache: handle FLUSH and FUA
With raid5 cache, we committing data from journal device. When
there is flush request, we need to flush journal device's cache.
This was not needed in raid5 journal, because we will flush the
journal before committing data to raid disks.

This is similar to FUA, except that we also need flush journal for
FUA. Otherwise, corruptions in earlier meta data will stop recovery
from reaching FUA data.

slightly changed the code by Shaohua

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 17:13:49 -08:00
Song Liu
5aabf7c49d md/r5cache: r5cache recovery: part 2
1. In previous patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In this patchpatch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:28:28 -08:00
Song Liu
b4c625c673 md/r5cache: r5cache recovery: part 1
Recovery of write-back cache has different logic to write-through only
cache. Specifically, for write-back cache, the recovery need to scan
through all active journal entries before flushing data out. Therefore,
large portion of the recovery logic is rewritten here.

To make the diffs cleaner, we split the rewrite as follows:

1. In this patch, we:
      - add new data to r5l_recovery_ctx
      - add new functions to recovery write-back cache
   The new functions are not used in this patch, so this patch does not
   change the behavior of recovery.

2. In next patch, we:
      - modify main recovery procedure r5l_recovery_log() to call new
        functions
      - remove old functions

With cache feature, there are 2 different scenarios of recovery:
1. Data-Parity stripe: a stripe with complete parity in journal.
2. Data-Only stripe: a stripe with only data in journal (or partial
   parity).

The code differentiate Data-Parity stripe from Data-Only stripe with
flag STRIPE_R5C_CACHING.

For Data-Parity stripes, we use the same procedure as raid5 journal,
where all the data and parity are replayed to the RAID devices.

For Data-Only strips, we need to finish complete calculate parity and
finish the full reconstruct write or RMW write. For simplicity, in
the recovery, we load the stripe to stripe cache. Once the array is
started, the stripe cache state machine will handle these stripes
through normal write path.

r5c_recovery_flush_log contains the main procedure of recovery. The
recovery code first scans through the journal and loads data to
stripe cache. The code keeps tracks of all these stripes in a list
(use sh->lru and ctx->cached_list), stripes in the list are
organized in the order of its first appearance on the journal.
During the scan, the recovery code assesses each stripe as
Data-Parity or Data-Only.

During scan, the array may run out of stripe cache. In these cases,
the recovery code will also call raid5_set_cache_size to increase
stripe cache size. If the array still runs out of stripe cache
because there isn't enough memory, the array will not assemble.

At the end of scan, the recovery code replays all Data-Parity
stripes, and sets proper states for Data-Only stripes. The recovery
code also increases seq number by 10 and rewrites all Data-Only
stripes to journal. This is to avoid confusion after repeated
crashes. More details is explained in raid5-cache.c before
r5c_recovery_rewrite_data_only_stripes().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:28:14 -08:00
Song Liu
9ed988f5dc md/r5cache: refactoring journal recovery code
1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block();
2. pull the code that initialize r5l_meta_block from
   r5l_log_write_empty_meta_block() to a separate function
   r5l_recovery_create_empty_meta_block(), so that we can reuse this
   piece of code.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:27:45 -08:00
Song Liu
2c7da14b90 md/r5cache: sysfs entry journal_mode
With write cache, journal_mode is the knob to switch between
write-back and write-through.

Below is an example:

root@virt-test:~/# cat /sys/block/md0/md/journal_mode
[write-through] write-back
root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
root@virt-test:~/# cat /sys/block/md0/md/journal_mode
write-through [write-back]

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:27:24 -08:00
Song Liu
a39f7afde3 md/r5cache: write-out phase and reclaim support
There are two limited resources, stripe cache and journal disk space.
For better performance, we priotize reclaim of full stripe writes.
To free up more journal space, we free earliest data on the journal.

In current implementation, reclaim happens when:
1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
   if there is no reclaim in the past 5 seconds.
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
   or cached stripes is enough for a full stripe (chunk size / 4k)
   (r5c_check_cached_full_stripe)
3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)

r5c_do_reclaim() contains new logic of reclaim.

For stripe cache:

When stripe cache pressure is high (more than 3/4 stripes are cached,
or there is empty inactive lists), flush all full stripe. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, flush some paritial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.

For log space:

To avoid deadlock due to log space, we need to reserve enough space
to flush cached data. The size of required log space depends on total
number of cached stripes (stripe_in_journal_count). In current
implementation, the writing-out phase automatically include pending
data writes with parity writes (similar to write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates log space
required to flush cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.

Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.

r5c_cache keeps all data in cache (not fully committed to RAID) in
a list (stripe_in_journal_list). These stripes are in the order of their
first appearance on the journal. So the log tail (last_checkpoint)
should point to the journal_start of the first item in the list.

When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
set, the state machine only writes data that are already in the
log device (in stripe_in_journal_list).

This patch includes a fix to improve performance by
Shaohua Li <shli@fb.com>.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:48 -08:00
Song Liu
1e6d690b93 md/r5cache: caching phase of r5cache
As described in previous patch, write back cache operates in two
phases: caching and writing-out. The caching phase works as:
1. write data to journal
   (r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
   (r5c_handle_data_cached, r5c_return_dev_pending_writes).

Then the writing-out phase is as:
1. Mark the stripe as write-out (r5c_make_stripe_write_out)
2. Calcualte parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks

This patch implements caching phase. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.

Writing-out phase of the cache is implemented in the next patch.

With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multipile IO due to read-modify-write of
parity) and provide more opportunities of full stripe writes.

This patch adds 2 flags to stripe_head.state:
 - STRIPE_R5C_PARTIAL_STRIPE,
 - STRIPE_R5C_FULL_STRIPE,

Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".

For RMW, the code allocates an extra page for each data block
being updated.  This is stored in r5dev->orig_page and the old data
is read into it.  Then the prexor calculation subtracts ->orig_page
from the parity block, and the reconstruct calculation adds the
->page data back into the parity block.

r5cache naturally excludes SkipCopy. When the array has write back
cache, async_copy_data() will not skip copy.

There are some known limitations of the cache implementation:

1. Write cache only covers full page writes (R5_OVERWRITE). Writes
   of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at anytime. Later
   writes for the same stripe have to wait. This can be improved by
   moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
   is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
   the log, so recovery code has to replay more than necessary data
   (sometimes all the log from last_checkpoint). This reduces
   availability of the array.

This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:30 -08:00
Song Liu
2ded370373 md/r5cache: State machine for raid5-cache write back mode
This patch adds state machine for raid5-cache. With log device, the
raid456 array could operate in two different modes (r5c_journal_mode):
  - write-back (R5C_MODE_WRITE_BACK)
  - write-through (R5C_MODE_WRITE_THROUGH)

Existing code of raid5-cache only has write-through mode. For write-back
cache, it is necessary to extend the state machine.

With write-back cache, every stripe could operate in two different
phases:
  - caching
  - writing-out

In caching phase, the stripe handles writes as:
  - write to journal
  - return IO

In writing-out phase, the stripe behaviors as a stripe in write through
mode R5C_MODE_WRITE_THROUGH.

STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
writing-out phase.

Please note: this is a "no-op" patch for raid5-cache write-through
mode.

The following detailed explanation is copied from the raid5-cache.c:

/*
 * raid5 cache state machine
 *
 * With rhe RAID cache, each stripe works in two phases:
 *      - caching phase
 *      - writing-out phase
 *
 * These two phases are controlled by bit STRIPE_R5C_CACHING:
 *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
 *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
 *
 * When there is no journal, or the journal is in write-through mode,
 * the stripe is always in writing-out phase.
 *
 * For write-back journal, the stripe is sent to caching phase on write
 * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
 * the write-out phase by clearing STRIPE_R5C_CACHING.
 *
 * Stripes in caching phase do not write the raid disks. Instead, all
 * writes are committed from the log device. Therefore, a stripe in
 * caching phase handles writes as:
 *      - write to log device
 *      - return IO
 *
 * Stripes in writing-out phase handle writes as:
 *      - calculate parity
 *      - write pending data and parity to journal
 *      - write data and parity to raid disks
 *      - return IO for pending writes
 */

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:26:07 -08:00
Song Liu
c757ec95c2 md/r5cache: Check array size in r5l_init_log
Currently, r5l_write_stripe checks meta size for each stripe write,
which is not necessary.

With this patch, r5l_init_log checks maximal meta size of the array,
which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
If this is too big to fit in one page, r5l_init_log aborts.

With current meta data, r5l_log support raid_disks up to 203.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18 13:24:46 -08:00
Shaohua Li
354b445b5f raid5-cache: fix lockdep warning
lockdep reports warning of the rcu_dereference usage. Using normal rdev
access pattern to avoid the warning.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17 11:30:27 -08:00
JackieLiu
3fd880af41 raid5-cache: restrict the use area of the log_offset variable
We can calculate this offset by using ctx->meta_total_blocks,
without passing in from the function

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07 15:08:22 -08:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Shaohua Li
9a8b27fac5 raid5-cache: correct condition for empty metadata write
As long as we recover one metadata block, we should write the empty metadata
write. The original code could make recovery corrupted if only one meta is
valid.

Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28 22:04:03 -07:00
Zhengyuan Liu
56056c2e7d md/raid5: write an empty meta-block when creating log super-block
If superblock points to an invalid meta block, r5l_load_log will set
create_super with true and create an new superblock, this runtime path
would always happen if we do no writing I/O to this array since it was
created. Writing an empty meta block could avoid this unnecessary
action at the first time we created log superblock.

Another reason is for the corretness of log recovery. Currently we have
bellow code to guarantee log revocery to be correct.

        if (ctx.seq > log->last_cp_seq + 1) {
                int ret;

                ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                if (ret)
                        return ret;
                log->seq = ctx.seq + 11;
                log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                r5l_write_super(log, ctx.pos);
        } else {
                log->log_start = ctx.pos;
                log->seq = ctx.seq;
        }

If we just created a array with a journal device, log->log_start and
log->last_checkpoint should all be 0, then we write three meta block
which are valid except mid one and supposed crash happened. The ctx.seq
would equal to log->last_cp_seq + 1 and log->log_start would be set to
position of mid invalid meta block after we did a recovery, this will
lead to problems which could be avoided with this patch.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:18 -07:00
Zhengyuan Liu
28cd88e2b4 md/raid5: initialize next_checkpoint field before use
No initial operation was done to this field when we
load/recovery the log, it got assignment only when IO
to raid disk was finished. So r5l_quiesce may use wrong
next_checkpoint to reclaim log space, that would make
reclaimable space calculation confused.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:18 -07:00
Shaohua Li
8e018c21da raid5-cache: fix a deadlock in superblock write
There is a potential deadlock in superblock write. Discard could zero data, so
before discard we must make sure superblock is updated to new log tail.
Updating superblock (either directly call md_update_sb() or depend on md
thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called
with reconfig_mutex hold. The first step of raid5_quiesce() is waitting for all
IO finish, hence waitting for reclaim thread, while reclaim thread is calling
this function and waitting for reconfig mutex. So there is a deadlock. We
workaround this issue with a trylock. The downside of the solution is we could
miss discard if we can't take reconfig mutex. But this should happen rarely
(mainly in raid array stop), so miss discard shouldn't be a big problem.

Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-31 09:05:18 -07:00
Jens Axboe
1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Mike Christie
28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
796a5cf083 md: use bio op accessors
Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Linus Torvalds
feaa7cb5c5 Merge tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "Several patches from Guoqing fixing md-cluster bugs and several
  patches from Heinz fixing dm-raid bugs"

* tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md-cluster: check the return value of process_recvd_msg
  md-cluster: gather resync infos and enable recv_thread after bitmap is ready
  md: set MD_CHANGE_PENDING in a atomic region
  md: raid5: add prerequisite to run underneath dm-raid
  md: raid10: add prerequisite to run underneath dm-raid
  md: md.c: fix oops in mddev_suspend for raid0
  md-cluster: fix ifnullfree.cocci warnings
  md-cluster/bitmap: unplug bitmap to sync dirty pages to disk
  md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit
  md-cluster/bitmap: fix wrong calcuation of offset
  md-cluster: sync bitmap when node received RESYNCING msg
  md-cluster: always setup in-memory bitmap
  md-cluster: wakeup thread if activated a spare disk
  md-cluster: change array_sectors and update size are not supported
  md-cluster: fix locking when node joins cluster during message broadcast
  md-cluster: unregister thread if err happened
  md-cluster: wake up thread to continue recovery
  md-cluser: make resync_finish only called after pers->sync_request
  md-cluster: change resync lock from asynchronous to synchronous
2016-05-19 17:25:13 -07:00
Guoqing Jiang
85ad1d13ee md: set MD_CHANGE_PENDING in a atomic region
Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are atomicial, and
bit_clear_unless (suggested by Neil) is introduced for the purpose.

Cc: Martin Kepplinger <martink@posteo.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: <linux-kernel@vger.kernel.org>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:24:02 -07:00
Jens Axboe
c888a8f95a block: kill off q->flush_flags
Now that we converted everything to the newer block write cache
interface, kill off the queue flush_flags and queueable flush
entries.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-13 13:33:19 -06:00
Shaohua Li
16a43f6a65 raid5-cache: handle journal hotadd in quiesce
Handle journal hotadd in quiesce to avoid creating duplicated threads.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:43 +11:00
Shaohua Li
a62ab49eb5 md: set MD_HAS_JOURNAL in correct places
Set MD_HAS_JOURNAL when a array is loaded or journal is initialized.
This is to avoid the flags set too early in journal disk hotadd.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:43 +11:00
Christoph Hellwig
5036c39020 raid5: allow r5l_io_unit allocations to fail
And propagate the error up the stack so we can add the stripe
to no_stripes_list and retry our log operation later.  This avoids
blocking raid5d due to reclaim, an it allows to get rid of the
deadlock-prone GFP_NOFAIL allocation.

shli: add missing mempool_destroy()

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:12 +11:00
Christoph Hellwig
e8deb63810 raid5-cache: use a mempool for the metadata block
We only have a limited number in flight, so use a page based mempool.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:08 +11:00
Christoph Hellwig
c38d29b33b raid5-cache: use a bio_set
This allows us to make guaranteed forward progress.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:04 +11:00
Shaohua Li
f6b6ec5cfa raid5-cache: add journal hot add/remove support
Add support for journal disk hot add/remove. Mostly trival checks in md
part. The raid5 part is a little tricky. For hot-remove, we can't wait
pending write as it's called from raid5d. The wait will cause deadlock.
We simplily fail the hot-remove. A hot-remove retry can success
eventually since if journal disk is faulty all pending write will be
failed and finish. For hot-add, since an array supporting journal but
without journal disk will be marked read-only, we are safe to hot add
journal without stopping IO (should be read IO, while journal only
handles write IO).

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:57 +11:00
Christoph Hellwig
ad66d445ee raid5-cache: free meta_page earlier
Once the I/O completed we don't need the meta page anymore.  As the iounits
can live on for a long time this reduces memory pressure a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:43 +11:00
Christoph Hellwig
3848c0bcb0 raid5-cache: simplify r5l_move_io_unit_list
It's only used for one kind of move, so make that explicit.  Also clean
up the code a bit by using list_for_each_safe.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:34 +11:00
Shaohua Li
7dde2ad3c5 raid5-cache: start raid5 readonly if journal is missing
If raid array is expected to have journal (eg, journal is set in MD
superblock feature map) and the array is started without journal disk,
start the array readonly.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li
6e74a9cfb5 raid5-cache: IO error handling
There are 3 places the raid5-cache dispatches IO. The discard IO error
doesn't matter, so we ignore it. The superblock write IO error can be
handled in MD core. The remaining are log write and flush. When the IO
error happens, we mark log disk faulty and fail all write IO. Read IO is
still allowed to run. Userspace will get a notification too and
corresponding daemon can choose setting raid array readonly for example.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li
4b482044d2 raid5-cache: add trim support for log
Since superblock is updated infrequently, we do a simple trim of log
disk (a synchronous trim)

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Christoph Hellwig
6143e2cecb raid5-cache: use bio chaining
Simplify the bio completion handler by using bio chaining and submitting
bios as soon as they are full.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig
2b8ef16ec4 raid5-cache: small log->seq cleanup
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig
c1b9919849 raid5-cache: new helper: r5_reserve_log_entry
Factor out code to reserve log space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig
51039cd066 raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
This is the only user, and keeping all code initializing the io_unit
structure together improves readbility.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig
1e932a37cc raid5-cache: take rdev->data_offset into account early on
Set up bi_sector properly when we allocate an bio instead of updating it
at submission time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig
b349feb36c raid5-cache: refactor bio allocation
Split out a helper to allocate a bio for log writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig
22581f58ed raid5-cache: clean up r5l_get_meta
Remove the only partially used local 'io' variable to simplify the code
flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig
56fef7c6e0 raid5-cache: simplify state machine when caches flushes are not needed
For devices without a volatile write cache we don't need to send a FLUSH
command to ensure writes are stable on disk, and thus can avoid the whole
step of batching up bios for processing by the MD thread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig
d8858f4321 raid5-cache: factor out a helper to run all stripes for an I/O unit
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Christoph Hellwig
04732f741d raid5-cache: rename flushed_ios to finished_ios
After this series we won't nessecarily have flushed the cache for these
I/Os, so give the list a more neutral name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Christoph Hellwig
170364619a raid5-cache: free I/O units earlier
There is no good reason to keep the I/O unit structures around after the
stripe has been written back to the RAID array.  The only information
we need is the log sequence number, and the checkpoint offset of the
highest successfull writeback.  Store those in the log structure, and
free the IO units from __r5l_stripe_write_finished.

Besides simplifying the code this also avoid having to keep the allocation
for the I/O unit around for a potentially long time as superblock updates
that checkpoint the log do not happen very often.

This also fixes the previously incorrect calculation of 'free' in
r5l_do_reclaim as a side effect: previous if took the last unit which
isn't checkpointed into account.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li
e6c033f79a raid5-cache: move reclaim stop to quiesce
Move reclaim stop to quiesce handling, where is safer for this stuff.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li
253f9fd41a raid5-cache: don't delay stripe captured in log
There is a case a stripe gets delayed forever.
1. a stripe finishes construction
2. a new bio hits the stripe
3. handle_stripe runs for the stripe. The stripe gets DELAYED bit set
since construction can't run for new bio (the stripe is locked since
step 1)

Without log, handle_stripe will call ops_run_io. After IO finishes, the
stripe gets unlocked and the stripe will restart and run construction
for the new bio. With log, ops_run_io need to run two times. If the
DELAYED bit set, the stripe can't enter into the handle_list, so the
second ops_run_io doesn't run, which leaves the stripe stalled.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li
85f2f9a4f4 raid5-cache: check stripe finish out of order
stripes could finish out of order. Hence r5l_move_io_unit_list() of
__r5l_stripe_write_finished might not move any entry and leave
stripe_end_ios list empty.

This applies on top of http://marc.info/?l=linux-raid&m=144122700510667

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
828cbe989e raid5-cache: optimize FLUSH IO with log enabled
With log enabled, bio is written to raid disks after the bio is settled
down in log disk. The recovery guarantees we can recovery the bio data
from log disk, so we we skip FLUSH IO.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Christoph Hellwig
509ffec708 raid5-cache: move functionality out of __r5l_set_io_unit_state
Just keep __r5l_set_io_unit_state as a small set the state wrapper, and
remove r5l_set_io_unit_state entirely after moving the real
functionality to the two callers that need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
0fd22b45b2 raid5-cache: fix a user-after-free bug
r5l_compress_stripe_end_list() can free an io_unit. This breaks the
assumption only reclaimer can free io_unit. We can add a reference count
based io_unit free, but since only reclaim can wait io_unit becoming to
STRIPE_END state, we use a simple global wait queue here.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
a8c34f9159 raid5-cache: switching to state machine for log disk cache flush
Before we write stripe data to raid disks, we must guarantee stripe data
is settled down in log disk. To do this, we flush log disk cache and
wait the flush finish. That wait introduces sleep time in raid5d thread
and impact performance. This patch moves the log disk cache flush
process to the stripe handling state machine, which can remove the wait
in raid5d.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
5cb2fbd6ea raid5-cache: use crc32c checksum
crc32c has lower overhead with cpu acceleration. It's a shame I didn't
use it in first post, sorry. This changes disk format, but we are still
ok in current stage.

V2: delete unnecessary type conversion as pointed out by Bart

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
2015-11-01 13:45:39 +11:00
Shaohua Li
355810d12a raid5: log recovery
This is the log recovery support. The process is quite straightforward.
We scan the log and read all valid meta/data/parity into memory. If a
stripe's data/parity checksum is correct, the stripe will be recoveried.
Otherwise, it's discarded and we don't scan the log further. The reclaim
process guarantees stripe which starts to be flushed raid disks has
completed data/parity and has correct checksum. To recovery a stripe, we
just copy its data/parity to corresponding raid disks.

The trick thing is superblock update after recovery. we can't let
superblock point to last valid meta block. The log might look like:
| meta 1| meta 2| meta 3|
meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If superblock
points to meta 1, we write a new valid meta 2n.  If crash happens again,
new recovery will start from meta 1. Since meta 2n is valid, recovery
will think meta 3 is valid, which is wrong.  The solution is we create a
new meta in meta2 with its seq == meta 1's seq + 10 and let superblock
points to meta2.  recovery will not think meta 3 is a valid meta,
because its seq is wrong

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li
0576b1c618 raid5: log reclaim support
This is the reclaim support for raid5 log. A stripe write will have
following steps:

1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused

In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.

It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.

Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.

Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.

There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.

The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.

Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li
f6bed0ef0a raid5: add basic stripe log
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.

The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.

For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.

For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.

flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00