Merge branch 'md-next' into md-linus
commit e265eb3a30
@@ -276,14 +276,14 @@ All md devices contain:
array creation it will default to 0, though starting the array as
``clean`` will set it much larger.

new_dev
This file can be written but not read. The value written should
be a block device number as major:minor. e.g. 8:0
This will cause that device to be attached to the array, if it is
available. It will then appear at md/dev-XXX (depending on the
name of the device) and further configuration is then possible.

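For illustration only (this is not part of the patch), a minimal user-space sketch of the interface described above; the array name md0 and the device number 8:0 are example values:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Attach the device with major:minor 8:0 to the (example) array md0. */
	const char *attr = "/sys/block/md0/md/new_dev";
	const char *dev = "8:0";
	int fd = open(attr, O_WRONLY);

	if (fd < 0) {
		perror(attr);
		return 1;
	}
	if (write(fd, dev, strlen(dev)) < 0)
		perror("write");
	close(fd);
	return 0;
}
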
safe_mode_delay
When an md array has seen no write requests for a certain period
of time, it will be marked as ``clean``. When another write
request arrives, the array is marked as ``dirty`` before the write
@@ -292,7 +292,7 @@ All md devices contain:
period as a number of seconds. The default is 200msec (0.200).
Writing a value of 0 disables safemode.

array_state
This file contains a single word which describes the current
state of the array. In many cases, the state can be set by
writing the word for the desired state, however some states
@@ -401,7 +401,30 @@ All md devices contain:
once the array becomes non-degraded, and this fact has been
recorded in the metadata.

consistency_policy
This indicates how the array maintains consistency in case of unexpected
shutdown. It can be:

none
Array has no redundancy information, e.g. raid0, linear.

resync
Full resync is performed and all redundancy is regenerated when the
array is started after unclean shutdown.

bitmap
Resync assisted by a write-intent bitmap.

journal
For raid4/5/6, journal device is used to log transactions and replay
after unclean shutdown.

ppl
For raid5 only, Partial Parity Log is used to close the write hole and
eliminate resync.

The accepted values when writing to this file are ``ppl`` and ``resync``,
used to enable and disable PPL.

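Again purely illustrative (not from the patch): a small sketch that reads the current consistency_policy of an example array md0 and then requests PPL; the write is only accepted in the situations the text above describes.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *attr = "/sys/block/md0/md/consistency_policy";
	char cur[32];
	ssize_t n;
	int fd = open(attr, O_RDONLY);

	if (fd < 0) {
		perror(attr);
		return 1;
	}
	n = read(fd, cur, sizeof(cur) - 1);
	if (n > 0) {
		cur[n] = '\0';
		printf("current policy: %s", cur);
	}
	close(fd);

	fd = open(attr, O_WRONLY);
	if (fd < 0) {
		perror(attr);
		return 1;
	}
	if (write(fd, "ppl", 3) < 0)	/* "resync" switches PPL off again */
		perror("write");
	close(fd);
	return 0;
}
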
As component devices are added to an md array, they appear in the ``md``
@@ -563,6 +586,9 @@ Each directory contains:
adds bad blocks without acknowledging them. This is largely
for testing.

ppl_sector, ppl_size
Location and size (in sectors) of the space used for Partial Parity Log
on this device.

An active md device will also contain an entry for each active device
@@ -321,4 +321,4 @@ The algorithm is:

There are some things which are not supported by cluster MD yet.

- update size and change array_sectors.
- change array_sectors.

Documentation/md/raid5-ppl.txt (new file)
@@ -0,0 +1,44 @@
Partial Parity Log

Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
addressed by PPL is that after a dirty shutdown, parity of a particular stripe
may become inconsistent with data on other member disks. If the array is also
in degraded state, there is no way to recalculate parity, because one of the
disks is missing. This can lead to silent data corruption when rebuilding the
array or using it as degraded - data calculated from parity for array blocks
that have not been touched by a write request during the unclean shutdown can
be incorrect. Such a condition is known as the RAID5 Write Hole. Because of
this, md by default does not allow starting a dirty degraded array.

Partial parity for a write operation is the XOR of stripe data chunks not
modified by this write. It is just enough data needed for recovering from the
write hole. XORing partial parity with the modified chunks produces parity for
the stripe, consistent with its state before the write operation, regardless of
which chunk writes have completed. If one of the unmodified data disks of
this stripe is missing, this updated parity can be used to recover its
contents. PPL recovery is also performed when starting an array after an
unclean shutdown and all disks are available, eliminating the need to resync
the array. Because of this, using write-intent bitmap and PPL together is not
supported.

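To make the XOR identity above concrete, here is a toy, self-contained sketch (illustrative only; single-byte "chunks" on a 4-data-disk stripe, nothing like the driver's real data structures):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Data chunks of one stripe as they are on disk before the write. */
	uint8_t d[4] = { 0x11, 0x22, 0x33, 0x44 };

	/* A write targets chunks 1 and 2. Partial parity is the XOR of the
	 * chunks NOT modified by this write, logged before any data is
	 * dispatched.
	 */
	uint8_t pp = d[0] ^ d[3];

	/* Dirty shutdown: only the write to chunk 1 reached the disk;
	 * chunk 2 and the stripe parity were never updated.
	 */
	d[1] = 0xaa;

	/* Now chunk 3 (untouched by the write) is missing. XORing the
	 * partial parity with the modified chunks as found on disk gives a
	 * parity consistent with the current stripe contents; XORing out
	 * the surviving data chunks recovers the missing one.
	 */
	uint8_t parity = pp ^ d[1] ^ d[2];
	uint8_t recovered = parity ^ d[0] ^ d[1] ^ d[2];

	assert(recovered == 0x44);
	printf("recovered chunk 3 = 0x%02x\n", recovered);
	return 0;
}

The same cancellation is why no full resync is needed when all disks survive: replaying the logged partial parity against whatever landed on the modified chunks yields a parity that matches the data actually on disk.
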
When handling a write request, PPL writes partial parity before new data and
parity are dispatched to disks. PPL is a distributed log - it is stored on
array member drives in the metadata area, on the parity drive of a particular
stripe. It does not require a dedicated journaling drive. Write performance is
reduced by up to 30%-40% but it scales with the number of drives in the array
and the journaling drive does not become a bottleneck or a single point of
failure.

Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
not a true journal. It does not protect from losing in-flight data, only from
silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
performed for this stripe (parity is not updated). So it is possible to have
arbitrary data in the written part of a stripe if that disk is lost. In such a
case the behavior is the same as in plain raid5.

PPL is available for md version-1 metadata and external (specifically IMSM)
metadata arrays. It can be enabled using the mdadm option --consistency-policy=ppl.

Currently, volatile write-back cache should be disabled on all member drives
when using PPL. Otherwise it cannot guarantee consistency in case of power
failure.

block/bio.c
@ -633,20 +633,21 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
|
||||
}
|
||||
EXPORT_SYMBOL(bio_clone_fast);
|
||||
|
||||
static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
|
||||
struct bio_set *bs, int offset,
|
||||
int size)
|
||||
/**
|
||||
* bio_clone_bioset - clone a bio
|
||||
* @bio_src: bio to clone
|
||||
* @gfp_mask: allocation priority
|
||||
* @bs: bio_set to allocate from
|
||||
*
|
||||
* Clone bio. Caller will own the returned bio, but not the actual data it
|
||||
* points to. Reference count of returned bio will be one.
|
||||
*/
|
||||
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
|
||||
struct bio_set *bs)
|
||||
{
|
||||
struct bvec_iter iter;
|
||||
struct bio_vec bv;
|
||||
struct bio *bio;
|
||||
struct bvec_iter iter_src = bio_src->bi_iter;
|
||||
|
||||
/* for supporting partial clone */
|
||||
if (offset || size != bio_src->bi_iter.bi_size) {
|
||||
bio_advance_iter(bio_src, &iter_src, offset);
|
||||
iter_src.bi_size = size;
|
||||
}
|
||||
|
||||
/*
|
||||
* Pre immutable biovecs, __bio_clone() used to just do a memcpy from
|
||||
@ -670,8 +671,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
|
||||
* __bio_clone_fast() anyways.
|
||||
*/
|
||||
|
||||
bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
|
||||
&iter_src), bs);
|
||||
bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
|
||||
if (!bio)
|
||||
return NULL;
|
||||
bio->bi_bdev = bio_src->bi_bdev;
|
||||
@ -688,7 +688,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
|
||||
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
|
||||
break;
|
||||
default:
|
||||
__bio_for_each_segment(bv, bio_src, iter, iter_src)
|
||||
bio_for_each_segment(bv, bio_src, iter)
|
||||
bio->bi_io_vec[bio->bi_vcnt++] = bv;
|
||||
break;
|
||||
}
|
||||
@ -707,43 +707,8 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
|
||||
|
||||
return bio;
|
||||
}
|
||||
|
||||
/**
|
||||
* bio_clone_bioset - clone a bio
|
||||
* @bio_src: bio to clone
|
||||
* @gfp_mask: allocation priority
|
||||
* @bs: bio_set to allocate from
|
||||
*
|
||||
* Clone bio. Caller will own the returned bio, but not the actual data it
|
||||
* points to. Reference count of returned bio will be one.
|
||||
*/
|
||||
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
|
||||
struct bio_set *bs)
|
||||
{
|
||||
return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
|
||||
bio_src->bi_iter.bi_size);
|
||||
}
|
||||
EXPORT_SYMBOL(bio_clone_bioset);
|
||||
|
||||
/**
|
||||
* bio_clone_bioset_partial - clone a partial bio
|
||||
* @bio_src: bio to clone
|
||||
* @gfp_mask: allocation priority
|
||||
* @bs: bio_set to allocate from
|
||||
* @offset: cloned starting from the offset
|
||||
* @size: size for the cloned bio
|
||||
*
|
||||
* Clone bio. Caller will own the returned bio, but not the actual data it
|
||||
* points to. Reference count of returned bio will be one.
|
||||
*/
|
||||
struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
|
||||
struct bio_set *bs, int offset,
|
||||
int size)
|
||||
{
|
||||
return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
|
||||
}
|
||||
EXPORT_SYMBOL(bio_clone_bioset_partial);
|
||||
|
||||
/**
|
||||
* bio_add_pc_page - attempt to add page to bio
|
||||
* @q: the target queue
|
||||
|
@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
dm-era-y += dm-era-target.o
dm-verity-y += dm-verity-target.o
md-mod-y += md.o bitmap.o
raid456-y += raid5.o raid5-cache.o
raid456-y += raid5.o raid5-cache.o raid5-ppl.o

# Note: link order is important. All raid personalities
# and must come before md.o, as they each initialise

@ -471,6 +471,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
|
||||
kunmap_atomic(sb);
|
||||
write_page(bitmap, bitmap->storage.sb_page, 1);
|
||||
}
|
||||
EXPORT_SYMBOL(bitmap_update_sb);
|
||||
|
||||
/* print out the bitmap file superblock */
|
||||
void bitmap_print_sb(struct bitmap *bitmap)
|
||||
@ -696,7 +697,7 @@ re_read:
|
||||
|
||||
out:
|
||||
kunmap_atomic(sb);
|
||||
/* Assiging chunksize is required for "re_read" */
|
||||
/* Assigning chunksize is required for "re_read" */
|
||||
bitmap->mddev->bitmap_info.chunksize = chunksize;
|
||||
if (err == 0 && nodes && (bitmap->cluster_slot < 0)) {
|
||||
err = md_setup_cluster(bitmap->mddev, nodes);
|
||||
@ -1727,7 +1728,7 @@ void bitmap_flush(struct mddev *mddev)
|
||||
/*
|
||||
* free memory that was allocated
|
||||
*/
|
||||
static void bitmap_free(struct bitmap *bitmap)
|
||||
void bitmap_free(struct bitmap *bitmap)
|
||||
{
|
||||
unsigned long k, pages;
|
||||
struct bitmap_page *bp;
|
||||
@ -1761,6 +1762,21 @@ static void bitmap_free(struct bitmap *bitmap)
|
||||
kfree(bp);
|
||||
kfree(bitmap);
|
||||
}
|
||||
EXPORT_SYMBOL(bitmap_free);
|
||||
|
||||
void bitmap_wait_behind_writes(struct mddev *mddev)
|
||||
{
|
||||
struct bitmap *bitmap = mddev->bitmap;
|
||||
|
||||
/* wait for behind writes to complete */
|
||||
if (bitmap && atomic_read(&bitmap->behind_writes) > 0) {
|
||||
pr_debug("md:%s: behind writes in progress - waiting to stop.\n",
|
||||
mdname(mddev));
|
||||
/* need to kick something here to make sure I/O goes? */
|
||||
wait_event(bitmap->behind_wait,
|
||||
atomic_read(&bitmap->behind_writes) == 0);
|
||||
}
|
||||
}
|
||||
|
||||
void bitmap_destroy(struct mddev *mddev)
|
||||
{
|
||||
@ -1769,6 +1785,8 @@ void bitmap_destroy(struct mddev *mddev)
|
||||
if (!bitmap) /* there was no bitmap */
|
||||
return;
|
||||
|
||||
bitmap_wait_behind_writes(mddev);
|
||||
|
||||
mutex_lock(&mddev->bitmap_info.mutex);
|
||||
spin_lock(&mddev->lock);
|
||||
mddev->bitmap = NULL; /* disconnect from the md device */
|
||||
@ -1920,6 +1938,27 @@ out:
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(bitmap_load);
|
||||
|
||||
struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
|
||||
{
|
||||
int rv = 0;
|
||||
struct bitmap *bitmap;
|
||||
|
||||
bitmap = bitmap_create(mddev, slot);
|
||||
if (IS_ERR(bitmap)) {
|
||||
rv = PTR_ERR(bitmap);
|
||||
return ERR_PTR(rv);
|
||||
}
|
||||
|
||||
rv = bitmap_init_from_disk(bitmap, 0);
|
||||
if (rv) {
|
||||
bitmap_free(bitmap);
|
||||
return ERR_PTR(rv);
|
||||
}
|
||||
|
||||
return bitmap;
|
||||
}
|
||||
EXPORT_SYMBOL(get_bitmap_from_slot);
|
||||
|
||||
/* Loads the bitmap associated with slot and copies the resync information
|
||||
* to our bitmap
|
||||
*/
|
||||
@ -1929,14 +1968,13 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
|
||||
int rv = 0, i, j;
|
||||
sector_t block, lo = 0, hi = 0;
|
||||
struct bitmap_counts *counts;
|
||||
struct bitmap *bitmap = bitmap_create(mddev, slot);
|
||||
struct bitmap *bitmap;
|
||||
|
||||
if (IS_ERR(bitmap))
|
||||
return PTR_ERR(bitmap);
|
||||
|
||||
rv = bitmap_init_from_disk(bitmap, 0);
|
||||
if (rv)
|
||||
goto err;
|
||||
bitmap = get_bitmap_from_slot(mddev, slot);
|
||||
if (IS_ERR(bitmap)) {
|
||||
pr_err("%s can't get bitmap from slot %d\n", __func__, slot);
|
||||
return -1;
|
||||
}
|
||||
|
||||
counts = &bitmap->counts;
|
||||
for (j = 0; j < counts->chunks; j++) {
|
||||
@ -1963,8 +2001,7 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
|
||||
bitmap_unplug(mddev->bitmap);
|
||||
*low = lo;
|
||||
*high = hi;
|
||||
err:
|
||||
bitmap_free(bitmap);
|
||||
|
||||
return rv;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
|
||||
|
@@ -267,8 +267,11 @@ void bitmap_daemon_work(struct mddev *mddev);

int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
int chunksize, int init);
struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot);
int bitmap_copy_from_slot(struct mddev *mddev, int slot,
sector_t *lo, sector_t *hi, bool clear_bits);
void bitmap_free(struct bitmap *bitmap);
void bitmap_wait_behind_writes(struct mddev *mddev);
#endif

#endif

@ -249,54 +249,49 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
|
||||
{
|
||||
char b[BDEVNAME_SIZE];
|
||||
struct dev_info *tmp_dev;
|
||||
struct bio *split;
|
||||
sector_t start_sector, end_sector, data_offset;
|
||||
sector_t bio_sector = bio->bi_iter.bi_sector;
|
||||
|
||||
if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
|
||||
md_flush_request(mddev, bio);
|
||||
return;
|
||||
}
|
||||
|
||||
do {
|
||||
sector_t bio_sector = bio->bi_iter.bi_sector;
|
||||
tmp_dev = which_dev(mddev, bio_sector);
|
||||
start_sector = tmp_dev->end_sector - tmp_dev->rdev->sectors;
|
||||
end_sector = tmp_dev->end_sector;
|
||||
data_offset = tmp_dev->rdev->data_offset;
|
||||
bio->bi_bdev = tmp_dev->rdev->bdev;
|
||||
tmp_dev = which_dev(mddev, bio_sector);
|
||||
start_sector = tmp_dev->end_sector - tmp_dev->rdev->sectors;
|
||||
end_sector = tmp_dev->end_sector;
|
||||
data_offset = tmp_dev->rdev->data_offset;
|
||||
|
||||
if (unlikely(bio_sector >= end_sector ||
|
||||
bio_sector < start_sector))
|
||||
goto out_of_bounds;
|
||||
if (unlikely(bio_sector >= end_sector ||
|
||||
bio_sector < start_sector))
|
||||
goto out_of_bounds;
|
||||
|
||||
if (unlikely(bio_end_sector(bio) > end_sector)) {
|
||||
/* This bio crosses a device boundary, so we have to
|
||||
* split it.
|
||||
*/
|
||||
split = bio_split(bio, end_sector - bio_sector,
|
||||
GFP_NOIO, fs_bio_set);
|
||||
bio_chain(split, bio);
|
||||
} else {
|
||||
split = bio;
|
||||
}
|
||||
if (unlikely(bio_end_sector(bio) > end_sector)) {
|
||||
/* This bio crosses a device boundary, so we have to split it */
|
||||
struct bio *split = bio_split(bio, end_sector - bio_sector,
|
||||
GFP_NOIO, mddev->bio_set);
|
||||
bio_chain(split, bio);
|
||||
generic_make_request(bio);
|
||||
bio = split;
|
||||
}
|
||||
|
||||
split->bi_iter.bi_sector = split->bi_iter.bi_sector -
|
||||
start_sector + data_offset;
|
||||
bio->bi_bdev = tmp_dev->rdev->bdev;
|
||||
bio->bi_iter.bi_sector = bio->bi_iter.bi_sector -
|
||||
start_sector + data_offset;
|
||||
|
||||
if (unlikely((bio_op(split) == REQ_OP_DISCARD) &&
|
||||
!blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
|
||||
/* Just ignore it */
|
||||
bio_endio(split);
|
||||
} else {
|
||||
if (mddev->gendisk)
|
||||
trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
|
||||
split, disk_devt(mddev->gendisk),
|
||||
bio_sector);
|
||||
mddev_check_writesame(mddev, split);
|
||||
mddev_check_write_zeroes(mddev, split);
|
||||
generic_make_request(split);
|
||||
}
|
||||
} while (split != bio);
|
||||
if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
|
||||
!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) {
|
||||
/* Just ignore it */
|
||||
bio_endio(bio);
|
||||
} else {
|
||||
if (mddev->gendisk)
|
||||
trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
|
||||
bio, disk_devt(mddev->gendisk),
|
||||
bio_sector);
|
||||
mddev_check_writesame(mddev, bio);
|
||||
mddev_check_write_zeroes(mddev, bio);
|
||||
generic_make_request(bio);
|
||||
}
|
||||
return;
|
||||
|
||||
out_of_bounds:
|
||||
|
@ -67,9 +67,10 @@ struct resync_info {
|
||||
* set up all the related infos such as bitmap and personality */
|
||||
#define MD_CLUSTER_ALREADY_IN_CLUSTER 6
|
||||
#define MD_CLUSTER_PENDING_RECV_EVENT 7
|
||||
|
||||
#define MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD 8
|
||||
|
||||
struct md_cluster_info {
|
||||
struct mddev *mddev; /* the md device which md_cluster_info belongs to */
|
||||
/* dlm lock space and resources for clustered raid. */
|
||||
dlm_lockspace_t *lockspace;
|
||||
int slot_number;
|
||||
@ -103,6 +104,7 @@ enum msg_type {
|
||||
REMOVE,
|
||||
RE_ADD,
|
||||
BITMAP_NEEDS_SYNC,
|
||||
CHANGE_CAPACITY,
|
||||
};
|
||||
|
||||
struct cluster_msg {
|
||||
@ -523,11 +525,17 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
|
||||
|
||||
static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
|
||||
{
|
||||
int got_lock = 0;
|
||||
struct md_cluster_info *cinfo = mddev->cluster_info;
|
||||
mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
|
||||
set_bit(MD_RELOAD_SB, &mddev->flags);
|
||||
|
||||
dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
|
||||
md_wakeup_thread(mddev->thread);
|
||||
wait_event(mddev->thread->wqueue,
|
||||
(got_lock = mddev_trylock(mddev)) ||
|
||||
test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
|
||||
md_reload_sb(mddev, mddev->good_device_nr);
|
||||
if (got_lock)
|
||||
mddev_unlock(mddev);
|
||||
}
|
||||
|
||||
static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
|
||||
@ -572,6 +580,10 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
|
||||
case METADATA_UPDATED:
|
||||
process_metadata_update(mddev, msg);
|
||||
break;
|
||||
case CHANGE_CAPACITY:
|
||||
set_capacity(mddev->gendisk, mddev->array_sectors);
|
||||
revalidate_disk(mddev->gendisk);
|
||||
break;
|
||||
case RESYNCING:
|
||||
process_suspend_info(mddev, le32_to_cpu(msg->slot),
|
||||
le64_to_cpu(msg->low),
|
||||
@ -646,11 +658,29 @@ out:
|
||||
* Takes the lock on the TOKEN lock resource so no other
|
||||
* node can communicate while the operation is underway.
|
||||
*/
|
||||
static int lock_token(struct md_cluster_info *cinfo)
|
||||
static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
|
||||
{
|
||||
int error;
|
||||
int error, set_bit = 0;
|
||||
struct mddev *mddev = cinfo->mddev;
|
||||
|
||||
/*
|
||||
* If the resync thread runs after the raid1d thread, then process_metadata_update
* could not continue if raid1d held reconfig_mutex (and raid1d is blocked
* since another node already got EX on Token and is waiting for the EX of Ack),
|
||||
* so let resync wake up thread in case flag is set.
|
||||
*/
|
||||
if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
|
||||
&cinfo->state)) {
|
||||
error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
|
||||
&cinfo->state);
|
||||
WARN_ON_ONCE(error);
|
||||
md_wakeup_thread(mddev->thread);
|
||||
set_bit = 1;
|
||||
}
|
||||
error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
|
||||
if (set_bit)
|
||||
clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
|
||||
|
||||
if (error)
|
||||
pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
|
||||
__func__, __LINE__, error);
|
||||
@ -663,12 +693,12 @@ static int lock_token(struct md_cluster_info *cinfo)
|
||||
/* lock_comm()
|
||||
* Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
|
||||
*/
|
||||
static int lock_comm(struct md_cluster_info *cinfo)
|
||||
static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
|
||||
{
|
||||
wait_event(cinfo->wait,
|
||||
!test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
|
||||
|
||||
return lock_token(cinfo);
|
||||
return lock_token(cinfo, mddev_locked);
|
||||
}
|
||||
|
||||
static void unlock_comm(struct md_cluster_info *cinfo)
|
||||
@ -743,11 +773,12 @@ failed_message:
|
||||
return error;
|
||||
}
|
||||
|
||||
static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
|
||||
static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
|
||||
bool mddev_locked)
|
||||
{
|
||||
int ret;
|
||||
|
||||
lock_comm(cinfo);
|
||||
lock_comm(cinfo, mddev_locked);
|
||||
ret = __sendmsg(cinfo, cmsg);
|
||||
unlock_comm(cinfo);
|
||||
return ret;
|
||||
@ -834,6 +865,7 @@ static int join(struct mddev *mddev, int nodes)
|
||||
mutex_init(&cinfo->recv_mutex);
|
||||
|
||||
mddev->cluster_info = cinfo;
|
||||
cinfo->mddev = mddev;
|
||||
|
||||
memset(str, 0, 64);
|
||||
sprintf(str, "%pU", mddev->uuid);
|
||||
@ -908,6 +940,7 @@ static int join(struct mddev *mddev, int nodes)
|
||||
|
||||
return 0;
|
||||
err:
|
||||
set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
|
||||
md_unregister_thread(&cinfo->recovery_thread);
|
||||
md_unregister_thread(&cinfo->recv_thread);
|
||||
lockres_free(cinfo->message_lockres);
|
||||
@ -943,7 +976,7 @@ static void resync_bitmap(struct mddev *mddev)
|
||||
int err;
|
||||
|
||||
cmsg.type = cpu_to_le32(BITMAP_NEEDS_SYNC);
|
||||
err = sendmsg(cinfo, &cmsg);
|
||||
err = sendmsg(cinfo, &cmsg, 1);
|
||||
if (err)
|
||||
pr_err("%s:%d: failed to send BITMAP_NEEDS_SYNC message (%d)\n",
|
||||
__func__, __LINE__, err);
|
||||
@ -963,6 +996,7 @@ static int leave(struct mddev *mddev)
|
||||
if (cinfo->slot_number > 0 && mddev->recovery_cp != MaxSector)
|
||||
resync_bitmap(mddev);
|
||||
|
||||
set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
|
||||
md_unregister_thread(&cinfo->recovery_thread);
|
||||
md_unregister_thread(&cinfo->recv_thread);
|
||||
lockres_free(cinfo->message_lockres);
|
||||
@ -997,16 +1031,30 @@ static int slot_number(struct mddev *mddev)
|
||||
static int metadata_update_start(struct mddev *mddev)
|
||||
{
|
||||
struct md_cluster_info *cinfo = mddev->cluster_info;
|
||||
int ret;
|
||||
|
||||
/*
|
||||
* metadata_update_start is always called with the protection of
|
||||
* reconfig_mutex, so set WAITING_FOR_TOKEN here.
|
||||
*/
|
||||
ret = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
|
||||
&cinfo->state);
|
||||
WARN_ON_ONCE(ret);
|
||||
md_wakeup_thread(mddev->thread);
|
||||
|
||||
wait_event(cinfo->wait,
|
||||
!test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state) ||
|
||||
test_and_clear_bit(MD_CLUSTER_SEND_LOCKED_ALREADY, &cinfo->state));
|
||||
|
||||
/* If token is already locked, return 0 */
|
||||
if (cinfo->token_lockres->mode == DLM_LOCK_EX)
|
||||
if (cinfo->token_lockres->mode == DLM_LOCK_EX) {
|
||||
clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
|
||||
return 0;
|
||||
}
|
||||
|
||||
return lock_token(cinfo);
|
||||
ret = lock_token(cinfo, 1);
|
||||
clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int metadata_update_finish(struct mddev *mddev)
|
||||
@ -1043,6 +1091,141 @@ static void metadata_update_cancel(struct mddev *mddev)
|
||||
unlock_comm(cinfo);
|
||||
}
|
||||
|
||||
/*
|
||||
* return 0 if all the bitmaps have the same sync_size
|
||||
*/
|
||||
int cluster_check_sync_size(struct mddev *mddev)
|
||||
{
|
||||
int i, rv;
|
||||
bitmap_super_t *sb;
|
||||
unsigned long my_sync_size, sync_size = 0;
|
||||
int node_num = mddev->bitmap_info.nodes;
|
||||
int current_slot = md_cluster_ops->slot_number(mddev);
|
||||
struct bitmap *bitmap = mddev->bitmap;
|
||||
char str[64];
|
||||
struct dlm_lock_resource *bm_lockres;
|
||||
|
||||
sb = kmap_atomic(bitmap->storage.sb_page);
|
||||
my_sync_size = sb->sync_size;
|
||||
kunmap_atomic(sb);
|
||||
|
||||
for (i = 0; i < node_num; i++) {
|
||||
if (i == current_slot)
|
||||
continue;
|
||||
|
||||
bitmap = get_bitmap_from_slot(mddev, i);
|
||||
if (IS_ERR(bitmap)) {
|
||||
pr_err("can't get bitmap from slot %d\n", i);
|
||||
return -1;
|
||||
}
|
||||
|
||||
/*
|
||||
* If we can hold the bitmap lock of one node then
|
||||
* the slot is not occupied, update the sb.
|
||||
*/
|
||||
snprintf(str, 64, "bitmap%04d", i);
|
||||
bm_lockres = lockres_init(mddev, str, NULL, 1);
|
||||
if (!bm_lockres) {
|
||||
pr_err("md-cluster: Cannot initialize %s\n", str);
|
||||
bitmap_free(bitmap);
|
||||
return -1;
|
||||
}
|
||||
bm_lockres->flags |= DLM_LKF_NOQUEUE;
|
||||
rv = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
|
||||
if (!rv)
|
||||
bitmap_update_sb(bitmap);
|
||||
lockres_free(bm_lockres);
|
||||
|
||||
sb = kmap_atomic(bitmap->storage.sb_page);
|
||||
if (sync_size == 0)
|
||||
sync_size = sb->sync_size;
|
||||
else if (sync_size != sb->sync_size) {
|
||||
kunmap_atomic(sb);
|
||||
bitmap_free(bitmap);
|
||||
return -1;
|
||||
}
|
||||
kunmap_atomic(sb);
|
||||
bitmap_free(bitmap);
|
||||
}
|
||||
|
||||
return (my_sync_size == sync_size) ? 0 : -1;
|
||||
}
|
||||
|
||||
/*
|
||||
* Updating the size for cluster raid is a little more complex: we perform it
* in the following steps:
|
||||
* 1. hold token lock and update superblock in initiator node.
|
||||
* 2. send METADATA_UPDATED msg to other nodes.
|
||||
* 3. The initiator node continues to check each bitmap's sync_size, if all
|
||||
* bitmaps have the same value of sync_size, then we can set capacity and
|
||||
* let other nodes perform it. If one node can't update sync_size
|
||||
* accordingly, we need to revert to previous value.
|
||||
*/
|
||||
static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
|
||||
{
|
||||
struct md_cluster_info *cinfo = mddev->cluster_info;
|
||||
struct cluster_msg cmsg;
|
||||
struct md_rdev *rdev;
|
||||
int ret = 0;
|
||||
int raid_slot = -1;
|
||||
|
||||
md_update_sb(mddev, 1);
|
||||
lock_comm(cinfo, 1);
|
||||
|
||||
memset(&cmsg, 0, sizeof(cmsg));
|
||||
cmsg.type = cpu_to_le32(METADATA_UPDATED);
|
||||
rdev_for_each(rdev, mddev)
|
||||
if (rdev->raid_disk >= 0 && !test_bit(Faulty, &rdev->flags)) {
|
||||
raid_slot = rdev->desc_nr;
|
||||
break;
|
||||
}
|
||||
if (raid_slot >= 0) {
|
||||
cmsg.raid_slot = cpu_to_le32(raid_slot);
|
||||
/*
|
||||
* We can only change capacity after all the nodes can do it,
|
||||
* so need to wait after other nodes already received the msg
|
||||
* and handled the change
|
||||
*/
|
||||
ret = __sendmsg(cinfo, &cmsg);
|
||||
if (ret) {
|
||||
pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
|
||||
__func__, __LINE__);
|
||||
unlock_comm(cinfo);
|
||||
return;
|
||||
}
|
||||
} else {
|
||||
pr_err("md-cluster: No good device id found to send\n");
|
||||
unlock_comm(cinfo);
|
||||
return;
|
||||
}
|
||||
|
||||
/*
|
||||
* check the sync_size from other node's bitmap, if sync_size
|
||||
* have already updated in other nodes as expected, send an
|
||||
* empty metadata msg to permit the change of capacity
|
||||
*/
|
||||
if (cluster_check_sync_size(mddev) == 0) {
|
||||
memset(&cmsg, 0, sizeof(cmsg));
|
||||
cmsg.type = cpu_to_le32(CHANGE_CAPACITY);
|
||||
ret = __sendmsg(cinfo, &cmsg);
|
||||
if (ret)
|
||||
pr_err("%s:%d: failed to send CHANGE_CAPACITY msg\n",
|
||||
__func__, __LINE__);
|
||||
set_capacity(mddev->gendisk, mddev->array_sectors);
|
||||
revalidate_disk(mddev->gendisk);
|
||||
} else {
|
||||
/* revert to previous sectors */
|
||||
ret = mddev->pers->resize(mddev, old_dev_sectors);
|
||||
if (!ret)
|
||||
revalidate_disk(mddev->gendisk);
|
||||
ret = __sendmsg(cinfo, &cmsg);
|
||||
if (ret)
|
||||
pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
|
||||
__func__, __LINE__);
|
||||
}
|
||||
unlock_comm(cinfo);
|
||||
}
|
||||
|
||||
static int resync_start(struct mddev *mddev)
|
||||
{
|
||||
struct md_cluster_info *cinfo = mddev->cluster_info;
|
||||
@ -1069,7 +1252,14 @@ static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
|
||||
cmsg.low = cpu_to_le64(lo);
|
||||
cmsg.high = cpu_to_le64(hi);
|
||||
|
||||
return sendmsg(cinfo, &cmsg);
|
||||
/*
|
||||
* mddev_lock is held if resync_info_update is called from
|
||||
* resync_finish (md_reap_sync_thread -> resync_finish)
|
||||
*/
|
||||
if (lo == 0 && hi == 0)
|
||||
return sendmsg(cinfo, &cmsg, 1);
|
||||
else
|
||||
return sendmsg(cinfo, &cmsg, 0);
|
||||
}
|
||||
|
||||
static int resync_finish(struct mddev *mddev)
|
||||
@ -1119,7 +1309,7 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
|
||||
cmsg.type = cpu_to_le32(NEWDISK);
|
||||
memcpy(cmsg.uuid, uuid, 16);
|
||||
cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
|
||||
lock_comm(cinfo);
|
||||
lock_comm(cinfo, 1);
|
||||
ret = __sendmsg(cinfo, &cmsg);
|
||||
if (ret)
|
||||
return ret;
|
||||
@ -1179,7 +1369,7 @@ static int remove_disk(struct mddev *mddev, struct md_rdev *rdev)
|
||||
struct md_cluster_info *cinfo = mddev->cluster_info;
|
||||
cmsg.type = cpu_to_le32(REMOVE);
|
||||
cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
|
||||
return sendmsg(cinfo, &cmsg);
|
||||
return sendmsg(cinfo, &cmsg, 1);
|
||||
}
|
||||
|
||||
static int lock_all_bitmaps(struct mddev *mddev)
|
||||
@ -1243,7 +1433,7 @@ static int gather_bitmaps(struct md_rdev *rdev)
|
||||
|
||||
cmsg.type = cpu_to_le32(RE_ADD);
|
||||
cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
|
||||
err = sendmsg(cinfo, &cmsg);
|
||||
err = sendmsg(cinfo, &cmsg, 1);
|
||||
if (err)
|
||||
goto out;
|
||||
|
||||
@ -1281,6 +1471,7 @@ static struct md_cluster_operations cluster_ops = {
|
||||
.gather_bitmaps = gather_bitmaps,
|
||||
.lock_all_bitmaps = lock_all_bitmaps,
|
||||
.unlock_all_bitmaps = unlock_all_bitmaps,
|
||||
.update_size = update_size,
|
||||
};
|
||||
|
||||
static int __init cluster_init(void)
|
||||
|
@@ -27,6 +27,7 @@ struct md_cluster_operations {
int (*gather_bitmaps)(struct md_rdev *rdev);
int (*lock_all_bitmaps)(struct mddev *mddev);
void (*unlock_all_bitmaps)(struct mddev *mddev);
void (*update_size)(struct mddev *mddev, sector_t old_dev_sectors);
};

#endif /* _MD_CLUSTER_H */

drivers/md/md.c
@ -65,6 +65,8 @@
|
||||
#include <linux/raid/md_p.h>
|
||||
#include <linux/raid/md_u.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/percpu-refcount.h>
|
||||
|
||||
#include <trace/events/block.h>
|
||||
#include "md.h"
|
||||
#include "bitmap.h"
|
||||
@ -172,6 +174,16 @@ static const struct block_device_operations md_fops;
|
||||
|
||||
static int start_readonly;
|
||||
|
||||
/*
|
||||
* The original mechanism for creating an md device is to create
|
||||
* a device node in /dev and to open it. This causes races with device-close.
|
||||
* The preferred method is to write to the "new_array" module parameter.
|
||||
* This can avoid races.
|
||||
* Setting create_on_open to false disables the original mechanism
|
||||
* so all the races disappear.
|
||||
*/
|
||||
static bool create_on_open = true;
|
||||
|
||||
/* bio_clone_mddev
|
||||
* like bio_clone, but with a local bio set
|
||||
*/
|
||||
@ -1507,6 +1519,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
|
||||
} else if (sb->bblog_offset != 0)
|
||||
rdev->badblocks.shift = 0;
|
||||
|
||||
if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
|
||||
rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
|
||||
rdev->ppl.size = le16_to_cpu(sb->ppl.size);
|
||||
rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
|
||||
}
|
||||
|
||||
if (!refdev) {
|
||||
ret = 1;
|
||||
} else {
|
||||
@ -1619,6 +1637,13 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
|
||||
|
||||
if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
|
||||
set_bit(MD_HAS_JOURNAL, &mddev->flags);
|
||||
|
||||
if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
|
||||
if (le32_to_cpu(sb->feature_map) &
|
||||
(MD_FEATURE_BITMAP_OFFSET | MD_FEATURE_JOURNAL))
|
||||
return -EINVAL;
|
||||
set_bit(MD_HAS_PPL, &mddev->flags);
|
||||
}
|
||||
} else if (mddev->pers == NULL) {
|
||||
/* Insist of good event counter while assembling, except for
|
||||
* spares (which don't need an event count) */
|
||||
@ -1832,6 +1857,12 @@ retry:
|
||||
if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
|
||||
sb->feature_map |= cpu_to_le32(MD_FEATURE_JOURNAL);
|
||||
|
||||
if (test_bit(MD_HAS_PPL, &mddev->flags)) {
|
||||
sb->feature_map |= cpu_to_le32(MD_FEATURE_PPL);
|
||||
sb->ppl.offset = cpu_to_le16(rdev->ppl.offset);
|
||||
sb->ppl.size = cpu_to_le16(rdev->ppl.size);
|
||||
}
|
||||
|
||||
rdev_for_each(rdev2, mddev) {
|
||||
i = rdev2->desc_nr;
|
||||
if (test_bit(Faulty, &rdev2->flags))
|
||||
@ -2072,6 +2103,10 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
|
||||
if (find_rdev(mddev, rdev->bdev->bd_dev))
|
||||
return -EEXIST;
|
||||
|
||||
if ((bdev_read_only(rdev->bdev) || bdev_read_only(rdev->meta_bdev)) &&
|
||||
mddev->pers)
|
||||
return -EROFS;
|
||||
|
||||
/* make sure rdev->sectors exceeds mddev->dev_sectors */
|
||||
if (!test_bit(Journal, &rdev->flags) &&
|
||||
rdev->sectors &&
|
||||
@ -2233,6 +2268,33 @@ static void export_array(struct mddev *mddev)
|
||||
mddev->major_version = 0;
|
||||
}
|
||||
|
||||
static bool set_in_sync(struct mddev *mddev)
|
||||
{
|
||||
WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
|
||||
if (!mddev->in_sync) {
|
||||
mddev->sync_checkers++;
|
||||
spin_unlock(&mddev->lock);
|
||||
percpu_ref_switch_to_atomic_sync(&mddev->writes_pending);
|
||||
spin_lock(&mddev->lock);
|
||||
if (!mddev->in_sync &&
|
||||
percpu_ref_is_zero(&mddev->writes_pending)) {
|
||||
mddev->in_sync = 1;
|
||||
/*
|
||||
* Ensure ->in_sync is visible before we clear
|
||||
* ->sync_checkers.
|
||||
*/
|
||||
smp_mb();
|
||||
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
|
||||
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
||||
}
|
||||
if (--mddev->sync_checkers == 0)
|
||||
percpu_ref_switch_to_percpu(&mddev->writes_pending);
|
||||
}
|
||||
if (mddev->safemode == 1)
|
||||
mddev->safemode = 0;
|
||||
return mddev->in_sync;
|
||||
}
|
||||
|
||||
static void sync_sbs(struct mddev *mddev, int nospares)
|
||||
{
|
||||
/* Update each superblock (in-memory image), but
|
||||
@ -3131,6 +3193,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
|
||||
static struct rdev_sysfs_entry rdev_unack_bad_blocks =
|
||||
__ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
|
||||
|
||||
static ssize_t
|
||||
ppl_sector_show(struct md_rdev *rdev, char *page)
|
||||
{
|
||||
return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
|
||||
}
|
||||
|
||||
static ssize_t
|
||||
ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
|
||||
{
|
||||
unsigned long long sector;
|
||||
|
||||
if (kstrtoull(buf, 10, §or) < 0)
|
||||
return -EINVAL;
|
||||
if (sector != (sector_t)sector)
|
||||
return -EINVAL;
|
||||
|
||||
if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
|
||||
rdev->raid_disk >= 0)
|
||||
return -EBUSY;
|
||||
|
||||
if (rdev->mddev->persistent) {
|
||||
if (rdev->mddev->major_version == 0)
|
||||
return -EINVAL;
|
||||
if ((sector > rdev->sb_start &&
|
||||
sector - rdev->sb_start > S16_MAX) ||
|
||||
(sector < rdev->sb_start &&
|
||||
rdev->sb_start - sector > -S16_MIN))
|
||||
return -EINVAL;
|
||||
rdev->ppl.offset = sector - rdev->sb_start;
|
||||
} else if (!rdev->mddev->external) {
|
||||
return -EBUSY;
|
||||
}
|
||||
rdev->ppl.sector = sector;
|
||||
return len;
|
||||
}
|
||||
|
||||
static struct rdev_sysfs_entry rdev_ppl_sector =
|
||||
__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
|
||||
|
||||
static ssize_t
|
||||
ppl_size_show(struct md_rdev *rdev, char *page)
|
||||
{
|
||||
return sprintf(page, "%u\n", rdev->ppl.size);
|
||||
}
|
||||
|
||||
static ssize_t
|
||||
ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
|
||||
{
|
||||
unsigned int size;
|
||||
|
||||
if (kstrtouint(buf, 10, &size) < 0)
|
||||
return -EINVAL;
|
||||
|
||||
if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
|
||||
rdev->raid_disk >= 0)
|
||||
return -EBUSY;
|
||||
|
||||
if (rdev->mddev->persistent) {
|
||||
if (rdev->mddev->major_version == 0)
|
||||
return -EINVAL;
|
||||
if (size > U16_MAX)
|
||||
return -EINVAL;
|
||||
} else if (!rdev->mddev->external) {
|
||||
return -EBUSY;
|
||||
}
|
||||
rdev->ppl.size = size;
|
||||
return len;
|
||||
}
|
||||
|
||||
static struct rdev_sysfs_entry rdev_ppl_size =
|
||||
__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
|
||||
|
||||
static struct attribute *rdev_default_attrs[] = {
|
||||
&rdev_state.attr,
|
||||
&rdev_errors.attr,
|
||||
@ -3141,6 +3275,8 @@ static struct attribute *rdev_default_attrs[] = {
|
||||
&rdev_recovery_start.attr,
|
||||
&rdev_bad_blocks.attr,
|
||||
&rdev_unack_bad_blocks.attr,
|
||||
&rdev_ppl_sector.attr,
|
||||
&rdev_ppl_size.attr,
|
||||
NULL,
|
||||
};
|
||||
static ssize_t
|
||||
@ -3903,6 +4039,7 @@ array_state_show(struct mddev *mddev, char *page)
|
||||
st = read_auto;
|
||||
break;
|
||||
case 0:
|
||||
spin_lock(&mddev->lock);
|
||||
if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
|
||||
st = write_pending;
|
||||
else if (mddev->in_sync)
|
||||
@ -3911,6 +4048,7 @@ array_state_show(struct mddev *mddev, char *page)
|
||||
st = active_idle;
|
||||
else
|
||||
st = active;
|
||||
spin_unlock(&mddev->lock);
|
||||
}
|
||||
else {
|
||||
if (list_empty(&mddev->disks) &&
|
||||
@ -3931,7 +4069,7 @@ static int restart_array(struct mddev *mddev);
|
||||
static ssize_t
|
||||
array_state_store(struct mddev *mddev, const char *buf, size_t len)
|
||||
{
|
||||
int err;
|
||||
int err = 0;
|
||||
enum array_state st = match_word(buf, array_states);
|
||||
|
||||
if (mddev->pers && (st == active || st == clean) && mddev->ro != 1) {
|
||||
@ -3944,18 +4082,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
|
||||
clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
|
||||
md_wakeup_thread(mddev->thread);
|
||||
wake_up(&mddev->sb_wait);
|
||||
err = 0;
|
||||
} else /* st == clean */ {
|
||||
restart_array(mddev);
|
||||
if (atomic_read(&mddev->writes_pending) == 0) {
|
||||
if (mddev->in_sync == 0) {
|
||||
mddev->in_sync = 1;
|
||||
if (mddev->safemode == 1)
|
||||
mddev->safemode = 0;
|
||||
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
|
||||
}
|
||||
err = 0;
|
||||
} else
|
||||
if (!set_in_sync(mddev))
|
||||
err = -EBUSY;
|
||||
}
|
||||
if (!err)
|
||||
@ -4013,15 +4142,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
|
||||
if (err)
|
||||
break;
|
||||
spin_lock(&mddev->lock);
|
||||
if (atomic_read(&mddev->writes_pending) == 0) {
|
||||
if (mddev->in_sync == 0) {
|
||||
mddev->in_sync = 1;
|
||||
if (mddev->safemode == 1)
|
||||
mddev->safemode = 0;
|
||||
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
|
||||
}
|
||||
err = 0;
|
||||
} else
|
||||
if (!set_in_sync(mddev))
|
||||
err = -EBUSY;
|
||||
spin_unlock(&mddev->lock);
|
||||
} else
|
||||
@ -4843,8 +4964,10 @@ array_size_store(struct mddev *mddev, const char *buf, size_t len)
|
||||
return err;
|
||||
|
||||
/* cluster raid doesn't support change array_sectors */
|
||||
if (mddev_is_clustered(mddev))
|
||||
if (mddev_is_clustered(mddev)) {
|
||||
mddev_unlock(mddev);
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
if (strncmp(buf, "default", 7) == 0) {
|
||||
if (mddev->pers)
|
||||
@ -4877,6 +5000,52 @@ static struct md_sysfs_entry md_array_size =
|
||||
__ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
|
||||
array_size_store);
|
||||
|
||||
static ssize_t
|
||||
consistency_policy_show(struct mddev *mddev, char *page)
|
||||
{
|
||||
int ret;
|
||||
|
||||
if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
|
||||
ret = sprintf(page, "journal\n");
|
||||
} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
|
||||
ret = sprintf(page, "ppl\n");
|
||||
} else if (mddev->bitmap) {
|
||||
ret = sprintf(page, "bitmap\n");
|
||||
} else if (mddev->pers) {
|
||||
if (mddev->pers->sync_request)
|
||||
ret = sprintf(page, "resync\n");
|
||||
else
|
||||
ret = sprintf(page, "none\n");
|
||||
} else {
|
||||
ret = sprintf(page, "unknown\n");
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static ssize_t
|
||||
consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
|
||||
{
|
||||
int err = 0;
|
||||
|
||||
if (mddev->pers) {
|
||||
if (mddev->pers->change_consistency_policy)
|
||||
err = mddev->pers->change_consistency_policy(mddev, buf);
|
||||
else
|
||||
err = -EBUSY;
|
||||
} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
|
||||
set_bit(MD_HAS_PPL, &mddev->flags);
|
||||
} else {
|
||||
err = -EINVAL;
|
||||
}
|
||||
|
||||
return err ? err : len;
|
||||
}
|
||||
|
||||
static struct md_sysfs_entry md_consistency_policy =
|
||||
__ATTR(consistency_policy, S_IRUGO | S_IWUSR, consistency_policy_show,
|
||||
consistency_policy_store);
|
||||
|
||||
static struct attribute *md_default_attrs[] = {
|
||||
&md_level.attr,
|
||||
&md_layout.attr,
|
||||
@ -4892,6 +5061,7 @@ static struct attribute *md_default_attrs[] = {
|
||||
&md_reshape_direction.attr,
|
||||
&md_array_size.attr,
|
||||
&max_corr_read_errors.attr,
|
||||
&md_consistency_policy.attr,
|
||||
NULL,
|
||||
};
|
||||
|
||||
@ -4976,6 +5146,7 @@ static void md_free(struct kobject *ko)
|
||||
del_gendisk(mddev->gendisk);
|
||||
put_disk(mddev->gendisk);
|
||||
}
|
||||
percpu_ref_exit(&mddev->writes_pending);
|
||||
|
||||
kfree(mddev);
|
||||
}
|
||||
@ -5001,8 +5172,19 @@ static void mddev_delayed_delete(struct work_struct *ws)
|
||||
kobject_put(&mddev->kobj);
|
||||
}
|
||||
|
||||
static void no_op(struct percpu_ref *r) {}
|
||||
|
||||
static int md_alloc(dev_t dev, char *name)
|
||||
{
|
||||
/*
|
||||
* If dev is zero, name is the name of a device to allocate with
|
||||
* an arbitrary minor number. It will be "md_???"
|
||||
* If dev is non-zero it must be a device number with a MAJOR of
|
||||
* MD_MAJOR or mdp_major. In this case, if "name" is NULL, then
|
||||
* the device is being created by opening a node in /dev.
|
||||
* If "name" is not NULL, the device is being created by
|
||||
* writing to /sys/module/md_mod/parameters/new_array.
|
||||
*/
|
||||
static DEFINE_MUTEX(disks_mutex);
|
||||
struct mddev *mddev = mddev_find(dev);
|
||||
struct gendisk *disk;
|
||||
@ -5028,7 +5210,7 @@ static int md_alloc(dev_t dev, char *name)
|
||||
if (mddev->gendisk)
|
||||
goto abort;
|
||||
|
||||
if (name) {
|
||||
if (name && !dev) {
|
||||
/* Need to ensure that 'name' is not a duplicate.
|
||||
*/
|
||||
struct mddev *mddev2;
|
||||
@ -5042,6 +5224,11 @@ static int md_alloc(dev_t dev, char *name)
|
||||
}
|
||||
spin_unlock(&all_mddevs_lock);
|
||||
}
|
||||
if (name && dev)
|
||||
/*
|
||||
* Creating /dev/mdNNN via "newarray", so adjust hold_active.
|
||||
*/
|
||||
mddev->hold_active = UNTIL_STOP;
|
||||
|
||||
error = -ENOMEM;
|
||||
mddev->queue = blk_alloc_queue(GFP_KERNEL);
|
||||
@ -5052,6 +5239,10 @@ static int md_alloc(dev_t dev, char *name)
|
||||
blk_queue_make_request(mddev->queue, md_make_request);
|
||||
blk_set_stacking_limits(&mddev->queue->limits);
|
||||
|
||||
if (percpu_ref_init(&mddev->writes_pending, no_op, 0, GFP_KERNEL) < 0)
|
||||
goto abort;
|
||||
/* We want to start with the refcount at zero */
|
||||
percpu_ref_put(&mddev->writes_pending);
|
||||
disk = alloc_disk(1 << shift);
|
||||
if (!disk) {
|
||||
blk_cleanup_queue(mddev->queue);
|
||||
@ -5108,38 +5299,48 @@ static int md_alloc(dev_t dev, char *name)
|
||||
|
||||
static struct kobject *md_probe(dev_t dev, int *part, void *data)
|
||||
{
|
||||
md_alloc(dev, NULL);
|
||||
if (create_on_open)
|
||||
md_alloc(dev, NULL);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static int add_named_array(const char *val, struct kernel_param *kp)
|
||||
{
|
||||
/* val must be "md_*" where * is not all digits.
|
||||
* We allocate an array with a large free minor number, and
|
||||
/*
|
||||
* val must be "md_*" or "mdNNN".
|
||||
* For "md_*" we allocate an array with a large free minor number, and
|
||||
* set the name to val. val must not already be an active name.
|
||||
* For "mdNNN" we allocate an array with the minor number NNN
|
||||
* which must not already be in use.
|
||||
*/
|
||||
int len = strlen(val);
|
||||
char buf[DISK_NAME_LEN];
|
||||
unsigned long devnum;
|
||||
|
||||
while (len && val[len-1] == '\n')
|
||||
len--;
|
||||
if (len >= DISK_NAME_LEN)
|
||||
return -E2BIG;
|
||||
strlcpy(buf, val, len+1);
|
||||
if (strncmp(buf, "md_", 3) != 0)
|
||||
return -EINVAL;
|
||||
return md_alloc(0, buf);
|
||||
if (strncmp(buf, "md_", 3) == 0)
|
||||
return md_alloc(0, buf);
|
||||
if (strncmp(buf, "md", 2) == 0 &&
|
||||
isdigit(buf[2]) &&
|
||||
kstrtoul(buf+2, 10, &devnum) == 0 &&
|
||||
devnum <= MINORMASK)
|
||||
return md_alloc(MKDEV(MD_MAJOR, devnum), NULL);
|
||||
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
static void md_safemode_timeout(unsigned long data)
|
||||
{
|
||||
struct mddev *mddev = (struct mddev *) data;
|
||||
|
||||
if (!atomic_read(&mddev->writes_pending)) {
|
||||
mddev->safemode = 1;
|
||||
if (mddev->external)
|
||||
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
||||
}
|
||||
mddev->safemode = 1;
|
||||
if (mddev->external)
|
||||
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
||||
|
||||
md_wakeup_thread(mddev->thread);
|
||||
}
|
||||
|
||||
@ -5185,6 +5386,13 @@ int md_run(struct mddev *mddev)
|
||||
continue;
|
||||
sync_blockdev(rdev->bdev);
|
||||
invalidate_bdev(rdev->bdev);
|
||||
if (mddev->ro != 1 &&
|
||||
(bdev_read_only(rdev->bdev) ||
|
||||
bdev_read_only(rdev->meta_bdev))) {
|
||||
mddev->ro = 1;
|
||||
if (mddev->gendisk)
|
||||
set_disk_ro(mddev->gendisk, 1);
|
||||
}
|
||||
|
||||
/* perform some consistency tests on the device.
|
||||
* We don't want the data to overlap the metadata,
|
||||
@ -5344,7 +5552,6 @@ int md_run(struct mddev *mddev)
|
||||
} else if (mddev->ro == 2) /* auto-readonly not meaningful */
|
||||
mddev->ro = 0;
|
||||
|
||||
atomic_set(&mddev->writes_pending,0);
|
||||
atomic_set(&mddev->max_corr_read_errors,
|
||||
MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
|
||||
mddev->safemode = 0;
|
||||
@ -5410,6 +5617,9 @@ out:
|
||||
static int restart_array(struct mddev *mddev)
|
||||
{
|
||||
struct gendisk *disk = mddev->gendisk;
|
||||
struct md_rdev *rdev;
|
||||
bool has_journal = false;
|
||||
bool has_readonly = false;
|
||||
|
||||
/* Complain if it has no devices */
|
||||
if (list_empty(&mddev->disks))
|
||||
@ -5418,24 +5628,21 @@ static int restart_array(struct mddev *mddev)
|
||||
return -EINVAL;
|
||||
if (!mddev->ro)
|
||||
return -EBUSY;
|
||||
if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
|
||||
struct md_rdev *rdev;
|
||||
bool has_journal = false;
|
||||
|
||||
rcu_read_lock();
|
||||
rdev_for_each_rcu(rdev, mddev) {
|
||||
if (test_bit(Journal, &rdev->flags) &&
|
||||
!test_bit(Faulty, &rdev->flags)) {
|
||||
has_journal = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
rcu_read_unlock();
|
||||
|
||||
/* Don't restart rw with journal missing/faulty */
|
||||
if (!has_journal)
|
||||
return -EINVAL;
|
||||
rcu_read_lock();
|
||||
rdev_for_each_rcu(rdev, mddev) {
|
||||
if (test_bit(Journal, &rdev->flags) &&
|
||||
!test_bit(Faulty, &rdev->flags))
|
||||
has_journal = true;
|
||||
if (bdev_read_only(rdev->bdev))
|
||||
has_readonly = true;
|
||||
}
|
||||
rcu_read_unlock();
|
||||
if (test_bit(MD_HAS_JOURNAL, &mddev->flags) && !has_journal)
|
||||
/* Don't restart rw with journal missing/faulty */
|
||||
return -EINVAL;
|
||||
if (has_readonly)
|
||||
return -EROFS;
|
||||
|
||||
mddev->safemode = 0;
|
||||
mddev->ro = 0;
|
||||
@ -5535,15 +5742,7 @@ EXPORT_SYMBOL_GPL(md_stop_writes);
|
||||
|
||||
static void mddev_detach(struct mddev *mddev)
|
||||
{
|
||||
struct bitmap *bitmap = mddev->bitmap;
|
||||
/* wait for behind writes to complete */
|
||||
if (bitmap && atomic_read(&bitmap->behind_writes) > 0) {
|
||||
pr_debug("md:%s: behind writes in progress - waiting to stop.\n",
|
||||
mdname(mddev));
|
||||
/* need to kick something here to make sure I/O goes? */
|
||||
wait_event(bitmap->behind_wait,
|
||||
atomic_read(&bitmap->behind_writes) == 0);
|
||||
}
|
||||
bitmap_wait_behind_writes(mddev);
|
||||
if (mddev->pers && mddev->pers->quiesce) {
|
||||
mddev->pers->quiesce(mddev, 1);
|
||||
mddev->pers->quiesce(mddev, 0);
|
||||
@ -5556,6 +5755,7 @@ static void mddev_detach(struct mddev *mddev)
|
||||
static void __md_stop(struct mddev *mddev)
|
||||
{
|
||||
struct md_personality *pers = mddev->pers;
|
||||
bitmap_destroy(mddev);
|
||||
mddev_detach(mddev);
|
||||
/* Ensure ->event_work is done */
|
||||
flush_workqueue(md_misc_wq);
|
||||
@ -5576,7 +5776,6 @@ void md_stop(struct mddev *mddev)
|
||||
* This is called from dm-raid
|
||||
*/
|
||||
__md_stop(mddev);
|
||||
bitmap_destroy(mddev);
|
||||
if (mddev->bio_set)
|
||||
bioset_free(mddev->bio_set);
|
||||
}
|
||||
@ -5714,7 +5913,6 @@ static int do_md_stop(struct mddev *mddev, int mode,
|
||||
if (mode == 0) {
|
||||
pr_info("md: %s stopped.\n", mdname(mddev));
|
||||
|
||||
bitmap_destroy(mddev);
|
||||
if (mddev->bitmap_info.file) {
|
||||
struct file *f = mddev->bitmap_info.file;
|
||||
spin_lock(&mddev->lock);
|
||||
@ -6493,10 +6691,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
|
||||
struct md_rdev *rdev;
|
||||
int rv;
|
||||
int fit = (num_sectors == 0);
|
||||
|
||||
/* cluster raid doesn't support update size */
|
||||
if (mddev_is_clustered(mddev))
|
||||
return -EINVAL;
|
||||
sector_t old_dev_sectors = mddev->dev_sectors;
|
||||
|
||||
if (mddev->pers->resize == NULL)
|
||||
return -EINVAL;
|
||||
@ -6525,7 +6720,9 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
|
||||
}
|
||||
rv = mddev->pers->resize(mddev, num_sectors);
|
||||
if (!rv) {
|
||||
if (mddev->queue) {
|
||||
if (mddev_is_clustered(mddev))
|
||||
md_cluster_ops->update_size(mddev, old_dev_sectors);
|
||||
else if (mddev->queue) {
|
||||
set_capacity(mddev->gendisk, mddev->array_sectors);
|
||||
revalidate_disk(mddev->gendisk);
|
||||
}
|
||||
@ -6776,6 +6973,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
|
||||
void __user *argp = (void __user *)arg;
|
||||
struct mddev *mddev = NULL;
|
||||
int ro;
|
||||
bool did_set_md_closing = false;
|
||||
|
||||
if (!md_ioctl_valid(cmd))
|
||||
return -ENOTTY;
|
||||
@ -6865,7 +7063,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
|
||||
err = -EBUSY;
|
||||
goto out;
|
||||
}
|
||||
WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags));
|
||||
set_bit(MD_CLOSING, &mddev->flags);
|
||||
did_set_md_closing = true;
|
||||
mutex_unlock(&mddev->open_mutex);
|
||||
sync_blockdev(bdev);
|
||||
}
|
||||
@ -7058,6 +7258,8 @@ unlock:
|
||||
mddev->hold_active = 0;
|
||||
mddev_unlock(mddev);
|
||||
out:
|
||||
if(did_set_md_closing)
|
||||
clear_bit(MD_CLOSING, &mddev->flags);
|
||||
return err;
|
||||
}
|
||||
#ifdef CONFIG_COMPAT
|
||||
@ -7208,8 +7410,8 @@ void md_wakeup_thread(struct md_thread *thread)
|
||||
{
|
||||
if (thread) {
|
||||
pr_debug("md: waking up MD thread %s.\n", thread->tsk->comm);
|
||||
set_bit(THREAD_WAKEUP, &thread->flags);
|
||||
wake_up(&thread->wqueue);
|
||||
if (!test_and_set_bit(THREAD_WAKEUP, &thread->flags))
|
||||
wake_up(&thread->wqueue);
|
||||
}
|
||||
}
|
||||
EXPORT_SYMBOL(md_wakeup_thread);
|
||||
@ -7756,10 +7958,13 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
|
||||
md_wakeup_thread(mddev->sync_thread);
|
||||
did_change = 1;
|
||||
}
|
||||
atomic_inc(&mddev->writes_pending);
|
||||
rcu_read_lock();
|
||||
percpu_ref_get(&mddev->writes_pending);
|
||||
smp_mb(); /* Match smp_mb in set_in_sync() */
|
||||
if (mddev->safemode == 1)
|
||||
mddev->safemode = 0;
|
||||
if (mddev->in_sync) {
|
||||
/* sync_checkers is always 0 when writes_pending is in per-cpu mode */
|
||||
if (mddev->in_sync || !mddev->sync_checkers) {
|
||||
spin_lock(&mddev->lock);
|
||||
if (mddev->in_sync) {
|
||||
mddev->in_sync = 0;
|
||||
@ -7770,6 +7975,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
|
||||
}
|
||||
spin_unlock(&mddev->lock);
|
||||
}
|
||||
rcu_read_unlock();
|
||||
if (did_change)
|
||||
sysfs_notify_dirent_safe(mddev->sysfs_state);
|
||||
wait_event(mddev->sb_wait,
|
||||
@ -7777,15 +7983,38 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
|
||||
}
|
||||
EXPORT_SYMBOL(md_write_start);
|
||||
|
||||
/* md_write_inc can only be called when md_write_start() has
|
||||
* already been called at least once for the current request.
|
||||
* It increments the counter and is useful when a single request
|
||||
* is split into several parts. Each part causes an increment and
|
||||
* so needs a matching md_write_end().
|
||||
* Unlike md_write_start(), it is safe to call md_write_inc() inside
|
||||
* a spinlocked region.
|
||||
*/
|
||||
void md_write_inc(struct mddev *mddev, struct bio *bi)
|
||||
{
|
||||
if (bio_data_dir(bi) != WRITE)
|
||||
return;
|
||||
WARN_ON_ONCE(mddev->in_sync || mddev->ro);
|
||||
percpu_ref_get(&mddev->writes_pending);
|
||||
}
|
||||
EXPORT_SYMBOL(md_write_inc);
|
||||
|
||||
void md_write_end(struct mddev *mddev)
|
||||
{
|
||||
if (atomic_dec_and_test(&mddev->writes_pending)) {
|
||||
if (mddev->safemode == 2)
|
||||
md_wakeup_thread(mddev->thread);
|
||||
else if (mddev->safemode_delay)
|
||||
mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
|
||||
}
|
||||
percpu_ref_put(&mddev->writes_pending);
|
||||
|
||||
if (mddev->safemode == 2)
|
||||
md_wakeup_thread(mddev->thread);
|
||||
else if (mddev->safemode_delay)
|
||||
/* The roundup() ensures this only performs locking once
|
||||
* every ->safemode_delay jiffies
|
||||
*/
|
||||
mod_timer(&mddev->safemode_timer,
|
||||
roundup(jiffies, mddev->safemode_delay) +
|
||||
mddev->safemode_delay);
|
||||
}
|
||||
|
||||
EXPORT_SYMBOL(md_write_end);
|
||||
|
||||
/* md_allow_write(mddev)
|
||||
@@ -8385,9 +8614,8 @@ void md_check_recovery(struct mddev *mddev)
(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
test_bit(MD_RELOAD_SB, &mddev->flags) ||
(mddev->external == 0 && mddev->safemode == 1) ||
(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
(mddev->safemode == 2
&& !mddev->in_sync && mddev->recovery_cp == MaxSector)
))
return;
@@ -8434,27 +8662,12 @@ void md_check_recovery(struct mddev *mddev)
rdev->raid_disk < 0)
md_kick_rdev_from_array(rdev);
}

if (test_and_clear_bit(MD_RELOAD_SB, &mddev->flags))
md_reload_sb(mddev, mddev->good_device_nr);
}

if (!mddev->external) {
int did_change = 0;
if (!mddev->external && !mddev->in_sync) {
spin_lock(&mddev->lock);
if (mddev->safemode &&
!atomic_read(&mddev->writes_pending) &&
!mddev->in_sync &&
mddev->recovery_cp == MaxSector) {
mddev->in_sync = 1;
did_change = 1;
set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
}
if (mddev->safemode == 1)
mddev->safemode = 0;
set_in_sync(mddev);
spin_unlock(&mddev->lock);
if (did_change)
sysfs_notify_dirent_safe(mddev->sysfs_state);
}

if (mddev->sb_flags)
@@ -8747,6 +8960,18 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
int role, ret;
char b[BDEVNAME_SIZE];

/*
 * If size is changed in another node then we need to
 * do resize as well.
 */
if (mddev->dev_sectors != le64_to_cpu(sb->size)) {
ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size));
if (ret)
pr_info("md-cluster: resize failed\n");
else
bitmap_update_sb(mddev->bitmap);
}

/* Check for change of roles in the active devices */
rdev_for_each(rdev2, mddev) {
if (test_bit(Faulty, &rdev2->flags))
@@ -8997,6 +9222,7 @@ static int set_ro(const char *val, struct kernel_param *kp)
module_param_call(start_ro, set_ro, get_ro, NULL, S_IRUSR|S_IWUSR);
module_param(start_dirty_degraded, int, S_IRUGO|S_IWUSR);
module_param_call(new_array, add_named_array, NULL, NULL, S_IWUSR);
module_param(create_on_open, bool, S_IRUSR|S_IWUSR);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("MD RAID framework");

@@ -122,6 +122,13 @@ struct md_rdev {
* sysfs entry */

struct badblocks badblocks;

struct {
short offset; /* Offset from superblock to start of PPL.
 * Not used by external metadata. */
unsigned int size; /* Size in sectors of the PPL space */
sector_t sector; /* First sector of the PPL space */
} ppl;
};
enum flag_bits {
Faulty, /* device is known to have a fault */
@@ -219,9 +226,6 @@ enum mddev_flags {
* it then */
MD_JOURNAL_CLEAN, /* A raid with journal is already clean */
MD_HAS_JOURNAL, /* The raid array has journal feature set */
MD_RELOAD_SB, /* Reload the superblock because another node
 * updated it.
 */
MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
 * already took resync lock, need to
 * release the lock */
@@ -229,6 +233,7 @@ enum mddev_flags {
* supported as calls to md_error() will
* never cause the array to become failed.
*/
MD_HAS_PPL, /* The raid array has PPL feature set */
};

enum mddev_sb_flags {
@@ -404,7 +409,8 @@ struct mddev {
*/
unsigned int safemode_delay;
struct timer_list safemode_timer;
atomic_t writes_pending;
struct percpu_ref writes_pending;
int sync_checkers; /* # of threads checking writes_pending */
struct request_queue *queue; /* for plugging ... */

struct bitmap *bitmap; /* the bitmap for the device */
@@ -540,6 +546,8 @@ struct md_personality
/* congested implements bdi.congested_fn().
 * Will not be called while array is 'suspended' */
int (*congested)(struct mddev *mddev, int bits);
/* Changes the consistency policy of an active array. */
int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
};

struct md_sysfs_entry {
@@ -641,6 +649,7 @@ extern void md_wakeup_thread(struct md_thread *thread);
extern void md_check_recovery(struct mddev *mddev);
extern void md_reap_sync_thread(struct mddev *mddev);
extern void md_write_start(struct mddev *mddev, struct bio *bi);
extern void md_write_inc(struct mddev *mddev, struct bio *bi);
extern void md_write_end(struct mddev *mddev);
extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
@@ -716,4 +725,58 @@ static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio
!bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors)
mddev->queue->limits.max_write_zeroes_sectors = 0;
}

/* Maximum size of each resync request */
#define RESYNC_BLOCK_SIZE (64*1024)
#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)

/* for managing resync I/O pages */
struct resync_pages {
unsigned idx; /* for get/put page from the pool */
void *raid_bio;
struct page *pages[RESYNC_PAGES];
};

static inline int resync_alloc_pages(struct resync_pages *rp,
gfp_t gfp_flags)
{
int i;

for (i = 0; i < RESYNC_PAGES; i++) {
rp->pages[i] = alloc_page(gfp_flags);
if (!rp->pages[i])
goto out_free;
}

return 0;

out_free:
while (--i >= 0)
put_page(rp->pages[i]);
return -ENOMEM;
}

static inline void resync_free_pages(struct resync_pages *rp)
{
int i;

for (i = 0; i < RESYNC_PAGES; i++)
put_page(rp->pages[i]);
}

static inline void resync_get_all_pages(struct resync_pages *rp)
{
int i;

for (i = 0; i < RESYNC_PAGES; i++)
get_page(rp->pages[i]);
}

static inline struct page *resync_fetch_page(struct resync_pages *rp,
unsigned idx)
{
if (WARN_ON_ONCE(idx >= RESYNC_PAGES))
return NULL;
return rp->pages[idx];
}
#endif /* _MD_MD_H */
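The resync_pages helpers above centralise the page bookkeeping that the personalities use for resync buffers. A hedged sketch of how a caller might pair them when filling a resync bio; error handling is trimmed and example_fill_resync_bio() is a made-up name, not code from this commit:

/*
 * Illustrative only: allocate the pool, attach every page to the bio,
 * and rely on resync_free_pages() when the buffer is later retired.
 */
static int example_fill_resync_bio(struct bio *bio, struct resync_pages *rp)
{
        int idx;

        if (resync_alloc_pages(rp, GFP_NOIO))
                return -ENOMEM;

        rp->raid_bio = bio->bi_private;         /* remember the owning request */
        for (idx = 0; idx < RESYNC_PAGES; idx++)
                bio_add_page(bio, resync_fetch_page(rp, idx), PAGE_SIZE, 0);

        return 0;
}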
@@ -29,7 +29,8 @@
#define UNSUPPORTED_MDDEV_FLAGS \
((1L << MD_HAS_JOURNAL) | \
(1L << MD_JOURNAL_CLEAN) | \
(1L << MD_FAILFAST_SUPPORTED))
(1L << MD_FAILFAST_SUPPORTED) |\
(1L << MD_HAS_PPL))

static int raid0_congested(struct mddev *mddev, int bits)
{
@@ -462,53 +463,54 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
{
struct strip_zone *zone;
struct md_rdev *tmp_dev;
struct bio *split;
sector_t bio_sector;
sector_t sector;
unsigned chunk_sects;
unsigned sectors;

if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
md_flush_request(mddev, bio);
return;
}

do {
sector_t bio_sector = bio->bi_iter.bi_sector;
sector_t sector = bio_sector;
unsigned chunk_sects = mddev->chunk_sectors;
bio_sector = bio->bi_iter.bi_sector;
sector = bio_sector;
chunk_sects = mddev->chunk_sectors;

unsigned sectors = chunk_sects -
(likely(is_power_of_2(chunk_sects))
? (sector & (chunk_sects-1))
: sector_div(sector, chunk_sects));
sectors = chunk_sects -
(likely(is_power_of_2(chunk_sects))
? (sector & (chunk_sects-1))
: sector_div(sector, chunk_sects));

/* Restore due to sector_div */
sector = bio_sector;
/* Restore due to sector_div */
sector = bio_sector;

if (sectors < bio_sectors(bio)) {
split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
bio_chain(split, bio);
} else {
split = bio;
}
if (sectors < bio_sectors(bio)) {
struct bio *split = bio_split(bio, sectors, GFP_NOIO, mddev->bio_set);
bio_chain(split, bio);
generic_make_request(bio);
bio = split;
}

zone = find_zone(mddev->private, &sector);
tmp_dev = map_sector(mddev, zone, sector, &sector);
split->bi_bdev = tmp_dev->bdev;
split->bi_iter.bi_sector = sector + zone->dev_start +
tmp_dev->data_offset;
zone = find_zone(mddev->private, &sector);
tmp_dev = map_sector(mddev, zone, sector, &sector);
bio->bi_bdev = tmp_dev->bdev;
bio->bi_iter.bi_sector = sector + zone->dev_start +
tmp_dev->data_offset;

if (unlikely((bio_op(split) == REQ_OP_DISCARD) &&
!blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
/* Just ignore it */
bio_endio(split);
} else {
if (mddev->gendisk)
trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
split, disk_devt(mddev->gendisk),
bio_sector);
mddev_check_writesame(mddev, split);
mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
!blk_queue_discard(bdev_get_queue(bio->bi_bdev)))) {
/* Just ignore it */
bio_endio(bio);
} else {
if (mddev->gendisk)
trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
bio, disk_devt(mddev->gendisk),
bio_sector);
mddev_check_writesame(mddev, bio);
mddev_check_write_zeroes(mddev, bio);
generic_make_request(bio);
}
}

static void raid0_status(struct seq_file *seq, struct mddev *mddev)
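The split length computed above is simply the distance from the bio's start sector to the end of its chunk. With made-up numbers: 64 KiB chunks give chunk_sects = 128, and a bio starting at sector 300 has 300 & 127 = 44 sectors already inside that chunk, so sectors = 128 - 44 = 84 and the remainder restarts exactly on the next chunk boundary (sector 384 = 3 * 128). A toy restatement of that arithmetic, not part of the commit:

/* Toy helper mirroring raid0's chunk-boundary math; chunk_sectors_left() is illustrative only. */
static unsigned chunk_sectors_left(sector_t sector, unsigned chunk_sects)
{
        /* power-of-two fast path, same as the likely() branch above */
        if (is_power_of_2(chunk_sects))
                return chunk_sects - (sector & (chunk_sects - 1));
        /* sector_div() works on the local copy and returns the remainder */
        return chunk_sects - sector_div(sector, chunk_sects);
}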
File diff suppressed because it is too large
@@ -84,6 +84,7 @@ struct r1conf {
*/
wait_queue_head_t wait_barrier;
spinlock_t resync_lock;
atomic_t nr_sync_pending;
atomic_t *nr_pending;
atomic_t *nr_waiting;
atomic_t *nr_queued;
@@ -107,6 +108,8 @@ struct r1conf {
mempool_t *r1bio_pool;
mempool_t *r1buf_pool;

struct bio_set *bio_split;

/* temporary buffer to synchronous IO when attempting to repair
 * a read error.
 */
@@ -153,9 +156,13 @@ struct r1bio {
int read_disk;

struct list_head retry_list;
/* Next two are only valid when R1BIO_BehindIO is set */
struct bio_vec *behind_bvecs;
int behind_page_count;

/*
 * When R1BIO_BehindIO is set, we store pages for write behind
 * in behind_master_bio.
 */
struct bio *behind_master_bio;

/*
 * if the IO is in WRITE direction, then multiple bios are used.
 * We choose the number when they are allocated.
File diff suppressed because it is too large
@@ -82,6 +82,7 @@ struct r10conf {
mempool_t *r10bio_pool;
mempool_t *r10buf_pool;
struct page *tmppage;
struct bio_set *bio_split;

/* When taking over an array from a different personality, we store
 * the new thread here until we fully activate the array.
@@ -30,6 +30,7 @@
* underneath hardware sector size. only works with PAGE_SIZE == 4096
*/
#define BLOCK_SECTORS (8)
#define BLOCK_SECTOR_SHIFT (3)

/*
 * log->max_free_space is min(1/4 disk size, 10G reclaimable space).
@@ -43,7 +44,7 @@
/* wake up reclaim thread periodically */
#define R5C_RECLAIM_WAKEUP_INTERVAL (30 * HZ)
/* start flush with these full stripes */
#define R5C_FULL_STRIPE_FLUSH_BATCH 256
#define R5C_FULL_STRIPE_FLUSH_BATCH(conf) (conf->max_nr_stripes / 4)
/* reclaim stripes in groups */
#define R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2)

@@ -307,8 +308,7 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
}

static void
r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
struct bio_list *return_bi)
r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev)
{
struct bio *wbi, *wbi2;

@@ -317,24 +317,21 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
while (wbi && wbi->bi_iter.bi_sector <
dev->sector + STRIPE_SECTORS) {
wbi2 = r5_next_bio(wbi, dev->sector);
if (!raid5_dec_bi_active_stripes(wbi)) {
md_write_end(conf->mddev);
bio_list_add(return_bi, wbi);
}
md_write_end(conf->mddev);
bio_endio(wbi);
wbi = wbi2;
}
}

void r5c_handle_cached_data_endio(struct r5conf *conf,
struct stripe_head *sh, int disks, struct bio_list *return_bi)
struct stripe_head *sh, int disks)
{
int i;

for (i = sh->disks; i--; ) {
if (sh->dev[i].written) {
set_bit(R5_UPTODATE, &sh->dev[i].flags);
r5c_return_dev_pending_writes(conf, &sh->dev[i],
return_bi);
r5c_return_dev_pending_writes(conf, &sh->dev[i]);
bitmap_endwrite(conf->mddev->bitmap, sh->sector,
STRIPE_SECTORS,
!test_bit(STRIPE_DEGRADED, &sh->state),
@@ -343,6 +340,8 @@ void r5c_handle_cached_data_endio(struct r5conf *conf,
}
}

void r5l_wake_reclaim(struct r5l_log *log, sector_t space);

/* Check whether we should flush some stripes to free up stripe cache */
void r5c_check_stripe_cache_usage(struct r5conf *conf)
{
@@ -381,7 +380,7 @@ void r5c_check_cached_full_stripe(struct r5conf *conf)
* or a full stripe (chunk size / 4k stripes).
*/
if (atomic_read(&conf->r5c_cached_full_stripes) >=
min(R5C_FULL_STRIPE_FLUSH_BATCH,
min(R5C_FULL_STRIPE_FLUSH_BATCH(conf),
conf->chunk_sectors >> STRIPE_SHIFT))
r5l_wake_reclaim(conf->log, 0);
}
@@ -590,7 +589,7 @@ static void r5l_log_endio(struct bio *bio)

spin_lock_irqsave(&log->io_list_lock, flags);
__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
if (log->need_cache_flush)
if (log->need_cache_flush && !list_empty(&io->stripe_list))
r5l_move_to_end_ios(log);
else
r5l_log_run_stripes(log);
@@ -618,9 +617,11 @@ static void r5l_log_endio(struct bio *bio)
bio_endio(bi);
atomic_dec(&io->pending_stripe);
}
if (atomic_read(&io->pending_stripe) == 0)
__r5l_stripe_write_finished(io);
}

/* finish flush only io_unit and PAYLOAD_FLUSH only io_unit */
if (atomic_read(&io->pending_stripe) == 0)
__r5l_stripe_write_finished(io);
}

static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
@@ -842,6 +843,41 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
r5_reserve_log_entry(log, io);
}

static void r5l_append_flush_payload(struct r5l_log *log, sector_t sect)
{
struct mddev *mddev = log->rdev->mddev;
struct r5conf *conf = mddev->private;
struct r5l_io_unit *io;
struct r5l_payload_flush *payload;
int meta_size;

/*
 * payload_flush requires extra writes to the journal.
 * To avoid handling the extra IO in quiesce, just skip
 * flush_payload
 */
if (conf->quiesce)
return;

mutex_lock(&log->io_mutex);
meta_size = sizeof(struct r5l_payload_flush) + sizeof(__le64);

if (r5l_get_meta(log, meta_size)) {
mutex_unlock(&log->io_mutex);
return;
}

/* current implementation is one stripe per flush payload */
io = log->current_io;
payload = page_address(io->meta_page) + io->meta_offset;
payload->header.type = cpu_to_le16(R5LOG_PAYLOAD_FLUSH);
payload->header.flags = cpu_to_le16(0);
payload->size = cpu_to_le32(sizeof(__le64));
payload->flush_stripes[0] = cpu_to_le64(sect);
io->meta_offset += meta_size;
mutex_unlock(&log->io_mutex);
}

static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
int data_pages, int parity_pages)
{
@@ -1393,7 +1429,7 @@ static void r5c_do_reclaim(struct r5conf *conf)
stripes_to_flush = R5C_RECLAIM_STRIPE_GROUP;
else if (total_cached > conf->min_nr_stripes * 1 / 2 ||
atomic_read(&conf->r5c_cached_full_stripes) - flushing_full >
R5C_FULL_STRIPE_FLUSH_BATCH)
R5C_FULL_STRIPE_FLUSH_BATCH(conf))
/*
 * if stripe cache pressure is moderate, or if there are many full
 * stripes, flush all full stripes
@@ -1552,6 +1588,8 @@ bool r5l_log_disk_error(struct r5conf *conf)
return ret;
}

#define R5L_RECOVERY_PAGE_POOL_SIZE 256

struct r5l_recovery_ctx {
struct page *meta_page; /* current meta */
sector_t meta_total_blocks; /* total size of current meta and data */
@@ -1560,18 +1598,131 @@ struct r5l_recovery_ctx {
int data_parity_stripes; /* number of data_parity stripes */
int data_only_stripes; /* number of data_only stripes */
struct list_head cached_list;

/*
 * read ahead page pool (ra_pool)
 * in recovery, log is read sequentially. It is not efficient to
 * read every page with sync_page_io(). The read ahead page pool
 * reads multiple pages with one IO, so further log read can
 * just copy data from the pool.
 */
struct page *ra_pool[R5L_RECOVERY_PAGE_POOL_SIZE];
sector_t pool_offset; /* offset of first page in the pool */
int total_pages; /* total allocated pages */
int valid_pages; /* pages with valid data */
struct bio *ra_bio; /* bio to do the read ahead */
};

static int r5l_recovery_allocate_ra_pool(struct r5l_log *log,
struct r5l_recovery_ctx *ctx)
{
struct page *page;

ctx->ra_bio = bio_alloc_bioset(GFP_KERNEL, BIO_MAX_PAGES, log->bs);
if (!ctx->ra_bio)
return -ENOMEM;

ctx->valid_pages = 0;
ctx->total_pages = 0;
while (ctx->total_pages < R5L_RECOVERY_PAGE_POOL_SIZE) {
page = alloc_page(GFP_KERNEL);

if (!page)
break;
ctx->ra_pool[ctx->total_pages] = page;
ctx->total_pages += 1;
}

if (ctx->total_pages == 0) {
bio_put(ctx->ra_bio);
return -ENOMEM;
}

ctx->pool_offset = 0;
return 0;
}

static void r5l_recovery_free_ra_pool(struct r5l_log *log,
struct r5l_recovery_ctx *ctx)
{
int i;

for (i = 0; i < ctx->total_pages; ++i)
put_page(ctx->ra_pool[i]);
bio_put(ctx->ra_bio);
}

/*
 * fetch ctx->valid_pages pages from offset
 * In normal cases, ctx->valid_pages == ctx->total_pages after the call.
 * However, if the offset is close to the end of the journal device,
 * ctx->valid_pages could be smaller than ctx->total_pages
 */
static int r5l_recovery_fetch_ra_pool(struct r5l_log *log,
struct r5l_recovery_ctx *ctx,
sector_t offset)
{
bio_reset(ctx->ra_bio);
ctx->ra_bio->bi_bdev = log->rdev->bdev;
bio_set_op_attrs(ctx->ra_bio, REQ_OP_READ, 0);
ctx->ra_bio->bi_iter.bi_sector = log->rdev->data_offset + offset;

ctx->valid_pages = 0;
ctx->pool_offset = offset;

while (ctx->valid_pages < ctx->total_pages) {
bio_add_page(ctx->ra_bio,
ctx->ra_pool[ctx->valid_pages], PAGE_SIZE, 0);
ctx->valid_pages += 1;

offset = r5l_ring_add(log, offset, BLOCK_SECTORS);

if (offset == 0) /* reached end of the device */
break;
}

return submit_bio_wait(ctx->ra_bio);
}

/*
 * try read a page from the read ahead page pool, if the page is not in the
 * pool, call r5l_recovery_fetch_ra_pool
 */
static int r5l_recovery_read_page(struct r5l_log *log,
struct r5l_recovery_ctx *ctx,
struct page *page,
sector_t offset)
{
int ret;

if (offset < ctx->pool_offset ||
offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS) {
ret = r5l_recovery_fetch_ra_pool(log, ctx, offset);
if (ret)
return ret;
}

BUG_ON(offset < ctx->pool_offset ||
offset >= ctx->pool_offset + ctx->valid_pages * BLOCK_SECTORS);

memcpy(page_address(page),
page_address(ctx->ra_pool[(offset - ctx->pool_offset) >>
BLOCK_SECTOR_SHIFT]),
PAGE_SIZE);
return 0;
}
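Because each pool page covers BLOCK_SECTORS (8) log sectors, the memcpy above indexes the pool with (offset - pool_offset) >> BLOCK_SECTOR_SHIFT; for example, with pool_offset = 1024 and offset = 1048 that is (1048 - 1024) >> 3 = 3, so the data comes from ra_pool[3]. A toy restatement of that lookup (example_ra_lookup() is illustrative, not in the commit):

/* Illustrative only: the sector-to-pool-page mapping used above. */
static struct page *example_ra_lookup(struct r5l_recovery_ctx *ctx, sector_t offset)
{
        /* caller must have ensured offset falls inside the currently fetched window */
        return ctx->ra_pool[(offset - ctx->pool_offset) >> BLOCK_SECTOR_SHIFT];
}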
static int r5l_recovery_read_meta_block(struct r5l_log *log,
struct r5l_recovery_ctx *ctx)
{
struct page *page = ctx->meta_page;
struct r5l_meta_block *mb;
u32 crc, stored_crc;
int ret;

if (!sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page, REQ_OP_READ, 0,
false))
return -EIO;
ret = r5l_recovery_read_page(log, ctx, page, ctx->pos);
if (ret != 0)
return ret;

mb = page_address(page);
stored_crc = le32_to_cpu(mb->checksum);
@@ -1653,8 +1804,7 @@ static void r5l_recovery_load_data(struct r5l_log *log,
raid5_compute_sector(conf,
le64_to_cpu(payload->location), 0,
&dd_idx, sh);
sync_page_io(log->rdev, log_offset, PAGE_SIZE,
sh->dev[dd_idx].page, REQ_OP_READ, 0, false);
r5l_recovery_read_page(log, ctx, sh->dev[dd_idx].page, log_offset);
sh->dev[dd_idx].log_checksum =
le32_to_cpu(payload->checksum[0]);
ctx->meta_total_blocks += BLOCK_SECTORS;
@@ -1673,17 +1823,15 @@ static void r5l_recovery_load_parity(struct r5l_log *log,
struct r5conf *conf = mddev->private;

ctx->meta_total_blocks += BLOCK_SECTORS * conf->max_degraded;
sync_page_io(log->rdev, log_offset, PAGE_SIZE,
sh->dev[sh->pd_idx].page, REQ_OP_READ, 0, false);
r5l_recovery_read_page(log, ctx, sh->dev[sh->pd_idx].page, log_offset);
sh->dev[sh->pd_idx].log_checksum =
le32_to_cpu(payload->checksum[0]);
set_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags);

if (sh->qd_idx >= 0) {
sync_page_io(log->rdev,
r5l_ring_add(log, log_offset, BLOCK_SECTORS),
PAGE_SIZE, sh->dev[sh->qd_idx].page,
REQ_OP_READ, 0, false);
r5l_recovery_read_page(
log, ctx, sh->dev[sh->qd_idx].page,
r5l_ring_add(log, log_offset, BLOCK_SECTORS));
sh->dev[sh->qd_idx].log_checksum =
le32_to_cpu(payload->checksum[1]);
set_bit(R5_Wantwrite, &sh->dev[sh->qd_idx].flags);
@@ -1814,14 +1962,15 @@ r5c_recovery_replay_stripes(struct list_head *cached_stripe_list,

/* if matches return 0; otherwise return -EINVAL */
static int
r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
r5l_recovery_verify_data_checksum(struct r5l_log *log,
struct r5l_recovery_ctx *ctx,
struct page *page,
sector_t log_offset, __le32 log_checksum)
{
void *addr;
u32 checksum;

sync_page_io(log->rdev, log_offset, PAGE_SIZE,
page, REQ_OP_READ, 0, false);
r5l_recovery_read_page(log, ctx, page, log_offset);
addr = kmap_atomic(page);
checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE);
kunmap_atomic(addr);
@@ -1843,6 +1992,7 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
sector_t log_offset = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
struct page *page;
struct r5l_payload_data_parity *payload;
struct r5l_payload_flush *payload_flush;

page = alloc_page(GFP_KERNEL);
if (!page)
@@ -1850,33 +2000,42 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,

while (mb_offset < le32_to_cpu(mb->meta_size)) {
payload = (void *)mb + mb_offset;
payload_flush = (void *)mb + mb_offset;

if (payload->header.type == R5LOG_PAYLOAD_DATA) {
if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) {
if (r5l_recovery_verify_data_checksum(
log, page, log_offset,
log, ctx, page, log_offset,
payload->checksum[0]) < 0)
goto mismatch;
} else if (payload->header.type == R5LOG_PAYLOAD_PARITY) {
} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_PARITY) {
if (r5l_recovery_verify_data_checksum(
log, page, log_offset,
log, ctx, page, log_offset,
payload->checksum[0]) < 0)
goto mismatch;
if (conf->max_degraded == 2 && /* q for RAID 6 */
r5l_recovery_verify_data_checksum(
log, page,
log, ctx, page,
r5l_ring_add(log, log_offset,
BLOCK_SECTORS),
payload->checksum[1]) < 0)
goto mismatch;
} else /* not R5LOG_PAYLOAD_DATA or R5LOG_PAYLOAD_PARITY */
} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
/* nothing to do for R5LOG_PAYLOAD_FLUSH here */
} else /* not R5LOG_PAYLOAD_DATA/PARITY/FLUSH */
goto mismatch;

log_offset = r5l_ring_add(log, log_offset,
le32_to_cpu(payload->size));
if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
mb_offset += sizeof(struct r5l_payload_flush) +
le32_to_cpu(payload_flush->size);
} else {
/* DATA or PARITY payload */
log_offset = r5l_ring_add(log, log_offset,
le32_to_cpu(payload->size));
mb_offset += sizeof(struct r5l_payload_data_parity) +
sizeof(__le32) *
(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
}

mb_offset += sizeof(struct r5l_payload_data_parity) +
sizeof(__le32) *
(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
}

put_page(page);
@@ -1904,6 +2063,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
struct r5conf *conf = mddev->private;
struct r5l_meta_block *mb;
struct r5l_payload_data_parity *payload;
struct r5l_payload_flush *payload_flush;
int mb_offset;
sector_t log_offset;
sector_t stripe_sect;
@@ -1929,7 +2089,31 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
int dd;

payload = (void *)mb + mb_offset;
stripe_sect = (payload->header.type == R5LOG_PAYLOAD_DATA) ?
payload_flush = (void *)mb + mb_offset;

if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
int i, count;

count = le32_to_cpu(payload_flush->size) / sizeof(__le64);
for (i = 0; i < count; ++i) {
stripe_sect = le64_to_cpu(payload_flush->flush_stripes[i]);
sh = r5c_recovery_lookup_stripe(cached_stripe_list,
stripe_sect);
if (sh) {
WARN_ON(test_bit(STRIPE_R5C_CACHING, &sh->state));
r5l_recovery_reset_stripe(sh);
list_del_init(&sh->lru);
raid5_release_stripe(sh);
}
}

mb_offset += sizeof(struct r5l_payload_flush) +
le32_to_cpu(payload_flush->size);
continue;
}

/* DATA or PARITY payload */
stripe_sect = (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) ?
raid5_compute_sector(
conf, le64_to_cpu(payload->location), 0, &dd,
NULL)
@@ -1967,7 +2151,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
list_add_tail(&sh->lru, cached_stripe_list);
}

if (payload->header.type == R5LOG_PAYLOAD_DATA) {
if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) {
if (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags)) {
r5l_recovery_replay_one_stripe(conf, sh, ctx);
@@ -1975,7 +2159,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
}
r5l_recovery_load_data(log, sh, ctx, payload,
log_offset);
} else if (payload->header.type == R5LOG_PAYLOAD_PARITY)
} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_PARITY)
r5l_recovery_load_parity(log, sh, ctx, payload,
log_offset);
else
@@ -2177,7 +2361,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
payload = (void *)mb + offset;
payload->header.type = cpu_to_le16(
R5LOG_PAYLOAD_DATA);
payload->size = BLOCK_SECTORS;
payload->size = cpu_to_le32(BLOCK_SECTORS);
payload->location = cpu_to_le64(
raid5_compute_blocknr(sh, i, 0));
addr = kmap_atomic(dev->page);
@@ -2241,55 +2425,70 @@ static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
static int r5l_recovery_log(struct r5l_log *log)
{
struct mddev *mddev = log->rdev->mddev;
struct r5l_recovery_ctx ctx;
struct r5l_recovery_ctx *ctx;
int ret;
sector_t pos;

ctx.pos = log->last_checkpoint;
ctx.seq = log->last_cp_seq;
ctx.meta_page = alloc_page(GFP_KERNEL);
ctx.data_only_stripes = 0;
ctx.data_parity_stripes = 0;
INIT_LIST_HEAD(&ctx.cached_list);

if (!ctx.meta_page)
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
if (!ctx)
return -ENOMEM;

ret = r5c_recovery_flush_log(log, &ctx);
__free_page(ctx.meta_page);
ctx->pos = log->last_checkpoint;
ctx->seq = log->last_cp_seq;
INIT_LIST_HEAD(&ctx->cached_list);
ctx->meta_page = alloc_page(GFP_KERNEL);

if (!ctx->meta_page) {
ret = -ENOMEM;
goto meta_page;
}

if (r5l_recovery_allocate_ra_pool(log, ctx) != 0) {
ret = -ENOMEM;
goto ra_pool;
}

ret = r5c_recovery_flush_log(log, ctx);

if (ret)
return ret;
goto error;

pos = ctx.pos;
ctx.seq += 10000;
pos = ctx->pos;
ctx->seq += 10000;


if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
if ((ctx->data_only_stripes == 0) && (ctx->data_parity_stripes == 0))
pr_debug("md/raid:%s: starting from clean shutdown\n",
mdname(mddev));
else
pr_debug("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",
mdname(mddev), ctx.data_only_stripes,
ctx.data_parity_stripes);
mdname(mddev), ctx->data_only_stripes,
ctx->data_parity_stripes);

if (ctx.data_only_stripes == 0) {
log->next_checkpoint = ctx.pos;
r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
} else if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
if (ctx->data_only_stripes == 0) {
log->next_checkpoint = ctx->pos;
r5l_log_write_empty_meta_block(log, ctx->pos, ctx->seq++);
ctx->pos = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
} else if (r5c_recovery_rewrite_data_only_stripes(log, ctx)) {
pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
mdname(mddev));
return -EIO;
ret = -EIO;
goto error;
}

log->log_start = ctx.pos;
log->seq = ctx.seq;
log->log_start = ctx->pos;
log->seq = ctx->seq;
log->last_checkpoint = pos;
r5l_write_super(log, pos);

r5c_recovery_flush_data_only_stripes(log, &ctx);
return 0;
r5c_recovery_flush_data_only_stripes(log, ctx);
ret = 0;
error:
r5l_recovery_free_ra_pool(log, ctx);
ra_pool:
__free_page(ctx->meta_page);
meta_page:
kfree(ctx);
return ret;
}

static void r5l_write_super(struct r5l_log *log, sector_t cp)
@@ -2618,11 +2817,11 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
atomic_dec(&conf->r5c_flushing_full_stripes);
atomic_dec(&conf->r5c_cached_full_stripes);
}

r5l_append_flush_payload(log, sh->sector);
}

int
r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
struct stripe_head_state *s)
int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh)
{
struct r5conf *conf = sh->raid_conf;
int pages = 0;
@@ -2785,6 +2984,10 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
{
struct request_queue *q = bdev_get_queue(rdev->bdev);
struct r5l_log *log;
char b[BDEVNAME_SIZE];

pr_debug("md/raid:%s: using device %s as journal\n",
mdname(conf->mddev), bdevname(rdev->bdev, b));

if (PAGE_SIZE != 4096)
return -EINVAL;
@@ -2887,8 +3090,13 @@ io_kc:
return -EINVAL;
}

void r5l_exit_log(struct r5l_log *log)
void r5l_exit_log(struct r5conf *conf)
{
struct r5l_log *log = conf->log;

conf->log = NULL;
synchronize_rcu();

flush_work(&log->disable_writeback_work);
md_unregister_thread(&log->reclaim_thread);
mempool_destroy(log->meta_pool);
drivers/md/raid5-log.h (new file, 115 lines)
@@ -0,0 +1,115 @@
#ifndef _RAID5_LOG_H
#define _RAID5_LOG_H

extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
extern void r5l_exit_log(struct r5conf *conf);
extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
extern void r5l_write_stripe_run(struct r5l_log *log);
extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
extern void r5l_stripe_write_finished(struct stripe_head *sh);
extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
extern void r5l_quiesce(struct r5l_log *log, int state);
extern bool r5l_log_disk_error(struct r5conf *conf);
extern bool r5c_is_writeback(struct r5l_log *log);
extern int
r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s, int disks);
extern void
r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s);
extern void r5c_release_extra_page(struct stripe_head *sh);
extern void r5c_use_extra_page(struct stripe_head *sh);
extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
extern void r5c_handle_cached_data_endio(struct r5conf *conf,
struct stripe_head *sh, int disks);
extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh);
extern void r5c_make_stripe_write_out(struct stripe_head *sh);
extern void r5c_flush_cache(struct r5conf *conf, int num);
extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
extern void r5c_check_cached_full_stripe(struct r5conf *conf);
extern struct md_sysfs_entry r5c_journal_mode;
extern void r5c_update_on_rdev_error(struct mddev *mddev);
extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);

extern struct dma_async_tx_descriptor *
ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
struct dma_async_tx_descriptor *tx);
extern int ppl_init_log(struct r5conf *conf);
extern void ppl_exit_log(struct r5conf *conf);
extern int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
extern void ppl_write_stripe_run(struct r5conf *conf);
extern void ppl_stripe_write_finished(struct stripe_head *sh);
extern int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);

static inline bool raid5_has_ppl(struct r5conf *conf)
{
return test_bit(MD_HAS_PPL, &conf->mddev->flags);
}

static inline int log_stripe(struct stripe_head *sh, struct stripe_head_state *s)
{
struct r5conf *conf = sh->raid_conf;

if (conf->log) {
if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
/* writing out phase */
if (s->waiting_extra_page)
return 0;
return r5l_write_stripe(conf->log, sh);
} else if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
/* caching phase */
return r5c_cache_data(conf->log, sh);
}
} else if (raid5_has_ppl(conf)) {
return ppl_write_stripe(conf, sh);
}

return -EAGAIN;
}

static inline void log_stripe_write_finished(struct stripe_head *sh)
{
struct r5conf *conf = sh->raid_conf;

if (conf->log)
r5l_stripe_write_finished(sh);
else if (raid5_has_ppl(conf))
ppl_stripe_write_finished(sh);
}

static inline void log_write_stripe_run(struct r5conf *conf)
{
if (conf->log)
r5l_write_stripe_run(conf->log);
else if (raid5_has_ppl(conf))
ppl_write_stripe_run(conf);
}

static inline void log_exit(struct r5conf *conf)
{
if (conf->log)
r5l_exit_log(conf);
else if (raid5_has_ppl(conf))
ppl_exit_log(conf);
}

static inline int log_init(struct r5conf *conf, struct md_rdev *journal_dev,
bool ppl)
{
if (journal_dev)
return r5l_init_log(conf, journal_dev);
else if (ppl)
return ppl_init_log(conf);

return 0;
}

static inline int log_modify(struct r5conf *conf, struct md_rdev *rdev, bool add)
{
if (raid5_has_ppl(conf))
return ppl_modify_log(conf, rdev, add);

return 0;
}

#endif
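The raid5.c changes themselves are among the suppressed diffs in this view, so the following is a hedged illustration only: call sites are expected to go through these log_* wrappers rather than checking for a journal or PPL themselves, and the wrappers route to the write-back journal (conf->log) or to PPL (MD_HAS_PPL). example_dispatch() and write_stripe_to_disks() are hypothetical names:

/* Hedged sketch of a caller, not taken from raid5.c. */
static void example_dispatch(struct stripe_head *sh, struct stripe_head_state *s)
{
        struct r5conf *conf = sh->raid_conf;

        if (log_stripe(sh, s) == -EAGAIN)
                write_stripe_to_disks(sh);      /* hypothetical: neither journal nor PPL took it */
        log_write_stripe_run(conf);             /* kick any queued journal/PPL IO */
}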
drivers/md/raid5-ppl.c (new file, 1271 lines)
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -224,10 +224,16 @@ struct stripe_head {
spinlock_t batch_lock; /* only header's lock is useful */
struct list_head batch_list; /* protected by head's batch lock*/

struct r5l_io_unit *log_io;
union {
struct r5l_io_unit *log_io;
struct ppl_io_unit *ppl_io;
};

struct list_head log_list;
sector_t log_start; /* first meta block on the journal */
struct list_head r5c; /* for r5c_cache->stripe_in_journal */

struct page *ppl_page; /* partial parity of this stripe */
/**
 * struct stripe_operations
 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -272,7 +278,6 @@ struct stripe_head_state {
int dec_preread_active;
unsigned long ops_request;

struct bio_list return_bi;
struct md_rdev *blocked_rdev;
int handle_bad_blocks;
int log_failed;
@@ -400,6 +405,7 @@ enum {
STRIPE_OP_BIODRAIN,
STRIPE_OP_RECONSTRUCT,
STRIPE_OP_CHECK,
STRIPE_OP_PARTIAL_PARITY,
};

/*
@@ -481,50 +487,6 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
return NULL;
}

/*
 * We maintain a biased count of active stripes in the bottom 16 bits of
 * bi_phys_segments, and a count of processed stripes in the upper 16 bits
 */
static inline int raid5_bi_processed_stripes(struct bio *bio)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

return (atomic_read(segments) >> 16) & 0xffff;
}

static inline int raid5_dec_bi_active_stripes(struct bio *bio)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

return atomic_sub_return(1, segments) & 0xffff;
}

static inline void raid5_inc_bi_active_stripes(struct bio *bio)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

atomic_inc(segments);
}

static inline void raid5_set_bi_processed_stripes(struct bio *bio,
unsigned int cnt)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
int old, new;

do {
old = atomic_read(segments);
new = (old & 0xffff) | (cnt << 16);
} while (atomic_cmpxchg(segments, old, new) != old);
}

static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
{
atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

atomic_set(segments, cnt);
}

/* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
 * This is because we sometimes take all the spinlocks
 * and creating that much locking depth can cause
@@ -542,6 +504,7 @@ struct r5worker {

struct r5worker_group {
struct list_head handle_list;
struct list_head loprio_list;
struct r5conf *conf;
struct r5worker *workers;
int stripes_cnt;
@@ -571,6 +534,14 @@ enum r5_cache_state {
*/
};

#define PENDING_IO_MAX 512
#define PENDING_IO_ONE_FLUSH 128
struct r5pending_data {
struct list_head sibling;
sector_t sector; /* stripe sector */
struct bio_list bios;
};

struct r5conf {
struct hlist_head *stripe_hashtbl;
/* only protect corresponding hash list and inactive_list */
@@ -608,10 +579,12 @@ struct r5conf {
*/

struct list_head handle_list; /* stripes needing handling */
struct list_head loprio_list; /* low priority stripes */
struct list_head hold_list; /* preread ready stripes */
struct list_head delayed_list; /* stripes that have plugged requests */
struct list_head bitmap_list; /* stripes delaying awaiting bitmap update */
struct bio *retry_read_aligned; /* currently retrying aligned bios */
unsigned int retry_read_offset; /* sector offset into retry_read_aligned */
struct bio *retry_read_aligned_list; /* aligned bios retry list */
atomic_t preread_active_stripes; /* stripes with scheduled io */
atomic_t active_aligned_reads;
@@ -621,9 +594,6 @@ struct r5conf {
int skip_copy; /* Don't copy data from bio to stripe cache */
struct list_head *last_hold; /* detect hold_list promotions */

/* bios to have bi_end_io called after metadata is synced */
struct bio_list return_bi;

atomic_t reshape_stripes; /* stripes with pending writes for reshape */
/* unfortunately we need two cache names as we temporarily have
 * two caches.
@@ -676,6 +646,7 @@ struct r5conf {
int pool_size; /* number of disks in stripeheads in pool */
spinlock_t device_lock;
struct disk_info *disks;
struct bio_set *bio_split;

/* When taking over an array from a different personality, we store
 * the new thread here until we fully activate the array.
@@ -686,10 +657,15 @@ struct r5conf {
int group_cnt;
int worker_cnt_per_group;
struct r5l_log *log;
void *log_private;

struct bio_list pending_bios;
spinlock_t pending_bios_lock;
bool batch_bio_dispatch;
struct r5pending_data *pending_data;
struct list_head free_list;
struct list_head pending_list;
int pending_data_cnt;
struct r5pending_data *next_pending_data;
};


@@ -765,34 +741,4 @@ extern struct stripe_head *
raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
int previous, int noblock, int noquiesce);
extern int raid5_calc_degraded(struct r5conf *conf);
extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
extern void r5l_exit_log(struct r5l_log *log);
extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
extern void r5l_write_stripe_run(struct r5l_log *log);
extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
extern void r5l_stripe_write_finished(struct stripe_head *sh);
extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
extern void r5l_quiesce(struct r5l_log *log, int state);
extern bool r5l_log_disk_error(struct r5conf *conf);
extern bool r5c_is_writeback(struct r5l_log *log);
extern int
r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s, int disks);
extern void
r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s);
extern void r5c_release_extra_page(struct stripe_head *sh);
extern void r5c_use_extra_page(struct stripe_head *sh);
extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
extern void r5c_handle_cached_data_endio(struct r5conf *conf,
struct stripe_head *sh, int disks, struct bio_list *return_bi);
extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
struct stripe_head_state *s);
extern void r5c_make_stripe_write_out(struct stripe_head *sh);
extern void r5c_flush_cache(struct r5conf *conf, int num);
extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
extern void r5c_check_cached_full_stripe(struct r5conf *conf);
extern struct md_sysfs_entry r5c_journal_mode;
extern void r5c_update_on_rdev_error(struct mddev *mddev);
extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
#endif
@@ -183,7 +183,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,

#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)

static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
static inline unsigned bio_segments(struct bio *bio)
{
unsigned segs = 0;
struct bio_vec bv;
@@ -205,17 +205,12 @@ static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
break;
}

__bio_for_each_segment(bv, bio, iter, *bvec)
bio_for_each_segment(bv, bio, iter)
segs++;

return segs;
}

static inline unsigned bio_segments(struct bio *bio)
{
return __bio_segments(bio, &bio->bi_iter);
}

/*
 * get a reference to a bio, so it won't disappear. the intended use is
 * something like:
@@ -389,8 +384,6 @@ extern void bio_put(struct bio *);
extern void __bio_clone_fast(struct bio *, struct bio *);
extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);
extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
extern struct bio *bio_clone_bioset_partial(struct bio *, gfp_t,
struct bio_set *, int, int);

extern struct bio_set *fs_bio_set;

@@ -99,6 +99,7 @@ int __must_check percpu_ref_init(struct percpu_ref *ref,
void percpu_ref_exit(struct percpu_ref *ref);
void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
percpu_ref_func_t *confirm_switch);
void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
percpu_ref_func_t *confirm_kill);
@@ -242,10 +242,18 @@ struct mdp_superblock_1 {

__le32 chunksize; /* in 512byte sectors */
__le32 raid_disks;
__le32 bitmap_offset; /* sectors after start of superblock that bitmap starts
 * NOTE: signed, so bitmap can be before superblock
 * only meaningful of feature_map[0] is set.
 */
union {
__le32 bitmap_offset; /* sectors after start of superblock that bitmap starts
 * NOTE: signed, so bitmap can be before superblock
 * only meaningful of feature_map[0] is set.
 */

/* only meaningful when feature_map[MD_FEATURE_PPL] is set */
struct {
__le16 offset; /* sectors from start of superblock that ppl starts (signed) */
__le16 size; /* ppl size in sectors */
} ppl;
};

/* These are only valid with feature bit '4' */
__le32 new_level; /* new level we are reshaping to */
@@ -318,6 +326,7 @@ struct mdp_superblock_1 {
*/
#define MD_FEATURE_CLUSTERED 256 /* clustered MD */
#define MD_FEATURE_JOURNAL 512 /* support write cache */
#define MD_FEATURE_PPL 1024 /* support PPL */
#define MD_FEATURE_ALL (MD_FEATURE_BITMAP_OFFSET \
|MD_FEATURE_RECOVERY_OFFSET \
|MD_FEATURE_RESHAPE_ACTIVE \
@@ -328,6 +337,7 @@ struct mdp_superblock_1 {
|MD_FEATURE_RECOVERY_BITMAP \
|MD_FEATURE_CLUSTERED \
|MD_FEATURE_JOURNAL \
|MD_FEATURE_PPL \
)

struct r5l_payload_header {
@@ -388,4 +398,31 @@ struct r5l_meta_block {

#define R5LOG_VERSION 0x1
#define R5LOG_MAGIC 0x6433c509

struct ppl_header_entry {
__le64 data_sector; /* raid sector of the new data */
__le32 pp_size; /* length of partial parity */
__le32 data_size; /* length of data */
__le32 parity_disk; /* member disk containing parity */
__le32 checksum; /* checksum of partial parity data for this
 * entry (~crc32c) */
} __attribute__ ((__packed__));

#define PPL_HEADER_SIZE 4096
#define PPL_HDR_RESERVED 512
#define PPL_HDR_ENTRY_SPACE \
(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(__le32) - sizeof(__le64))
#define PPL_HDR_MAX_ENTRIES \
(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))

struct ppl_header {
__u8 reserved[PPL_HDR_RESERVED];/* reserved space, fill with 0xff */
__le32 signature; /* signature (family number of volume) */
__le32 padding; /* zero pad */
__le64 generation; /* generation number of the header */
__le32 entries_count; /* number of entries in entry array */
__le32 checksum; /* checksum of the header (~crc32c) */
struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
} __attribute__ ((__packed__));

#endif
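For reference, the definitions above work out as follows: a packed ppl_header_entry is 8 + 4 + 4 + 4 + 4 = 24 bytes, the fixed ppl_header fields take 512 + 4*4 + 8 = 536 bytes, so PPL_HDR_ENTRY_SPACE is 4096 - 536 = 3560 bytes and PPL_HDR_MAX_ENTRIES is 3560 / 24 = 148, which keeps the whole header inside a single 4 KiB PPL_HEADER_SIZE block. A small compile-time check along these lines (illustrative only, not part of the commit; assumes BUILD_BUG_ON from linux/bug.h):

/* Illustrative sanity check of the PPL header layout described above. */
static inline void ppl_header_layout_check(void)
{
        BUILD_BUG_ON(sizeof(struct ppl_header_entry) != 24);
        BUILD_BUG_ON(sizeof(struct ppl_header) > PPL_HEADER_SIZE);
}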
@@ -260,6 +260,22 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,

spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);

/**
 * percpu_ref_switch_to_atomic_sync - switch a percpu_ref to atomic mode
 * @ref: percpu_ref to switch to atomic mode
 *
 * Schedule switching the ref to atomic mode, and wait for the
 * switch to complete. Caller must ensure that no other thread
 * will switch back to percpu mode.
 */
void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
{
percpu_ref_switch_to_atomic(ref, NULL);
wait_event(percpu_ref_switch_waitq, !ref->confirm_switch);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);

/**
 * percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode
@@ -290,6 +306,7 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)

spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);

/**
 * percpu_ref_kill_and_confirm - drop the initial ref and schedule confirmation
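This new export is what allows md to collapse mddev->writes_pending (now a percpu_ref, see the md.h hunk earlier) into a single atomic count before inspecting it. A hedged sketch of how such a caller can use it; the md-specific code is not shown in this excerpt and example_no_writers() is a made-up name:

/* Hedged sketch: get a stable zero-check on a per-cpu reference. */
static bool example_no_writers(struct percpu_ref *writes_pending)
{
        /* fold the per-cpu counters into the atomic count and wait for the switch */
        percpu_ref_switch_to_atomic_sync(writes_pending);

        /* percpu_ref_is_zero() now gives a reliable answer */
        return percpu_ref_is_zero(writes_pending);
}

A caller would typically switch the reference back with percpu_ref_switch_to_percpu() once the check is done, to keep the fast path cheap.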