linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-19 18:24:14 +08:00

Author	SHA1	Message	Date
NeilBrown	50512625da	Revert "block: introduce bio_copy_data_partial" This reverts commit `6f8802852f`. bio_copy_data_partial() is no longer needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-11 10:09:03 -07:00
NeilBrown	cb83efcfd2	md/raid1: simplify alloc_behind_master_bio() Now that we always always pass an offset of 0 and a size that matches the bio to alloc_behind_master_bio(), we can remove the offset/size args and simplify the code. We could probably remove bio_copy_data_partial() too. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-11 10:08:47 -07:00
NeilBrown	c230e7e535	md/raid1: simplify the splitting of requests. raid1 currently splits requests in two different ways for two different reasons. First, bio_split() is used to ensure the bio fits within a resync accounting region. Second, multiple r1bios are allocated for each bio to handle the possiblity of known bad blocks on some devices. This can be simplified to just use bio_split() once, and not use multiple r1bios. We delay the split until we know a maximum bio size that can be handled with a single r1bio, and then split the bio and queue the remainder for later handling. This avoids all loops inside raid1.c request handling. Just a single read, or a single set of writes, is submitted to lower-level devices for each bio that comes from generic_make_request(). When the bio needs to be split, generic_make_request() will do the necessary looping and call md_make_request() multiple times. raid1_make_request() no longer queues request for raid1 to handle, so we can remove that branch from the 'if'. This patch also creates a new private bio_set (conf->bio_split) for splitting bios. Using fs_bio_set is wrong, as it is meant to be used by filesystems, not block devices. Using it inside md can lead to deadlocks under high memory pressure. Delete unused variable in raid1_write_request() (Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-11 10:07:27 -07:00
Artur Paszkiewicz	ae1713e296	raid5-ppl: partial parity calculation optimization In case of read-modify-write, partial partity is the same as the result of ops_run_prexor5(), so we can just copy sh->dev[pd_idx].page into sh->ppl_page instead of calculating it again. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 12:01:37 -07:00
Artur Paszkiewicz	845b9e229f	raid5-ppl: use resize_stripes() when enabling or disabling ppl Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate or free sh->ppl_page at runtime for all stripes in the stripe cache. raid5_reset_stripe_cache() required suspending the mddev and could deadlock because of GFP_KERNEL allocations. Move the 'newsize' check to check_reshape() to allow reallocating the stripes with the same number of disks. Allocate sh->ppl_page in alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as a parameter to alloc_stripe() because it is needed to check whether to allocate ppl_page. Add free_stripe() and use it to free stripes rather than directly call kmem_cache_free(). Also free sh->ppl_page in free_stripe(). Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly setting it in advance and add another parameter to log_init() to allow calling ppl_init_log() without the bit set. Don't try to calculate partial parity or add a stripe to log if it does not have ppl_page set. Enabling ppl can now be performed without suspending the mddev, because the log won't be used until new stripes are allocated with ppl_page. Calling mddev_suspend/resume is still necessary when disabling ppl, because we want all stripes to finish before stopping the log, but resize_stripes() can be called after mddev_resume() when ppl is no longer active. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 12:00:49 -07:00
Artur Paszkiewicz	94568f64af	raid5-ppl: move no_mem_stripes to struct ppl_conf Use a single no_mem_stripes list instead of per member device lists for handling stripes that need retrying in case of failed io_unit allocation. Because io_units are allocated from a memory pool shared between all member disks, the no_mem_stripes list should be checked when an io_unit for any member is freed. This fixes a deadlock that could happen if there are stripes in more than one no_mem_stripes list. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 12:00:27 -07:00
NeilBrown	0c9d5b127f	md/raid1: avoid reusing a resync bio after error handling. fix_sync_read_error() modifies a bio on a newly faulty device by setting bi_end_io to end_sync_write. This ensure that put_buf() will still call rdev_dec_pending() as required, but makes sure that subsequent code in fix_sync_read_error() doesn't try to read from the device. Unfortunately this interacts badly with sync_request_write() which assumes that any bio with bi_end_io set to non-NULL other than end_sync_read is safe to write to. As the device is now faulty it doesn't make sense to write. As the bio was recently used for a read, it is "dirty" and not suitable for immediate submission. In particular, ->bi_next might be non-NULL, which will cause generic_make_request() to complain. Break this interaction by refusing to write to devices which are marked as Faulty. Reported-and-tested-by: Michael Wang <yun.wang@profitbricks.com> Fixes: `2e52d449bc` ("md/raid1: add failfast handling for reads.") Cc: stable@vger.kernel.org (v4.10+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 11:05:26 -07:00
Zhilong Liu	b670883bb9	md.c:didn't unlock the mddev before return EINVAL in array_size_store md.c: it needs to release the mddev lock before the array_size_store() returns. Fixes: `ab5a98b132` ("md-cluster: change array_sectors and update size are not supported") Signed-off-by: Zhilong Liu <zlliu@suse.com> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 10:50:24 -07:00
NeilBrown	065e519e71	md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop if called md_set_readonly and set MD_CLOSING bit, the mddev cannot be opened any more due to the MD_CLOING bit wasn't cleared. Thus it needs to be cleared in md_ioctl after any call to md_set_readonly() or do_md_stop(). Signed-off-by: NeilBrown <neilb@suse.com> Fixes: `af8d8e6f03` ("md: changes for MD_STILL_CLOSED flag") Cc: stable@vger.kernel.org (v4.9+) Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 10:47:50 -07:00
Guoqing Jiang	6f287ca604	md/raid10: reset the 'first' at the end of loop We need to set "first = 0' at the end of rdev_for_each loop, so we can get the array's min_offset_diff correctly otherwise min_offset_diff just means the last rdev's offset diff. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 10:41:50 -07:00
NeilBrown	7471fb77ce	md/raid6: Fix anomily when recovering a single device in RAID6. When recoverying a single missing/failed device in a RAID6, those stripes where the Q block is on the missing device are handled a bit differently. In these cases it is easy to check that the P block is correct, so we do. This results in the P block be destroy. Consequently the P block needs to be read a second time in order to compute Q. This causes lots of seeks and hurts performance. It shouldn't be necessary to re-read P as it can be computed from the DATA. But we only compute blocks on missing devices, since `c337869d95` ("md: do not compute parity unless it is on a failed drive"). So relax the change made in that commit to allow computing of the P block in a RAID6 which it is the only missing that block. This makes RAID6 recovery run much faster as the disk just "before" the recovering device is no longer seeking back-and-forth. Reported-by-tested-by: Brad Campbell <lists2009@fnarfbargle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 10:35:27 -07:00
Dennis Yang	583da48e38	md: update slab_cache before releasing new stripes when stripes resizing When growing raid5 device on machine with small memory, there is chance that mdadm will be killed and the following bug report can be observed. The same bug could also be reproduced in linux-4.10.6. [57600.075774] BUG: unable to handle kernel NULL pointer dereference at (null) [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0 [57600.114678] Oops: 0002 [#1] SMP [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P O 4.2.8 #1 [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013 [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000 [57600.204963] RIP: 0010:[<ffffffff81a6aa87>] [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.213057] RSP: 0018:ffff880043073810 EFLAGS: 00010046 [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0 [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000 [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282 [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838 [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00 [57600.253999] FS: 00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000 [57600.262078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0 [57600.274942] Stack: [57600.276949] ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f [57600.284383] ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98 [57600.291820] ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968 [57600.299254] Call Trace: [57600.301698] [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0 [57600.307523] [<ffffffff81119043>] ? __page_cache_release+0x23/0x110 [57600.313779] [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0 [57600.319344] [<ffffffff81579942>] drop_one_stripe+0x62/0x90 [57600.324915] [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0 [57600.330563] [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250 [57600.336650] [<ffffffff8111e38c>] shrink_zone+0x23c/0x250 [57600.342039] [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420 [57600.348210] [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0 [57600.353959] [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0 [57600.360303] [<ffffffff8157a30b>] check_reshape+0x62b/0x770 [57600.365866] [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0 [57600.371778] [<ffffffff81583df7>] update_raid_disks+0xc7/0x110 [57600.377604] [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10 [57600.382827] [<ffffffff81385380>] blkdev_ioctl+0x170/0x690 [57600.388307] [<ffffffff81195238>] block_ioctl+0x38/0x40 [57600.393525] [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480 [57600.399010] [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0 [57600.404400] [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70 [57600.409447] [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d [57600.435460] RIP [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.441208] RSP <ffff880043073810> [57600.444690] CR2: 0000000000000000 [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]--- The problem is that resize_stripes() releases new stripe_heads before assigning new slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called after resize_stripes() starting releasing new stripes but right before new slab cache being assigned, it is possible that these new stripe_heads will be freed with the old slab_cache which was already been destoryed and that triggers this bug. Signed-off-by: Dennis Yang <dennisyang@qnap.com> Fixes: `edbe83ab4c` ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org (4.1+) Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-04-10 09:27:12 -07:00
Ming Lei	8fc04e6ea0	md: raid1: kill warning on powerpc_pseries This patch kills the warning reported on powerpc_pseries, and actually we don't need the initialization. After merging the md tree, today's linux-next build (powerpc pseries_le_defconfig) produced this warning: drivers/md/raid1.c: In function 'raid1d': drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized] if (memcmp(page_address(ppages[j]), ^ drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here int page_len[RESYNC_PAGES]; ^ Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-28 08:49:52 -07:00
Song Liu	0bb0c10500	md/raid5: use consistency_policy to remove journal feature When journal device of an array fails, the array is forced into read-only mode. To make the array normal without adding another journal device, we need to remove journal _feature_ from the array. This patch allows remove journal _feature_ from an array, For journal existing journal should be either missing or faulty. To remove journal feature, it is necessary to remove the journal device first: mdadm --fail /dev/md0 /dev/sdb mdadm: set /dev/sdb faulty in /dev/md0 mdadm --remove /dev/md0 /dev/sdb mdadm: hot removed /dev/sdb from /dev/md0 Then the journal feature can be removed by echoing into the sysfs file: cat /sys/block/md0/md/consistency_policy journal echo resync > /sys/block/md0/md/consistency_policy cat /sys/block/md0/md/consistency_policy resync Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-27 12:02:33 -07:00
Jason Yan	1ad45a9bc4	md/raid5-cache: fix payload endianness problem in raid5-cache The payload->header.type and payload->size are little-endian, so just convert them to the right byte order. Signed-off-by: Jason Yan <yanaijie@huawei.com> Cc: <stable@vger.kernel.org> #v4.10+ Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-25 09:38:22 -07:00
Shaohua Li	41743c1f04	md/raid1: skip data copy for behind io for discard request discard request doesn't have data attached, so it's meaningless to allocate memory and copy from original bio for behind IO. And the copy is bogus because bio_copy_data_partial can't handle discard request. We don't support writesame/writezeros request so far. Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-25 09:38:06 -07:00
Shaohua Li	f45958756f	block: remove bio_clone_bioset_partial() commit c18a1e0(block: introduce bio_clone_bioset_partial()) introduced bio_clone_bioset_partial() for raid1 write behind IO. Now the write behind is rewritten by Ming. We don't need the API any more, so revert the commit. Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@fb.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-25 09:18:37 -07:00
Ming Lei	2d06e3b714	md: raid10: avoid direct access to bvec table in handle_reshape_read_error All reshape I/O share pages from 1st copy device, so just use that pages for avoiding direct access to bvec table in handle_reshape_read_error. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	cdb76be315	md: raid10: retrieve page from preallocated resync page array Now one page array is allocated for each resync bio, and we can retrieve page from this table directly. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	f025061836	md: raid10: don't use bio's vec table to manage resync pages Now we allocate one page array for managing resync pages, instead of using bio's vec table to do that, and the old way is very hacky and won't work any more if multipage bvec is enabled. The introduced cost is that we need to allocate (128 + 16) * copies bytes per r10_bio, and it is fine because the inflight r10_bio for resync shouldn't be much, as pointed by Shaohua. Also bio_reset() in raid10_sync_request() and reshape_request() are removed because all bios are freshly new now in these functions and not necessary to reset any more. This patch can be thought as cleanup too. Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	81fa152008	md: raid10: refactor code of read reshape's .bi_end_io reshape read request is a bit special and requires one extra bio which isn't allocated from r10buf_pool. Refactor the .bi_end_io for read reshape, so that we can use raid10's resync page mangement approach easily in the following patches. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	841c1316c7	md: raid1: improve write behind This patch improve handling of write behind in the following ways: - introduce behind master bio to hold all write behind pages - fast clone bios from behind master bio - avoid to change bvec table directly - use bio_copy_data() and make code more clean Suggested-by: Shaohua Li <shli@fb.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	d8c84c4f8b	md: raid1: move 'offset' out of loop The 'offset' local variable can't be changed inside the loop, so move it out. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	6f8802852f	block: introduce bio_copy_data_partial Turns out we can use bio_copy_data in raid1's write behind, and we can make alloc_behind_pages() more clean/efficient, but we need to partial version of bio_copy_data(). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	60928a91b0	md: raid1: use bio helper in process_checks() Avoid to direct access to bvec table. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	44cf0f4dc7	md: raid1: retrieve page from pre-allocated resync page array Now one page array is allocated for each resync bio, and we can retrieve page from this table directly. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	98d30c5812	md: raid1: don't use bio's vec table to manage resync pages Now we allocate one page array for managing resync pages, instead of using bio's vec table to do that, and the old way is very hacky and won't work any more if multipage bvec is enabled. The introduced cost is that we need to allocate (128 + 16) * raid_disks bytes per r1_bio, and it is fine because the inflight r1_bio for resync shouldn't be much, as pointed by Shaohua. Also the bio_reset() in raid1_sync_request() is removed because all bios are freshly new now and not necessary to reset any more. This patch can be thought as a cleanup too Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	a7234234d0	md: raid1: simplify r1buf_pool_free() This patch gets each page's reference of each bio for resync, then r1buf_pool_free() gets simplified a lot. The same policy has been taken in raid10's buf pool allocation/free too. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	513e2faa01	md: prepare for managing resync I/O pages in clean way Now resync I/O use bio's bec table to manage pages, this way is very hacky, and may not work any more once multipage bvec is introduced. So introduce helpers and new data structure for managing resync I/O pages more cleanly. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	d8e29fbc3b	md: move two macros into md.h Both raid1 and raid10 share common resync block size and page count, so move them into md.h. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	c85ba149de	md: raid1/raid10: don't handle failure of bio_add_page() All bio_add_page() is for adding one page into resync bio, which is big enough to hold RESYNC_PAGES pages, and the current bio_add_page() doesn't check queue limit any more, so it won't fail at all. remove unused label (shaohua) Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Zhilong Liu	3560741e31	md: fix several trivial typos in comments Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-23 22:54:57 -07:00
Guoqing Jiang	27f26a0f37	md/raid10: refactor some codes from raid10_write_request Previously, we clone both bio and repl_bio in raid10_write_request, then add the cloned bio to plug->pending or conf->pending_bio_list based on plug or not, and most of the logics are same for the two conditions. So introduce raid10_write_one_disk for it, and use replacement parameter to distinguish the difference. No functional changes in the patch. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-23 22:42:14 -07:00
Dan Carpenter	0b408baf7f	raid5-ppl: silence a misleading warning message The "need_cache_flush" variable is never set to false. When the variable is true that means we print a warning message at the end of the function. Fixes: `3418d036c8` ("raid5-ppl: Partial Parity Log write logging implementation") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-23 22:38:46 -07:00
NeilBrown	4ad23a9764	MD: use per-cpu counter for writes_pending The 'writes_pending' counter is used to determine when the array is stable so that it can be marked in the superblock as "Clean". Consequently it needs to be updated frequently but only checked for zero occasionally. Recent changes to raid5 cause the count to be updated even more often - once per 4K rather than once per bio. This provided justification for making the updates more efficient. So we replace the atomic counter a percpu-refcount. This can be incremented and decremented cheaply most of the time, and can be switched to "atomic" mode when more precise counting is needed. As it is possible for multiple threads to want a precise count, we introduce a "sync_checker" counter to count the number of threads in "set_in_sync()", and only switch the refcount back to percpu mode when that is zero. We need to be careful about races between set_in_sync() setting ->in_sync to 1, and md_write_start() setting it to zero. md_write_start() holds the rcu_read_lock() while checking if the refcount is in percpu mode. If it is, then we know a switch to 'atomic' will not happen until after we call rcu_read_unlock(), in which case set_in_sync() will see the elevated count, and not set in_sync to 1. If it is not in percpu mode, we take the mddev->lock to ensure proper synchronization. It is no longer possible to quickly check if the count is zero, which we previously did to update a timer or to schedule the md_thread. So now we do these every time we decrement that counter, but make sure they are fast. mod_timer() already optimizes the case where the timeout value doesn't actually change. We leverage that further by always rounding off the jiffies to the timeout value. This may delay the marking of 'clean' slightly, but ensure we only perform atomic operation here when absolutely needed. md_wakeup_thread() current always calls wake_up(), even if THREAD_WAKEUP is already set. That too can be optimised to avoid calls to wake_up(). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:56 -07:00
NeilBrown	210f7cdcf0	percpu-refcount: support synchronous switch to atomic mode. percpu_ref_switch_to_atomic_sync() schedules the switch to atomic mode, then waits for it to complete. Also export percpu_ref_switch_to_* so they can be used from modules. This will be used in md/raid to count the number of pending write requests to an array. We occasionally need to check if the count is zero, but most often we don't care. We always want updates to the counter to be fast, as in some cases we count every 4K page. Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:43 -07:00
NeilBrown	55cc39f345	md: close a race with setting mddev->in_sync If ->in_sync is being set just as md_write_start() is being called, it is possible that set_in_sync() won't see the elevated ->writes_pending, and md_write_start() won't see the set ->in_sync. To close this race, re-test ->writes_pending after setting ->in_sync, and add memory barriers to ensure the increment of ->writes_pending will be seen by the time of this second test, or the new ->in_sync will be seen by md_write_start(). Add a spinlock to array_state_show() to ensure this temporary instability is never visible from userspace. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:30 -07:00
NeilBrown	6497709b5d	md: factor out set_in_sync() Three separate places in md.c check if the number of active writes is zero and, if so, sets mddev->in_sync. There are a few differences, but there shouldn't be: - it is always appropriate to notify the change in sysfs_state, and there is no need to do this outside a spin-locked region. - we never need to check ->recovery_cp. The state of resync is not relevant for whether there are any pending writes or not (which is what ->in_sync reports). So create set_in_sync() which does the correct tests and makes the correct changes, and call this in all three places. Any behaviour changes here a minor and cosmetic. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:18 -07:00
NeilBrown	84dd97a690	md/raid5: don't test ->writes_pending in raid5_remove_disk This test on ->writes_pending cannot be safe as the counter can be incremented at any moment and cannot be locked against. Change it to test conf->active_stripes, which at least can be locked against. More changes are still needed. A future patch will change ->writes_pending, and testing it here will be very inconvenient. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:05 -07:00
NeilBrown	37011e3afb	md/raid1: stop using bi_phys_segment Change to use bio->__bi_remaining to count number of r1bio attached to a bio. See precious raid10 patch for more details. Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending used to measure different things, but were being compared. This patch fixes another bug in that nr_pending previously did not could write-behind requests, so behind writes could continue while resync was happening. How that nr_pending counts all r1_bio, the resync cannot commence until the behind writes have completed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:17:53 -07:00
NeilBrown	fd16f2e848	md/raid10: stop using bi_phys_segments raid10 currently repurposes bi_phys_segments on each incoming bio to count how many r10bio was used to encode the request. We need to know when the number of attached r10bio reaches zero to: 1/ call bio_endio() when all IO on the bio is finished 2/ decrement ->nr_pending so that resync IO can proceed. Now that the bio has its own __bi_remaining counter, that can be used instead. We can call bio_inc_remaining to increment the counter and call bio_endio() every time an r10bio completes, rather than only when bi_phys_segments reaches zero. This addresses point 1, but not point 2. bio_endio() doesn't (and cannot) report when the last r10bio has finished, so a different approach is needed. So: instead of counting bios in ->nr_pending, count r10bios. i.e. every time we attach a bio, increment nr_pending. Every time an r10bio completes, decrement nr_pending. Normally we only increment nr_pending after first checking that ->barrier is zero, or some other non-trivial tests and possible waiting. When attaching multiple r10bios to a bio, we only need the tests and the waiting once. After the first increment, subsequent increments can happen unconditionally as they are really all part of the one request. So introduce inc_pending() which can be used when we know that nr_pending is already elevated. Note that this fixes a bug. freeze_array() contains the line atomic_read(&conf->nr_pending) == conf->nr_queued+extra, which implies that the units for ->nr_pending, ->nr_queued and extra are the same. ->nr_queue and extra count r10_bios, but prior to this patch, ->nr_pending counted bios. If a bio ever resulted in multiple r10_bios (due to bad blocks), freeze_array() would not work correctly. Now it does. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:17:41 -07:00
NeilBrown	6b6c8110e1	md/raid1, raid10: move rXbio accounting closer to allocation. When raid1 or raid10 find they will need to allocate a new r1bio/r10bio, in order to work around a known bad block, they account for the allocation well before the allocation is made. This separation makes the correctness less obvious and requires comments. The accounting needs to be a little before: before the first rXbio is submitted, but that is all. So move the accounting down to where it makes more sense. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:17:24 -07:00
NeilBrown	97d5343808	Revert "md/raid5: limit request size according to implementation limits" This reverts commit `e8d7c33232`. Now that raid5 doesn't abuse bi_phys_segments any more, we no longer need to impose these limits. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:17:12 -07:00
NeilBrown	0472a42ba1	md/raid5: remove over-loading of ->bi_phys_segments. When a read request, which bypassed the cache, fails, we need to retry it through the cache. This involves attaching it to a sequence of stripe_heads, and it may not be possible to get all the stripe_heads we need at once. We do what we can, and record how far we got in ->bi_phys_segments so we can pick up again later. There is only ever one bio which may have a non-zero offset stored in ->bi_phys_segments, the one that is either active in the single thread which calls retry_aligned_read(), or is in conf->retry_read_aligned waiting for retry_aligned_read() to be called again. So we only need to store one offset value. This can be in a local variable passed between remove_bio_from_retry() and retry_aligned_read(), or in the r5conf structure next to the ->retry_read_aligned pointer. Storing it there allows the last usage of ->bi_phys_segments to be removed from md/raid5.c. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:56 -07:00
NeilBrown	016c76ac76	md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter md/raid5 needs to keep track of how many stripe_heads are processing a bio so that it can delay calling bio_endio() until all stripe_heads have completed. It currently uses 16 bits of ->bi_phys_segments for this purpose. 16 bits is only enough for 256M requests, and it is possible for a single bio to be larger than this, which causes problems. Also, the bio struct contains a larger counter, __bi_remaining, which has a purpose very similar to the purpose of our counter. So stop using ->bi_phys_segments, and instead use __bi_remaining. This means we don't need to initialize the counter, as our caller initializes it to '1'. It also means we can call bio_endio() directly as it tests this counter internally. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:30 -07:00
NeilBrown	bd83d0a28c	md/raid5: call bio_endio() directly rather than queueing for later. We currently gather bios that need to be returned into a bio_list and call bio_endio() on them all together. The original reason for this was to avoid making the calls while holding a spinlock. Locking has changed a lot since then, and that reason is no longer valid. So discard return_io() and various return_bi lists, and just call bio_endio() directly as needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:12 -07:00
NeilBrown	16d997b78b	md/raid5: simplfy delaying of writes while metadata is updated. If a device fails during a write, we must ensure the failure is recorded in the metadata before the completion of the write is acknowleged. Commit `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns.") added code for this, but it was unnecessarily complicated. We already had similar functionality for handling updates to the bad-block-list, thanks to Commit `de393cdea6` ("md: make it easier to wait for bad blocks to be acknowledged.") So revert most of the former commit, and instead avoid collecting completed writes if MD_CHANGE_PENDING is set. raid5d() will then flush the metadata and retry the stripe_head. As this change can leave a stripe_head ready for handling immediately after handle_active_stripes() returns, we change raid5_do_work() to pause when MD_CHANGE_PENDING is set, so that it doesn't spin. We check MD_CHANGE_PENDING after analyse_stripe() as it could be set asynchronously. After analyse_stripe(), we have collected stable data about the state of devices, which will be used to make decisions. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:15:57 -07:00
NeilBrown	497280509f	md/raid5: use md_write_start to count stripes, not bios We use md_write_start() to increase the count of pending writes, and md_write_end() to decrement the count. We currently count bios submitted to md/raid5. Change it count stripe_heads that a WRITE bio has been attached to. So now, raid5_make_request() calls md_write_start() and then md_write_end() to keep the count elevated during the setup of the request. add_stripe_bio() calls md_write_start() for each stripe_head, and the completion routines always call md_write_end(), instead of only calling it when raid5_dec_bi_active_stripes() returns 0. make_discard_request also calls md_write_start/end(). The parallel between md_write_{start,end} and use of bi_phys_segments can be seen in that: Whenever we set bi_phys_segments to 1, we now call md_write_start. Whenever we increment it on non-read requests with raid5_inc_bi_active_stripes(), we now call md_write_start(). Whenever we decrement bi_phys_segments on non-read requsts with raid5_dec_bi_active_stripes(), we now call md_write_end(). This reduces our dependence on keeping a per-bio count of active stripes in bi_phys_segments. md_write_inc() is added which parallels md_write_start(), but requires that a write has already been started, and is certain never to sleep. This can be used inside a spinlocked region when adding to a write request. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:15:42 -07:00
Guoqing Jiang	48df498daf	md: move bitmap_destroy to the beginning of __md_stop Since we have switched to sync way to handle METADATA_UPDATED msg for md-cluster, then process_metadata_update is depended on mddev->thread->wqueue. With the new change, clustered raid could possible hang if array received a METADATA_UPDATED msg after array unregistered mddev->thread, so we need to stop clustered raid (bitmap_destroy -> bitmap_free -> md_cluster_stop) earlier than unregister thread (mddev_detach -> md_unregister_thread). And this change should be safe for non-clustered raid since all writes are stopped before the destroy. Also in md_run, we activate the personality (pers->run()) before activating the bitmap (bitmap_create()). So it is pleasingly symmetric to stop the bitmap (bitmap_destroy()) before stopping the personality (__md_stop() calls pers->free()), we achieve this by move bitmap_destroy to the beginning of __md_stop. But we don't want to break the codes for waiting behind IO as Shaohua mentioned, so introduce bitmap_wait_behind_writes to call the codes, and call the new fun in both mddev_detach and bitmap_destroy, then we will not break original behind IO code and also fit the new condition well. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:58 -07:00
Song Liu	ea17481fb4	md/r5cache: generate R5LOG_PAYLOAD_FLUSH In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to log->current_io. Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in quiesce. Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload. However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH, which is simpler. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:57 -07:00

1 2 3 4 5 ...

662169 Commits