linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-20 02:34:23 +08:00

Author	SHA1	Message	Date
Guoqing Jiang	15858fa5b0	md-cluster: Defer MD reloading to mddev->thread Reloading of superblock must be performed under reconfig_mutex. However, this cannot be done with md_reload_sb because it would deadlock with the message DLM lock. So, we defer it in md_check_recovery() which is executed by mddev->thread. This introduces a new flag, MD_RELOAD_SB, which if set, will reload the superblock. And good_device_nr is also added to 'struct mddev' which is used to get the num of the good device within cluster raid. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:10 +11:00
Guoqing Jiang	f6a2dc64ee	md-cluster: append some actions when change bitmap from clustered to none For clustered raid, we need to do extra actions when change bitmap to none. 1. check if all the bitmap lock could be get or not, if yes then we can continue the change since cluster raid is only active in current node. Otherwise return fail and unlock the related bitmap locks 2. set nodes to 0 and then leave cluster environment. 3. release other nodes's bitmap lock. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:57 +11:00
Goldwyn Rodrigues	09afd2a8d6	md-cluster: Allow spare devices to be marked as faulty If a spare device was marked faulty, it would not be reflected in receiving nodes because it would mark it as activated and continue. Continue the operation, so it may be set as faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:51 +11:00
Goldwyn Rodrigues	54a88392cd	md-cluster: Fix the remove sequence with the new MD reload code The remove disk message does not need metadata_update_start(), but can be an independent message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:42 +11:00
Guoqing Jiang	659b254fa7	md-cluster: remove a disk asynchronously from cluster environment For cluster raid, if one disk couldn't be reach in one node, then other nodes would receive the REMOVE message for the disk. In receiving node, we can't call md_kick_rdev_from_array to remove the disk from array synchronously since the disk might still be busy in this node. So let's set a ClusterRemove flag on the disk, then let the thread to do the removal job eventually. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:36 +11:00
Goldwyn Rodrigues	ac277c6a8a	md-cluster: Avoid the resync ping-pong If a RESYNCING message with (0,0) has been sent before, do not send it again. This avoids a resync ping pong between the nodes. We read the bitmap lockresource's LVB to figure out the previous value of the RESYNCING message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:27 +11:00
Roman Gushchin	b46020aa3a	md/raid5: remove redundant check in stripe_add_to_batch_list() The stripe_add_to_batch_list() function is called only if stripe_can_batch() returned true, so there is no need for double check. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Neil Brown <neilb@suse.com> Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:22 +11:00
Al Viro	93bbf5831d	md: more open-coded offset_in_page() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:12 -05:00
Al Viro	756d097b95	dm-bufio: virt_to_phys() doesn't change remainder modulo PAGE_SIZE ... so virt_to_phys(p) & (PAGE_SIZE - 1) is a very odd way to spell offset_in_page(p). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:07 -05:00
Kent Overstreet	627ccd20b4	bcache: Change refill_dirty() to always scan entire disk if necessary Previously, it would only scan the entire disk if it was starting from the very start of the disk - i.e. if the previous scan got to the end. This was broken by refill_full_stripes(), which updates last_scanned so that refill_dirty was never triggering the searched_from_start path. But if we change refill_dirty() to always scan the entire disk if necessary, regardless of what last_scanned was, the code gets cleaner and we fix that bug too. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:16 -07:00
Stefan Bader	8d16ce540c	bcache: prevent crash on changing writeback_running Added a safeguard in the shutdown case. At least while not being attached it is also possible to trigger a kernel bug by writing into writeback_running. This change adds the same check before trying to wake up the thread for that case. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:14 -07:00
Gabriel de Perthuis	d7076f2162	bcache: allows use of register in udev to avoid "device_busy" error. Allows to use register, not register_quiet in udev to avoid "device_busy" error. The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis <g2p.code@gmail.com> does not unlock the mutex and hangs the kernel. See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion. Cc: Denis Bychkov <manover@gmail.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Gabriel de Perthuis <g2p.code@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:13 -07:00
Zheng Liu	2ecf0cdb2b	bcache: unregister reboot notifier if bcache fails to unregister device In bcache_init() function it forgot to unregister reboot notifier if bcache fails to unregister a block device. This commit fixes this. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:11 -07:00
Al Viro	4d4d8573a8	bcache: fix a leak in bch_cached_dev_run() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:10 -07:00
Zheng Liu	fecaee6f20	bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device This bug can be reproduced by the following script: #!/bin/bash bcache_sysfs="/sys/fs/bcache" function clear_cache() { if [ ! -e $bcache_sysfs ]; then echo "no bcache sysfs" exit fi cset_uuid=$(ls -l $bcache_sysfs\|head -n 2\|tail -n 1\|awk '{print $9}') sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach" sleep 5 sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach" } for ((i=0;i<10;i++)); do clear_cache done The warning messages look like below: [ 275.948611] ------------[ cut here ]------------ [ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W --------------- ) [ 275.979253] Hardware name: Tecal RH2285 [ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache' [ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.089315] Call Trace: [ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0 [ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170 [ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20 [ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache] [ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]--- [ 276.338241] ------------[ cut here ]------------ [ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720 bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- ) [ 276.386017] Hardware name: Tecal RH2285 [ 276.401430] Couldn't create device <-> cache set symlinks [ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.482169] Call Trace: [ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache] [ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]--- We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach() function when we attach a backing device first time. After detaching this backing device, this flag will be true and sysfs_remove_link() isn't called in bcache_device_unlink(). Then when we attach this backing device again, sysfs_create_link() will return EEXIST error in bcache_device_link(). So the fix is trival and we clear this flag in bcache_device_link(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:08 -07:00
Kent Overstreet	c5f1e5adf9	bcache: Add a cond_resched() call to gc Signed-off-by: Takashi Iwai <tiwai@suse.de> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:06 -07:00
Zheng Liu	2ef9ccbfcb	bcache: fix a livelock when we cause a huge number of cache misses Subject : [PATCH v2] bcache: fix a livelock in btree lock Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM) This commit tries to fix a livelock in bcache. This livelock might happen when we causes a huge number of cache misses simultaneously. When we get a cache miss, bcache will execute the following path. ->cached_dev_make_request() ->cached_dev_read() ->cached_lookup() ->bch->btree_map_keys() ->btree_root() <------------------------ ->bch_btree_map_keys_recurse() \| ->cache_lookup_fn() \| ->cached_dev_cache_miss() \| ->bch_btree_insert_check_key() -\| [If btree->seq is not equal to seq + 1, we should return EINTR and traverse btree again.] In bch_btree_insert_check_key() function we first need to check upgrade flag (op->lock == -1), and when this flag is true we need to release read btree->lock and try to take write btree->lock. During taking and releasing this write lock, btree->seq will be monotone increased in order to prevent other threads modify this in cache miss (see btree.h:74). But if there are some cache misses caused by some requested, we could meet a livelock because btree->seq is always changed by others. Thus no one can make progress. This commit will try to take write btree->lock if it encounters a race when we traverse btree. Although it sacrifice the scalability but we can ensure that only one can modify the btree. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Joshua Schmid <jschmid@suse.com> Cc: Zhu Yanhai <zhu.yanhai@gmail.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:05 -07:00
NeilBrown	312045eef9	md: remove check for MD_RECOVERY_NEEDED in action_store. md currently doesn't allow a 'sync_action' such as 'reshape' to be set while MD_RECOVERY_NEEDED is set. This s a problem, particularly since commit `738a273806` as that can cause ->check_shape to call mddev_resume() which sets MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is very likely that MD_RECOVERY_NEEDED is still set. Testing for this flag is not really needed and is in any case very racy as it can be set at any moment - asynchronously. Any race between setting a sync_action and setting MD_RECOVERY_NEEDED must already be handled properly in some locked code, probably md_check_recovery(), so remove the test here. The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case so we should test it again after getting mddev_lock(). As this fixes a race and a regression which can cause 'reshape' to fail, it is suitable for -stable kernels since 4.1 Reported-by: Xiao Ni <xni@redhat.com> Fixes: `738a273806` ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-21 11:10:06 +11:00
Goldwyn Rodrigues	cb01c5496d	Fix remove_and_add_spares removes drive added as spare in slot_store Commit `2910ff17d1` introduced a regression which would remove a recently added spare via slot_store. Revert part of the patch which touches slot_store() and add the disk directly using pers->hot_add_disk() Fixes: `2910ff17d1` ("md: remove_and_add_spares() to activate specific rdev") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Mikulas Patocka	0dc10e50f2	md: fix bug due to nested suspend The patch `c7bfced9a6` committed to 4.4-rc causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with this BUG, the reason is that we attempt to suspend a device that is already suspended. See also https://bugzilla.redhat.com/show_bug.cgi?id=1283491 This patch fixes the bug by changing functions mddev_suspend and mddev_resume to always nest. The number of nested calls to mddev_nested_suspend is kept in the variable mddev->suspended. [neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend] kernel BUG at drivers/md/md.c:317! CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1 task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001000000000000001111 Not tainted r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810 r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810 r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4 r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001 r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1 r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000 r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001 sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390 IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748 CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000 ORIG_R28: 00000000000000b0 IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod] IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod] RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1] Backtrace: [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1] [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid] [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod] [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod] [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod] [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod] [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod] [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558 Fixes: `c7bfced9a6` ("md: suspend i/o during runtime blk_integrity_unregister") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Shaohua Li	9b15603dbd	MD: change journal disk role to disk 0 Neil pointed out setting journal disk role to raid_disks will confuse reshape if we support reshape eventually. Switching the role to 0 (we should be fine as long as the value >=0) and skip sysfs file creation to avoid error. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Artur Paszkiewicz	cc57858831	md/raid10: fix data corruption and crash during resync The commit `c31df25f20` ("md/raid10: make sync_request_write() call bio_copy_data()") replaced manual data copying with bio_copy_data() but it doesn't work as intended. The source bio (fbio) is already processed, so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt. Because of this, bio_copy_data() either does not copy anything, or worse, copies data from the ->bi_next bio if it is set. This causes wrong data to be written to drives during resync and sometimes lockups/crashes in bio_copy_data(): [ 517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319] [ 517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [ 517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1 [ 517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015 [ 517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000 [ 517.468529] RIP: 0010:[<ffffffff812e1888>] [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0 [ 517.478164] RSP: 0018:ffff880150dfbc98 EFLAGS: 00000246 [ 517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000 [ 517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0 [ 517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980 [ 517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000 [ 517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000 [ 517.525412] FS: 0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000 [ 517.534844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0 [ 517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 517.566144] Stack: [ 517.568626] ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000 [ 517.577659] 0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800 [ 517.586715] ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000 [ 517.595773] Call Trace: [ 517.598747] [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10] [ 517.605610] [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2 [ 517.611987] [<ffffffff814ff206>] md_thread+0x106/0x140 [ 517.618072] [<ffffffff810c1d80>] ? wait_woken+0x80/0x80 [ 517.624252] [<ffffffff814ff100>] ? super_1_load+0x520/0x520 [ 517.630817] [<ffffffff8109ef89>] kthread+0xc9/0xe0 [ 517.636506] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70 [ 517.643653] [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70 [ 517.649929] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70 Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: Shaohua Li <shli@kernel.org> Cc: stable@vger.kernel.org (v4.2+) Fixes: `c31df25f20` ("md/raid10: make sync_request_write() call bio_copy_data()") Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Nikolay Borisov	18d03e8c25	dm thin: fix race condition when destroying thin pool workqueue When a thin pool is being destroyed delayed work items are cancelled using cancel_delayed_work(), which doesn't guarantee that on return the delayed item isn't running. This can cause the work item to requeue itself on an already destroyed workqueue. Fix this by using cancel_delayed_work_sync() which guarantees that on return the work item is not running anymore. Fixes: `905e51b39a` ("dm thin: commit outstanding data every second") Fixes: `85ad643b7e` ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-17 15:47:20 -05:00
Mike Snitzer	512167788a	dm space map metadata: remove unused variable in brb_pop() Remove the unused struct block_op pointer that was inadvertantly introduced, via cut-and-paste of previous brb_op() code, as part of commit `50dd842ad`. (Cc'ing stable@ because commit `50dd842ad` did) Fixes: `50dd842ad` ("dm space map metadata: fix ref counting bug when bootstrapping a new space map") Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-14 09:26:01 -05:00
Sami Tolvanen	0cc37c2df4	dm verity: add ignore_zero_blocks feature If ignore_zero_blocks is enabled dm-verity will return zeroes for blocks matching a zero hash without validating the content. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:03 -05:00
Sami Tolvanen	a739ff3f54	dm verity: add support for forward error correction Add support for correcting corrupted blocks using Reed-Solomon. This code uses RS(255, N) interleaved across data and hash blocks. Each error-correcting block covers N bytes evenly distributed across the combined total data, so that each byte is a maximum distance away from the others. This makes it possible to recover from several consecutive corrupted blocks with relatively small space overhead. In addition, using verity hashes to locate erasures nearly doubles the effectiveness of error correction. Being able to detect corrupted blocks also improves performance, because only corrupted blocks need to corrected. For a 2 GiB partition, RS(255, 253) (two parity bytes for each 253-byte block) can correct up to 16 MiB of consecutive corrupted blocks if erasures can be located, and 8 MiB if they cannot, with 16 MiB space overhead. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:03 -05:00
Sami Tolvanen	bb4d73ac5e	dm verity: factor out verity_for_bv_block() verity_for_bv_block() will be re-used by optional dm-verity object. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:02 -05:00
Sami Tolvanen	ffa393807c	dm verity: factor out structures and functions useful to separate object Prepare for an optional verity object to make use of existing dm-verity structures and functions. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:01 -05:00
Sami Tolvanen	03045cbafa	dm verity: move dm-verity.c to dm-verity-target.c Prepare for extending dm-verity with an optional object. Follows the naming convention used by other DM targets (e.g. dm-cache and dm-era). Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:01 -05:00
Sami Tolvanen	753c1fd028	dm verity: separate function for parsing opt args Move optional argument parsing into a separate function to make it easier to add more of them without making verity_ctr even longer. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:00 -05:00
Sami Tolvanen	6dbeda3469	dm verity: clean up duplicate hashing code Handle dm-verity salting in one place to simplify the code. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:00 -05:00
Mike Snitzer	ba503835ad	dm btree: factor out need_insert() helper Eliminates code duplication within insert(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:59 -05:00
Anup Limbu	86a49e2dac	dm bufio: use BUG_ON instead of conditional call to BUG Signed-off-by: Anup Limbu <anuplimbu14@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:58 -05:00
Mikulas Patocka	86bad0c707	dm bufio: store stacktrace in buffers to help find buffer leaks The option DM_DEBUG_BLOCK_STACK_TRACING is moved from persistent-data directory to device mapper directory because it will now be used by persistent-data and bufio. When the option is enabled, each bufio buffer stores the stacktrace of the last dm_bufio_get(), dm_bufio_read() or dm_bufio_new() call that increased the hold count to 1. The buffer's stacktrace is printed if the buffer was not released before the bufio client is destroyed. When DM_DEBUG_BLOCK_STACK_TRACING is enabled, any bufio buffer leaks are considered warnings - i.e. the kernel continues afterwards. If not enabled, buffer leaks are considered BUGs and the kernel with crash. Reasoning on this disposition is: if we only ever warned on buffer leaks users would generally ignore them and the problematic code would never get fixed. Successfully used to find source of bufio leaks fixed with commit fce079f63c3 ("dm btree: fix bufio buffer leaks in dm_btree_del() error path"). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:58 -05:00
Mikulas Patocka	f98c8f7970	dm bufio: return NULL to improve code clarity A small code cleanup in new_read() - return NULL instead of b (although b is NULL at this point). This function is not returning pointer to the buffer, it is returning a pointer to the bufffer's data, thus it makes no sense to return the variable b. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:57 -05:00
Mikulas Patocka	313c9b9736	dm block manager: cleanup code that prints stacktrace There is no need to record stack trace and immediately print it. Just use dump_stack() to print the current stack. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:56 -05:00
Mikulas Patocka	fe3265b180	dm: don't save and restore bi_private Device mapper used the field bi_private to point to dm_target_io. However, since kernel 3.15, the bi_private field is unused, and so the targets do not need to save and restore this field. This patch removes code that saves and restores bi_private from dm-cache, dm-snapshot and dm-verity. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:56 -05:00
Joe Thornber	086fbbbda9	dm thin metadata: make dm_thin_find_mapped_range() atomic Refactor dm_thin_find_mapped_range() so that it takes the read lock on the metadata's lock; rather than relying on finer grained locking that is pushed down inside dm_thin_find_next_mapped_block() and dm_thin_find_block(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:55 -05:00
Joe Thornber	3d5f67332a	dm thin metadata: speed up discard of partially mapped volumes Use dm_btree_lookup_next() to more quickly discard partially mapped volumes. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:30:56 -05:00
Joe Thornber	ed8b45a367	dm btree: fix bufio buffer leaks in dm_btree_del() error path If dm_btree_del()'s call to push_frame() fails, e.g. due to btree_node_validator finding invalid metadata, the dm_btree_del() error path must unlock all frames (which have active dm-bufio buffers) that were pushed onto the del_stack. Otherwise, dm_bufio_client_destroy() will BUG_ON() because dm-bufio buffers have leaked, e.g.: device-mapper: bufio: leaked buffer 3, hold count 1, list 0 Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-10 10:30:18 -05:00
Joe Thornber	50dd842ad8	dm space map metadata: fix ref counting bug when bootstrapping a new space map When applying block operations (BOPs) do not remove them from the uncommitted BOP ring-buffer until after they've been applied -- in case we recurse. Also, perform BOP_INC operation, in dm_sm_metadata_create() and sm_metadata_extend(), in terms of the uncommitted BOP ring-buffer rather than using direct calls to sm_ll_inc(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-09 13:27:25 -05:00
Joe Thornber	49e99fc717	dm thin metadata: fix bug when taking a metadata snapshot When you take a metadata snapshot the btree roots for the mapping and details tree need to have their reference counts incremented so they persist for the lifetime of the metadata snap. The roots being incremented were those currently written in the superblock, which could possibly be out of date if concurrent IO is triggering new mappings, breaking of sharing, etc. Fix this by performing a commit with the metadata lock held while taking a metadata snapshot. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-09 13:18:12 -05:00
Masanari Iida	e3d132d123	treewide: Fix typos in printk This patch fix multiple spelling typos found in various part of kernel. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2015-12-08 14:59:19 +01:00
Joe Thornber	993ceab919	dm thin metadata: fix bug in dm_thin_remove_range() dm_btree_remove_leaves() only unmaps a contiguous region so we need a loop, in __remove_range(), to handle ranges that contain multiple regions. A new btree function, dm_btree_lookup_next(), is introduced which is more efficiently able to skip over regions of the thin device which aren't mapped. __remove_range() uses dm_btree_lookup_next() for each iteration of __remove_range()'s loop. Also, improve description of dm_btree_remove_leaves(). Fixes: `6550f075` ("dm thin metadata: add dm_thin_remove_range()") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-12-02 13:26:49 -05:00
Mike Snitzer	30ce6e1cc5	dm btree: fix leak of bufio-backed block in btree_split_sibling error path The block allocated at the start of btree_split_sibling() is never released if later insert_at() fails. Fix this by releasing the previously allocated bufio block using unlock_block(). Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-02 13:20:34 -05:00
Mike Snitzer	0fcb04d593	dm thin: fix regression in advertised discard limits When establishing a thin device's discard limits we cannot rely on the underlying thin-pool device's discard capabilities (which are inherited from the thin-pool's underlying data device) given that DM thin devices must provide discard support even when the thin-pool's underlying data device doesn't support discards. Users were exposed to this thin device discard limits regression if their thin-pool's underlying data device does _not_ support discards. This regression caused all upper-layers that called the blkdev_issue_discard() interface to not be able to issue discards to thin devices (because discard_granularity was 0). This regression wasn't caught earlier because the device-mapper-test-suite's extensive 'thin-provisioning' discard tests are only ever performed against thin-pool's with data devices that support discards. Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature rather than inferring whether or not a thin device's discard support should be enabled by looking at the thin-pool's discard_granularity. Fixes: `216076705` ("dm thin: disable discard support for thin devices if pool's is disabled") Reported-by: Mike Gerber <mike@sprachgewalt.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-11-23 14:54:46 -05:00
Mikulas Patocka	bcbd94ff48	dm crypt: fix a possible hang due to race condition on exit A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE), __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop(). It is possible that the processor reorders memory accesses so that kthread_should_stop() is executed before __set_current_state(). If such reordering happens, there is a possible race on thread termination: CPU 0: calls kthread_should_stop() it tests KTHREAD_SHOULD_STOP bit, returns false CPU 1: calls kthread_stop(cc->write_thread) sets the KTHREAD_SHOULD_STOP bit calls wake_up_process on the kernel thread, that sets the thread state to TASK_RUNNING CPU 0: sets __set_current_state(TASK_INTERRUPTIBLE) spin_unlock_irq(&cc->write_thread_wait.lock) schedule() - and the process is stuck and never terminates, because the state is TASK_INTERRUPTIBLE and wake_up_process on CPU 1 already terminated Fix this race condition by using a new flag DM_CRYPT_EXIT_THREAD to signal that the kernel thread should exit. The flag is set and tested while holding cc->write_thread_wait.lock, so there is no possibility of racy access to the flag. Also, remove the unnecessary set_task_state(current, TASK_RUNNING) following the schedule() call. When the process was woken up, its state was already set to TASK_RUNNING. Other kernel code also doesn't set the state to TASK_RUNNING following schedule() (for example, do_wait_for_common in completion.c doesn't do it). Fixes: `dc2676210c` ("dm crypt: offload writes to thread") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-11-19 13:38:30 -05:00
Junichi Nomura	43e43c9ea6	dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path In multipath_prepare_ioctl(), - pgpath is a path selected from available paths - m->queue_io is true if we cannot send a request immediately to paths, either because: * there is no available path * the path group needs activation (pg_init) - pg_init is not started - pg_init is still running - m->queue_if_no_path is true if the device is configured to queue I/O if there are no available paths If !pgpath && !m->queue_if_no_path, the handler should return -EIO. However in the course of refactoring the condition check has broken and returns success in that case. Since bdev points to the dm device itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and recurses until crash. You could reproduce the problem like this: # dmsetup create mp --table '0 1024 multipath 0 0 0 0' # sg_inq /dev/mapper/mp <crash> [ 172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268 [ 172.662843] PGD 19dd067 PUD 0 [ 172.666269] Thread overran stack, or stack corrupted [ 172.671808] Oops: 0000 [#1] SMP ... Fix the condition check with some clarifications. Fixes: `e56f81e0b0` ("dm: refactor ioctl handling") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-11-17 14:19:00 -05:00
Mike Snitzer	647a20d5ca	dm: do not reuse dm_blk_ioctl block_device input as local variable (Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for targets' .prepare_ioctl to fail if they go on to check the bdev for !NULL. Fixes: `e56f81e0b0` ("dm: refactor ioctl handling") Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-11-17 14:18:49 -05:00
Junichi Nomura	5bbbfdf685	dm: fix ioctl retry termination with signal dm-mpath retries ioctl, when no path is readily available and the device is configured to queue I/O in such a case. If you want to stop the retry before multipathd decides to turn off queueing mode, you could send signal for the process to exit from the loop. However the check of fatal signal has not carried along when commit `6c182cd88d` ("dm mpath: fix ioctl deadlock when no paths") moved the loop from dm-mpath to dm core. As a result, we can't terminate such a process in the retry loop. Easy reproducer of the situation is: # dmsetup create mp --table '0 1024 multipath 0 0 0 0' # dmsetup message mp 0 'queue_if_no_path' # sg_inq /dev/mapper/mp then you should be able to terminate sg_inq by pressing Ctrl+C. Fixes: `6c182cd88d` ("dm mpath: fix ioctl deadlock when no paths") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-11-17 14:04:32 -05:00
Mike Snitzer	172c238612	dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition A thin-pool that is in out-of-data-space (OODS) mode may transition back to write mode -- without the admin adding more space to the thin-pool -- if/when blocks are released (either by deleting thin devices or discarding provisioned blocks). But as part of the thin-pool's earlier transition to out-of-data-space mode the thin-pool may have set the 'error_if_no_space' flag to true if the no_space_timeout expires without more space having been made available. That implementation detail, of changing the pool's error_if_no_space setting, needs to be reset back to the default that the user specified when the thin-pool's table was loaded. Otherwise we'll drop the user requested behaviour on the floor when this out-of-data-space to write mode transition occurs. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Fixes: `2c43fd26e4` ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released") Cc: stable@vger.kernel.org	2015-11-16 09:36:08 -05:00
Linus Torvalds	3419b45039	Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block Pull block IO poll support from Jens Axboe: "Various groups have been doing experimentation around IO polling for (really) fast devices. The code has been reviewed and has been sitting on the side for a few releases, but this is now good enough for coordinated benchmarking and further experimentation. Currently O_DIRECT sync read/write are supported. A framework is in the works that allows scalable stats tracking so we can auto-tune this. And we'll add libaio support as well soon. Fow now, it's an opt-in feature for test purposes" * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block: direct-io: be sure to assign dio->bio_bdev for both paths directio: add block polling support NVMe: add blk polling support block: add block polling support blk-mq: return tag/queue combo in the make_request_fn handlers block: change ->make_request_fn() and users to return a queue cookie	2015-11-10 17:23:49 -08:00
Linus Torvalds	3934bbc044	config fix for md config dependency needed as md/raid5 now uses crc32c -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWP8lRAAoJEDnsnt1WYoG5AQ0P/REvKfJxP870gS6p5gowMYXN 1pwOdq9t2MeVkQk0Q5xBOZFGQI2TL2VZjaZdiSEKqaHgd3IOD/aGpl2exLN8nZM4 mNx+iD3QvEpmSA9mwlCe+V8vTuE0JeTpod9pXk+aQ0DIUx60dSmWh0Lp8ctr33oQ inlJ7kXFws6rG0xU0pOaSDM1hI0sC06Nyi2tvRSyZlZbBIMjZWorzJFUQuVX43rD 7cQJSQ8z2e2x3V7KXYtZf6Kxe+NzEltnq0OAfnDvjz3iw+a9qU6Qg7diAbcwOdyP m34/MHJGIYX9GBlTtVo3+j+h3ppaPpLfT4emeYUPTQCkMXg7J9Zbjkwso7OSkvnL kovtgIWRVzxFx/k7jhoWzNOsacOhTFz0E4aKDpOeH2cfU7k5H/siIjeIYcqfW3fU q8fJXMplHz3XRYp0JhR5DRJZSw87eQnkIRkvSy8wHx8KVoRi7KNm7fv/3Zi7FpaU XnbxQa6FrIoqbqeBQ666Rlkn+r2Ftmj50eudAhbj4/PBesRmt6otvr9uCK+b+adh ZI748BC2/IOOK34WNktRQCyS2C4b3VLwRVMHReD1xir34rGGcKO04Hci8T7Qb0nP uJBAxmE2zue0oE6d+nnrJS3BlEC2ZFKJGzRxKiLCU7nXXoGCznd++IIPRHAkGhZc MSMpaS2mqtdTNp4n+9y8 =dN6h -----END PGP SIGNATURE----- Merge tag 'md/4.4-rc0-fix' of git://neil.brown.name/md Pull config fix for md from Neil Brown: "New config dependency needed as md/raid5 now uses crc32c" * tag 'md/4.4-rc0-fix' of git://neil.brown.name/md: raid5-cache: add crc32c Kconfig dependency	2015-11-10 12:13:00 -08:00
Arnd Bergmann	14f09e2f9b	raid5-cache: add crc32c Kconfig dependency The recent change of the raid5-cache code to use crc32c instead of crc32 causes link errors when CONFIG_LIBCRC32C is disabled: drivers/built-in.o: In function crc32c' core.c:(.text+0x1c6060): undefined reference to `crc32c' This adds an explicit 'select' statement like all other users of this function do. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `5cb2fbd6ea` ("raid5-cache: use crc32c checksum") Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-09 09:09:52 +11:00
Linus Torvalds	ad804a0b2a	Merge branch 'akpm' (patches from Andrew) Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...	2015-11-07 14:32:45 -08:00
Linus Torvalds	75021d2859	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial updates from Jiri Kosina: "Trivial stuff from trivial tree that can be trivially summed up as: - treewide drop of spurious unlikely() before IS_ERR() from Viresh Kumar - cosmetic fixes (that don't really affect basic functionality of the driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek - various comment / printk fixes and updates all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: bcache: Really show state of work pending bit hwmon: applesmc: fix comment typos Kconfig: remove comment about scsi_wait_scan module class_find_device: fix reference to argument "match" debugfs: document that debugfs_remove*() accepts NULL and error values net: Drop unlikely before IS_ERR(_OR_NULL) mm: Drop unlikely before IS_ERR(_OR_NULL) fs: Drop unlikely before IS_ERR(_OR_NULL) drivers: net: Drop unlikely before IS_ERR(_OR_NULL) drivers: misc: Drop unlikely before IS_ERR(_OR_NULL) UBI: Update comments to reflect UBI_METAONLY flag pktcdvd: drop null test before destroy functions	2015-11-07 13:05:44 -08:00
Jens Axboe	dece16353e	block: change ->make_request_fn() and users to return a queue cookie No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>	2015-11-07 10:40:46 -07:00
Mel Gorman	d0164adc89	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Petr Mladek	8d090f4731	bcache: Really show state of work pending bit WORK_STRUCT_PENDING is a mask for testing the pending bit. test_bit() expects the number of the bit and we need to use WORK_STRUCT_PENDING_BIT there. Also work_data_bits() is defined in workqueues.h now. I have noticed this just by chance when looking how WORK_STRUCT_PENDING_BIT is used. The change is compile tested. Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2015-11-06 15:06:05 +01:00
Linus Torvalds	e0700ce709	- Revert a dm-multipath change that caused a regression for unprivledged users (e.g. kvm guests) that issued ioctls when a multipath device had no available paths. - Include Christoph's refactoring of DM's ioctl handling and add support for passing through persistent reservations with DM multipath. - All other changes are very simple cleanups. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJWOp04AAoJEMUj8QotnQNaFLsH/AhMEH/jI1ObOfy4J1Wy4rOx ujJT91uS/s0H3pc9cGKQYnuGpFkX6WWU4wMiabIyiTn4sAsoXaflfIGutivLiDJr HfecrMrGZgnP4ZlpPPB02BmlxFbcPW8yzAU4ma38xBgQ+Pu30RO/HkvX/2vKOppG qwPop/XsNxq3KXgFGM44ToytM6c/MPGluhuvOwbaacAO1HviMuen9qsVjk4kwcf3 jGYTbEPHATxyu5/6oKDTkQTYhzdwg3B2qHCiKMGw3l1kXhaQLFcaOivOLV8Sf3xh bj1070pkGe9OpqaVzMnwDtJ8rnsBl/Nt4wj9oiQPxbX71GYZAmcMIYn9WEkcKFI= =AR2D -----END PGP SIGNATURE----- Merge tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: "Smaller set of DM changes for this merge. I've based these changes on Jens' for-4.4/reservations branch because the associated DM changes required it. - Revert a dm-multipath change that caused a regression for unprivledged users (e.g. kvm guests) that issued ioctls when a multipath device had no available paths. - Include Christoph's refactoring of DM's ioctl handling and add support for passing through persistent reservations with DM multipath. - All other changes are very simple cleanups" * tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm switch: simplify conditional in alloc_region_table() dm delay: document that offsets are specified in sectors dm delay: capitalize the start of an delay_ctr() error message dm delay: Use DM_MAPIO macros instead of open-coded equivalents dm linear: remove redundant target name from error messages dm persistent data: eliminate unnecessary return values dm: eliminate unused "bioset" process for each bio-based DM device dm: convert ffs to __ffs dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() dm: add support for passing through persistent reservations dm: refactor ioctl handling Revert "dm mpath: fix stalls when handling invalid ioctls" dm: initialize non-blk-mq queue data before queue is used	2015-11-04 21:19:53 -08:00
Linus Torvalds	ac322de6bf	md updates for 4.4. Two major components to this update. 1/ the clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state. 2/ The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWNX9RAAoJEDnsnt1WYoG5bYMP/jI0pV3wcbs7mZQAa8S/V0lU 2l25x4MdwDvqVKMfjIc/C5J08QNgcrgSvhiVPCEOK0w18q395vep9f6gFKbMHhu/ lWU3PLHGw8XBHp5yEnxrpQkN0pRrNjh5NqIdlVMBNyL6u+RZPS2ZuzxJ8wiNAFg1 MypNkgoUu6s+nBp4DWWnMGYhBc+szBR+gTYAzGiZ8vqOH9uiSJ2SsGG5aRVUN/af oMYvJAf9aA6uj+xSzNlXIaLfWJIrshQYS1jU/W4gTm0DwK9yqbTxvubJaE0SGu/o 73FGU8tmQ6ELYfsp3D/jmfUkE7weiNEQhdVb/4wy1A/SGc+W7Ju9pxfhm8ra57s0 /BCkfwWZXEvx1flegXfK1mC6EMpMIcGAD2FQEhmQbW6wTdDwtNyEhIePDVGJwD/F rhEThFa+Dg9+xnBGnS6OUK3EpXgml2hAeAC7uA3TVSAnWd/9/Mpim6fZhqrB/v9L Ik0tZt+H4nxYaheZjKlKhuXUQYcUWGiMb67bGMem/YAlMa4y9C9qF+9mPXxyjVlI hBsd5SfZNz99DyB/bO8BumQeIWlTfzLeFzWW67eQ864LRKO6k0/VIbPZHCfn2oVG XvyC2fUhNOIURP3IMxcyHYxOA7Mu6EDsVVDTpuqLVbZQ5IPjDEfQ54yB/BLUvbX/ Gh2/tKn7Xc25HuLAFEbs =TD5o -----END PGP SIGNATURE----- Merge tag 'md/4.4' of git://neil.brown.name/md Pull md updates from Neil Brown: "Two major components to this update. 1) The clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state. 2) The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes. * tag 'md/4.4' of git://neil.brown.name/md: (66 commits) MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW MD: set journal disk ->raid_disk MD: kick out journal disk if it's not fresh raid5-cache: start raid5 readonly if journal is missing MD: add new bit to indicate raid array with journal raid5-cache: IO error handling raid5: journal disk can't be removed raid5-cache: add trim support for log MD: fix info output for journal disk raid5-cache: use bio chaining raid5-cache: small log->seq cleanup raid5-cache: new helper: r5_reserve_log_entry raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta raid5-cache: take rdev->data_offset into account early on raid5-cache: refactor bio allocation raid5-cache: clean up r5l_get_meta raid5-cache: simplify state machine when caches flushes are not needed raid5-cache: factor out a helper to run all stripes for an I/O unit raid5-cache: rename flushed_ios to finished_ios raid5-cache: free I/O units earlier ...	2015-11-04 21:12:47 -08:00
Linus Torvalds	527d1529e3	Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk	2015-11-04 20:51:48 -08:00
Linus Torvalds	af7eba0158	two more bug fixes for md. One bugfix for a list corruption in raid5 because of incorrect locking. Other for possible data corruption when a recovering device is failed, removed, and re-added. Both tagged for -stable. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWNAnCAAoJEDnsnt1WYoG5VzwP/jRfljaxLKUicIGEd9qeaei5 vVtmzPugthzYTfcJRm54ZufQOnjse/uXEBqmvoMJhBeEMbL2VIWbn1i7sMWjgq8W HVz/N/N8iT5DFWgzXmQQCaxIu17njbEmO8IelcNr3i6tE5wbHJ9WhV5UDGdQgwca 6XjQ8r8QjmN3uVaiNL4JcrEtZfpE8PqgBCJzsQYRZzOp9m3jJlgyJ9YWQA/VQ2Cj cVXs7tTGTdpPQdgRRFHF/Z/7H+KUcp9XV5ahDRwqDryQkZHuFvwyZFfpLspqvc5P OJio9WY0y93QmRRsuIC/ig0+CnxDXeqHiRwbprIMvKbPxxaGn4s24VioRDa0PRzT v9HcjtUhs9q8iTo4TZJrggD+rPm523a9iiDU6SbM2qlM3XUpBis/fjbLN82Nnas+ ADa0/DAsOE0I1WltoaaNUbweuflGw2NnFxnUueqq8dK23/Vabu3vw4tohiNp+/3m km8Is3j7lGKobQ3AKYIFlbeAqdsyASZkSDfA9IZr3SIeYBWgPljd9n+6BIPOqPNq S2HtOLke/XX2KEM0BHzgu4XliE9P9+B9lQETI4MehP0rMzTURGKh87du41YHckp1 beX22aeOuv//ok11JTs6M+StREKeTrl+dTStn0U1jt6HyMzseGaNoy3Eib5ePBCb C2ZZRz4OR2MWvWUFxKms =QYvv -----END PGP SIGNATURE----- Merge tag 'md/4.3-rc7-fixes' of git://neil.brown.name/md Pull md bug fixes from Neil Brown: "Two more bug fixes for md. One bugfix for a list corruption in raid5 because of incorrect locking. Other for possible data corruption when a recovering device is failed, removed, and re-added. Both tagged for -stable" * tag 'md/4.3-rc7-fixes' of git://neil.brown.name/md: Revert "md: allow a partially recovered device to be hot-added to an array." md/raid5: fix locking in handle_stripe_clean_event()	2015-10-31 21:20:49 -07:00
Song Liu	339421def5	MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW When RAID-4/5/6 array suffers from missing journal device, we put the array in read only state. We should not allow trasition to read-write states (clean and active) before replacing journal device. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	f2076e7d06	MD: set journal disk ->raid_disk Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of 0, because we already have a disk with ->raid_disk 0 and this causes sysfs entry creation conflict. A lot of places assumes disk with ->raid_disk >=0 is normal raid disk, so we add check for journal disk. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Song Liu	a3dfbdaadb	MD: kick out journal disk if it's not fresh When journal disk is faulty and we are reassemabling the raid array, the journal disk is old. We don't allow the journal disk added to the raid array. Since journal disk is missing in the array, the raid5 will mark the array readonly. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	7dde2ad3c5	raid5-cache: start raid5 readonly if journal is missing If raid array is expected to have journal (eg, journal is set in MD superblock feature map) and the array is started without journal disk, start the array readonly. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Song Liu	a97b789644	MD: add new bit to indicate raid array with journal If a raid array has journal feature bit set, add a new bit to indicate this. If the array is started without journal disk existing, we know there is something wrong. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	6e74a9cfb5	raid5-cache: IO error handling There are 3 places the raid5-cache dispatches IO. The discard IO error doesn't matter, so we ignore it. The superblock write IO error can be handled in MD core. The remaining are log write and flush. When the IO error happens, we mark log disk faulty and fail all write IO. Read IO is still allowed to run. Userspace will get a notification too and corresponding daemon can choose setting raid array readonly for example. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	c2bb6242ec	raid5: journal disk can't be removed raid5-cache uses journal disk rdev->bdev, rdev->mddev in several places. Don't allow journal disk disappear magically. On the other hand, we do need to update superblock for other disks to bump up ->events, so next time journal disk will be identified as stale. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	4b482044d2	raid5-cache: add trim support for log Since superblock is updated infrequently, we do a simple trim of log disk (a synchronous trim) Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	9efdca16e0	MD: fix info output for journal disk journal disk can be faulty. The Journal and Faulty aren't exclusive with each other. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Christoph Hellwig	6143e2cecb	raid5-cache: use bio chaining Simplify the bio completion handler by using bio chaining and submitting bios as soon as they are full. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	2b8ef16ec4	raid5-cache: small log->seq cleanup Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	c1b9919849	raid5-cache: new helper: r5_reserve_log_entry Factor out code to reserve log space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	51039cd066	raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta This is the only user, and keeping all code initializing the io_unit structure together improves readbility. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	1e932a37cc	raid5-cache: take rdev->data_offset into account early on Set up bi_sector properly when we allocate an bio instead of updating it at submission time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	b349feb36c	raid5-cache: refactor bio allocation Split out a helper to allocate a bio for log writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	22581f58ed	raid5-cache: clean up r5l_get_meta Remove the only partially used local 'io' variable to simplify the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	56fef7c6e0	raid5-cache: simplify state machine when caches flushes are not needed For devices without a volatile write cache we don't need to send a FLUSH command to ensure writes are stable on disk, and thus can avoid the whole step of batching up bios for processing by the MD thread. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	d8858f4321	raid5-cache: factor out a helper to run all stripes for an I/O unit Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Christoph Hellwig	04732f741d	raid5-cache: rename flushed_ios to finished_ios After this series we won't nessecarily have flushed the cache for these I/Os, so give the list a more neutral name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Christoph Hellwig	170364619a	raid5-cache: free I/O units earlier There is no good reason to keep the I/O unit structures around after the stripe has been written back to the RAID array. The only information we need is the log sequence number, and the checkpoint offset of the highest successfull writeback. Store those in the log structure, and free the IO units from __r5l_stripe_write_finished. Besides simplifying the code this also avoid having to keep the allocation for the I/O unit around for a potentially long time as superblock updates that checkpoint the log do not happen very often. This also fixes the previously incorrect calculation of 'free' in r5l_do_reclaim as a side effect: previous if took the last unit which isn't checkpointed into account. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	e6c033f79a	raid5-cache: move reclaim stop to quiesce Move reclaim stop to quiesce handling, where is safer for this stuff. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	ac6096e9d5	md: show journal for journal disk in disk state sysfs Journal disk state sysfs entry should indicate it's journal Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Song Liu	0b020e85bd	skip match_mddev_units check for special roles match_mddev_units is used to check whether 2 RAID arrays share same disk(s). Arrays that share disk(s) will not do resync at the same time for better performance (fewer HDD seek). However, this check should not apply to Spare, Faulty, and Journal disks, as they do not paticipate in resync. In this patch, match_mddev_units skips check for disks with flag "Faulty" or "Journal" or raid_disk < 0. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	253f9fd41a	raid5-cache: don't delay stripe captured in log There is a case a stripe gets delayed forever. 1. a stripe finishes construction 2. a new bio hits the stripe 3. handle_stripe runs for the stripe. The stripe gets DELAYED bit set since construction can't run for new bio (the stripe is locked since step 1) Without log, handle_stripe will call ops_run_io. After IO finishes, the stripe gets unlocked and the stripe will restart and run construction for the new bio. With log, ops_run_io need to run two times. If the DELAYED bit set, the stripe can't enter into the handle_list, so the second ops_run_io doesn't run, which leaves the stripe stalled. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	85f2f9a4f4	raid5-cache: check stripe finish out of order stripes could finish out of order. Hence r5l_move_io_unit_list() of __r5l_stripe_write_finished might not move any entry and leave stripe_end_ios list empty. This applies on top of http://marc.info/?l=linux-raid&m=144122700510667 Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	bd18f6462f	md: skip resync for raid array with journal If a raid array has journal, the journal can guarantee the consistency, we can skip resync after a unclean shutdown. The exception is raid creation or user initiated resync, which we still do a raid resync. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	828cbe989e	raid5-cache: optimize FLUSH IO with log enabled With log enabled, bio is written to raid disks after the bio is settled down in log disk. The recovery guarantees we can recovery the bio data from log disk, so we we skip FLUSH IO. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Christoph Hellwig	509ffec708	raid5-cache: move functionality out of __r5l_set_io_unit_state Just keep __r5l_set_io_unit_state as a small set the state wrapper, and remove r5l_set_io_unit_state entirely after moving the real functionality to the two callers that need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	0fd22b45b2	raid5-cache: fix a user-after-free bug r5l_compress_stripe_end_list() can free an io_unit. This breaks the assumption only reclaimer can free io_unit. We can add a reference count based io_unit free, but since only reclaim can wait io_unit becoming to STRIPE_END state, we use a simple global wait queue here. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	a8c34f9159	raid5-cache: switching to state machine for log disk cache flush Before we write stripe data to raid disks, we must guarantee stripe data is settled down in log disk. To do this, we flush log disk cache and wait the flush finish. That wait introduces sleep time in raid5d thread and impact performance. This patch moves the log disk cache flush process to the stripe handling state machine, which can remove the wait in raid5d. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	5c7e81c3de	raid5: enable log for raid array with cache disk Now log is safe to enable for raid array with cache disk Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	713cf5a639	raid5: don't allow resize/reshape with cache(log) support If cache(log) support is enabled, don't allow resize/reshape in current stage. In the future, we can flush all data from cache(log) to raid before resize/reshape and then allow resize/reshape. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	9c3e333d3f	raid5: disable batch with log enabled With log enabled, r5l_write_stripe will add the stripe to log. With batch, several stripes are linked together. The stripes must be in the same state. While with log, the log/reclaim unit is stripe, we can't guarantee the several stripes are in the same state. Disabling batch for log now. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	5cb2fbd6ea	raid5-cache: use crc32c checksum crc32c has lower overhead with cpu acceleration. It's a shame I didn't use it in first post, sorry. This changes disk format, but we are still ok in current stage. V2: delete unnecessary type conversion as pointed out by Bart Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>	2015-11-01 13:45:39 +11:00
Tomohiro Kusumi	aad9ae4550	dm switch: simplify conditional in alloc_region_table() The variable sctx->nr_regions has type unsigned long and the variable nr_regions has type sector_t. Thus the variables may be different when overflow happens. Changed the conditional to "if (nr_regions >= ULONG_MAX)". Also move the assignment of nr_regions after sector_div() and the sanity check which looks more sane. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:06 -04:00
Tomohiro Kusumi	f49e869a61	dm delay: document that offsets are specified in sectors Only delay params are mentioned in delay.txt. Mention offsets just like documents for linear and flakey do. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:05 -04:00
Tomohiro Kusumi	e213f33e4d	dm delay: capitalize the start of an delay_ctr() error message All other error messages start capitalized. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:04 -04:00
Tomohiro Kusumi	340c9ec09b	dm delay: Use DM_MAPIO macros instead of open-coded equivalents .map function of dm-delay returns return value of delay_bio(), hence it's supposed to return using a defined DM_MAPIO macro. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Acked-By: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:04 -04:00
Tomohiro Kusumi	00272c854e	dm linear: remove redundant target name from error messages Commit `72d94861` back in 2006 should have consistently removed "dm-linear: " from all error messages. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:03 -04:00
Mikulas Patocka	4c7da06f5a	dm persistent data: eliminate unnecessary return values dm_bm_unlock and dm_tm_unlock return an integer value but the returned value is always 0. The calling code sometimes checks the return value and sometimes doesn't. Eliminate these unnecessary return values and also the checks for them. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:02 -04:00
Mikulas Patocka	dbba42d8a9	dm: eliminate unused "bioset" process for each bio-based DM device Commit `54efd50bfd` ("block: make generic_make_request handle arbitrarily sized bios") makes it possible for block devices to process large bios. In doing so that commit allocates a new queue->bio_split bioset for each block device, this bioset is used for allocating bios when the driver needs to split large bios. Each bioset allocates a workqueue process, thus the above commit increases the number of processes allocated per block device. DM doesn't need the queue->bio_split bioset, thus we can deallocate it. This reduces the number of allocated processes per bio-based DM device from 3 to 2. Also remove the call to blk_queue_split(), it is not needed because DM does its own splitting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:02 -04:00
Mikulas Patocka	a3d939ae7b	dm: convert ffs to __ffs ffs counts bit starting with 1 (for the least significant bit), __ffs counts bits starting with 0. This patch changes various occurrences of ffs to __ffs and removes subtraction of 1 from the result. Note that __ffs (unlike ffs) is not defined when called with zero argument, but it is not called with zero argument in any of these cases. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:01 -04:00
Julia Lawall	6f65985e26	dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() Remove DM's unneeded NULL tests before calling these destroy functions, now that they check for NULL, thanks to these v4.3 commits: `3942d2991` ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()") `4e3ca3e03` ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()") The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) $kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy$(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:00 -04:00
Christoph Hellwig	71cdb6978a	dm: add support for passing through persistent reservations This adds support to pass through persistent reservation requests similar to the existing ioctl handling, and with the same limitations, e.g. devices may only have a single target attached. This is mostly intended for multipathing. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:05:59 -04:00
Christoph Hellwig	e56f81e0b0	dm: refactor ioctl handling This moves the call to blkdev_ioctl and the argument checking to DM core code, and only leaves a callout to find the block device to operate on in the targets. This simplifies the code and allows us to pass through ioctl-like command using other methods in the next patch. Also split out a helper around calling the prepare_ioctl method that will be reused for persistent reservation handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:05:59 -04:00
Mauricio Faria de Oliveira	47796938c4	Revert "dm mpath: fix stalls when handling invalid ioctls" This reverts commit `a1989b3300`. That commit introduced a regression at least for the case of the SG_IO ioctl() running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there are no active paths: the ioctl() fails with the ENOTTY errno immediately rather than blocking due to queue_if_no_path until a path becomes active, for example. That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]) from multipath devices; which leads to SCSI/filesystem errors in such a guest. More general scenarios can hit that regression too. The following demonstration employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective (some output & user changes omitted for brevity and comments added for clarity). Reverting that commit restores normal operation (queueing) in failing scenarios; tested on linux-next (next-20151022). 1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM) $ cat sg_simple0.c ... see [3] ... $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c $ gcc sgio_inquiry.c -o sgio_inquiry 2) The ioctl() works fine with active paths present. # multipath -l 85ag56 85ag56 (...) dm-19 IBM ,2145 size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw \|-+- policy='service-time 0' prio=0 status=active \| \|- 8:0:11:0 sdz 65:144 active undef running \| `- 9:0:9:0 sdbf 67:144 active undef running `-+- policy='service-time 0' prio=0 status=enabled \|- 8:0:12:0 sdae 65:224 active undef running `- 9:0:12:0 sdbo 68:32 active undef running $ ./sgio_inquiry /dev/mapper/85ag56 Some of the INQUIRY command's response: IBM 2145 0000 INQUIRY duration=0 millisecs, resid=0 3) The ioctl() fails with ENOTTY errno with _no_ active paths present, for unprivileged users (rather than blocking due to queue_if_no_path). # for path in $(multipath -l 85ag56 \| grep -o 'sd[a-z]\+'); \ do multipathd -k"fail path $path"; done # multipath -l 85ag56 85ag56 (...) dm-19 IBM ,2145 size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw \|-+- policy='service-time 0' prio=0 status=enabled \| \|- 8:0:11:0 sdz 65:144 failed undef running \| `- 9:0:9:0 sdbf 67:144 failed undef running `-+- policy='service-time 0' prio=0 status=enabled \|- 8:0:12:0 sdae 65:224 failed undef running `- 9:0:12:0 sdbo 68:32 failed undef running $ ./sgio_inquiry /dev/mapper/85ag56 sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device 4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285); it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl(). $ dmesg <...> [] device-mapper: multipath: Failing path 65:144. [] device-mapper: multipath: Failing path 67:144. [] device-mapper: multipath: Failing path 65:224. [] device-mapper: multipath: Failing path 68:32. [] sgio_inquiry: sending ioctl 2285 to a partition! 5) The ioctl() only works if the SYS_CAP_RAWIO capability is present (then queueing happens -- in this example, queue_if_no_path is set); this is due to a conditional check in scsi_verify_blk_ioctl(). # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56' sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device # ./sgio_inquiry /dev/mapper/85ag56 & [1] 72830 # cat /proc/72830/stack [<c00000171c0df700>] 0xc00000171c0df700 [<c000000000015934>] __switch_to+0x204/0x350 [<c000000000152d4c>] msleep+0x5c/0x80 [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170 [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0 [<c0000000003128e4>] block_ioctl+0x64/0xd0 [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780 [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0 [<c000000000009358>] system_call+0x38/0xd0 6) This is the function call chain exercised in this analysis: SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c -> do_vfs_ioctl() -> vfs_ioctl() ... error = filp->f_op->unlocked_ioctl(filp, cmd, arg); ... -> dm_blk_ioctl() @ drivers/md/dm.c -> multipath_ioctl() @ drivers/md/dm-mpath.c ... (bdev = NULL, due to no active paths) ... if (!bdev \|\| <...>) { int err = scsi_verify_blk_ioctl(NULL, cmd); if (err) r = err; } ... -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c ... if (bd && bd == bd->bd_contains) // not taken (bd = NULL) return 0; ... if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user) return 0; ... printk_ratelimited(KERN_WARNING "%s: sending ioctl %x to a partition!\n" <...>); return -ENOIOCTLCMD; <- ... return r ? : <...> <- ... if (error == -ENOIOCTLCMD) error = -ENOTTY; out: return error; ... Links: [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52 [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device') [3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03) Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-31 18:53:51 -04:00
NeilBrown	d01552a76d	Revert "md: allow a partially recovered device to be hot-added to an array." This reverts commit `7eb418851f`. This commit is poorly justified, I can find not discusison in email, and it clearly causes a problem. If a device which is being recovered fails and is subsequently re-added to an array, there could easily have been changes to the array before the point where the recovery was up to. So the recovery must start again from the beginning. If a spare is being recovered and fails, then when it is re-added we really should do a bitmap-based recovery up to the recovery-offset, and then a full recovery from there. Before this reversion, we only did the "full recovery from there" which is not corect. After this reversion with will do a full recovery from the start, which is safer but not ideal. It will be left to a future patch to arrange the two different styles of recovery. Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org (3.14+) Fixes: `7eb418851f` ("md: allow a partially recovered device to be hot-added to an array.")	2015-10-31 11:00:56 +11:00
Roman Gushchin	b8a9d66d04	md/raid5: fix locking in handle_stripe_clean_event() After commit `566c09c534` ("raid5: relieve lock contention in get_active_stripe()") __find_stripe() is called under conf->hash_locks + hash. But handle_stripe_clean_event() calls remove_hash() under conf->device_lock. Under some cirscumstances the hash chain can be circuited, and we get an infinite loop with disabled interrupts and locked hash lock in __find_stripe(). This leads to hard lockup on multiple CPUs and following system crash. I was able to reproduce this behavior on raid6 over 6 ssd disks. The devices_handle_discard_safely option should be set to enable trim support. The following script was used: for i in `seq 1 32`; do dd if=/dev/zero of=large$i bs=10M count=100 & done neilb: original was against a 3.x kernel. I forward-ported to 4.3-rc. This verison is suitable for any kernel since Commit: `59fc630b8b` ("RAID5: batch adjacent full stripe write") (v4.1+). I'll post a version for earlier kernels to stable. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Fixes: `566c09c534` ("raid5: relieve lock contention in get_active_stripe()") Signed-off-by: NeilBrown <neilb@suse.com> Cc: Shaohua Li <shli@kernel.org> Cc: <stable@vger.kernel.org> # 3.13 - 4.2	2015-10-31 10:53:50 +11:00
Mikulas Patocka	ad5f498f61	dm: initialize non-blk-mq queue data before queue is used Commit `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") moves the initialization of the fields backing_dev_info.congested_fn, backing_dev_info.congested_data and queuedata from the function dm_init_md_queue (that is called when the device is created) to dm_init_old_md_queue (that is called after the device type is determined). There is no locking when accessing these variables, thus it is possible for other parts of the kernel to briefly see this data in a transient state (e.g. queue->backing_dev_info.congested_fn initialized and md->queue->backing_dev_info.congested_data uninitialized, resulting in passing an incorrect parameter to the function dm_any_congested). This queue data is left initialized for blk-mq devices even though they that don't use it. Fixes: `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.1+	2015-10-29 22:09:40 -04:00
Linus Torvalds	ce6f988603	Some raid1/raid10 fixes. Two fixes for bugs that are in both raid1 and raid10. Both related to bad-block-lists and at least one needs to be back ported to 3.1. Also a revision for the "new" layout in raid10. This "new" code (which aims to improve robustness) actually reduces robustness in some cases. It probably isn't in use at all as not public user-space code makes use of these new layouts. However just in case someone has their own code, it would be good to get the WARNing out for them sooner. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWKxc0AAoJEDnsnt1WYoG5BkIP/0DmbcISl9eSMQt+k9E5B/IN hk2Z5Q3BeMi7VhjKJDsGsg+oQM1p2ef/vvfhx1lgk34jiVrELRPdzvIlnv0XeQ7y NwMGkV7KKTbrzvK42eR6XhpO1UdJ3FIiC2RCH/5fmRai5JqgQ4jRWzX4wGDL2p5d ZKK63KUjnlrqrLtch/kAxeynQbAWhtefzRKfspiUVtnaLD9sIhUwMS+IZDPYHbhd YMowQEquQW0uEmiyX0j/XNgw14yLau5zXjSZ0SvtDfa+IAiAlHQWpxhatA3Vj0NA xxKrUjYD41Rkzrm9dLfRtgG1U8Wq51q12wg6McY+i1glrR8d6AISe8PczfQsqRWu TjKPHqfhENemSMHOxW+8NB9K6BXV7W/rCH4t6iUMG00KTGhVJNPt0T94yYI1p2Yc Fs+dR6rYILS5whXbRpeLAVZ4Np53eka9O/Wo2qoujPgIOfNrG/Ed3Lfqylb7jk7Z B8jalgn+99Bok9DuCg/HFtGLrU3KN/BjWdet9YX/Z8zBQifrfroATLOq5PJtuSpI 0STtZ6cOZYwvb70XC1w6eNPgQxz6rzJbPHDjwZ0woKYe4Bh+ZtCBJq3ufwJ/rqRV OXmCceFO8KtK3/zJqeZOd0eNEkxiXDaKfJ0Ut6t7/kumCflE/tS4lyOpu8ptdT7s hnATXrkvrL+6vtT1owAJ =96PZ -----END PGP SIGNATURE----- Merge tag 'md/4.3-rc6-fixes' of git://neil.brown.name/md Pull md fixes from Neil Brown: "Some raid1/raid10 fixes. I meant to get this to you before -rc7, but what with all the travel plans.. Two fixes for bugs that are in both raid1 and raid10. Both related to bad-block-lists and at least one needs to be back ported to 3.1. Also a revision for the "new" layout in raid10. This "new" code (which aims to improve robustness) actually reduces robustness in some cases. It probably isn't in use at all as not public user-space code makes use of these new layouts. However just in case someone has their own code, it would be good to get the WARNing out for them sooner" * tag 'md/4.3-rc6-fixes' of git://neil.brown.name/md: md/raid10: fix the 'new' raid10 layout to work correctly. md/raid10: don't clear bitmap bit when bad-block-list write fails. md/raid1: don't clear bitmap bit when bad-block-list write fails. md/raid10: submit_bio_wait() returns 0 on success md/raid1: submit_bio_wait() returns 0 on success	2015-10-27 07:41:48 +09:00
Shaohua Li	355810d12a	raid5: log recovery This is the log recovery support. The process is quite straightforward. We scan the log and read all valid meta/data/parity into memory. If a stripe's data/parity checksum is correct, the stripe will be recoveried. Otherwise, it's discarded and we don't scan the log further. The reclaim process guarantees stripe which starts to be flushed raid disks has completed data/parity and has correct checksum. To recovery a stripe, we just copy its data/parity to corresponding raid disks. The trick thing is superblock update after recovery. we can't let superblock point to last valid meta block. The log might look like: \| meta 1\| meta 2\| meta 3\| meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If superblock points to meta 1, we write a new valid meta 2n. If crash happens again, new recovery will start from meta 1. Since meta 2n is valid, recovery will think meta 3 is valid, which is wrong. The solution is we create a new meta in meta2 with its seq == meta 1's seq + 10 and let superblock points to meta2. recovery will not think meta 3 is a valid meta, because its seq is wrong Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	0576b1c618	raid5: log reclaim support This is the reclaim support for raid5 log. A stripe write will have following steps: 1. reconstruct the stripe, read data/calculate parity. ops_run_io prepares to write data/parity to raid disks 2. hijack ops_run_io. stripe data/parity is appending to log disk 3. flush log disk cache 4. ops_run_io run again and do normal operation. stripe data/parity is written in raid array disks. raid core can return io to upper layer. 5. flush cache of all raid array disks 6. update super block 7. log disk space used by the stripe can be reused In practice, several stripes consist of an io_unit and we will batch several io_unit in different steps, but the whole process doesn't change. It's possible io return just after data/parity hit log disk, but then read IO will need read from log disk. For simplicity, IO return happens at step 4, where read IO can directly read from raid disks. Currently reclaim run if there is specific reclaimable space (1/4 disk size or 10G) or we are out of space. Reclaim is just to free log disk spaces, it doesn't impact data consistency. The size based force reclaim is to make sure log isn't too big, so recovery doesn't scan log too much. Recovery make sure raid disks and log disk have the same data of a stripe. If crash happens before 4, recovery might/might not recovery stripe's data/parity depending on if data/parity and its checksum matches. In either case, this doesn't change the syntax of an IO write. After step 3, stripe is guaranteed recoverable, because stripe's data/parity is persistent in log disk. In some cases, log disk content and raid disks content of a stripe are the same, but recovery will still copy log disk content to raid disks, this doesn't impact data consistency. space reuse happens after superblock update and cache flush. There is one situation we want to avoid. A broken meta in the middle of a log causes recovery can't find meta at the head of log. If operations require meta at the head persistent in log, we must make sure meta before it persistent in log too. The case is stripe data/parity is in log and we start write stripe to raid disks (before step 4). stripe data/parity must be persistent in log before we do the write to raid disks. The solution is we restrictly maintain io_unit list order. In this case, we only write stripes of an io_unit to raid disks till the io_unit is the first one whose data/parity is in log. The io_unit list order is important for other cases too. For example, some io_unit are reclaimable and others not. They can be mixed in the list, we shouldn't reuse space of an unreclaimable io_unit. Includes fixes to problems which were... Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	f6bed0ef0a	raid5: add basic stripe log This introduces a simple log for raid5. Data/parity writing to raid array first writes to the log, then write to raid array disks. If crash happens, we can recovery data from the log. This can speed up raid resync and fix write hole issue. The log structure is pretty simple. Data/meta data is stored in block unit, which is 4k generally. It has only one type of meta data block. The meta data block can track 3 types of data, stripe data, stripe parity and flush block. MD superblock will point to the last valid meta data block. Each meta data block has checksum/seq number, so recovery can scan the log correctly. We store a checksum of stripe data/parity to the metadata block, so meta data and stripe data/parity can be written to log disk together. otherwise, meta data write must wait till stripe data/parity is finished. For stripe data, meta data block will record stripe data sector and size. Currently the size is always 4k. This meta data record can be made simpler if we just fix write hole (eg, we can record data of a stripe's different disks together), but this format can be extended to support caching in the future, which must record data address/size. For stripe parity, meta data block will record stripe sector. It's size should be 4k (for raid5) or 8k (for raid6). We always store p parity first. This format should work for caching too. flush block indicates a stripe is in raid array disks. Fixing write hole doesn't need this type of meta data, it's for caching extension. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	b70abcb247	raid5: add a new state for stripe log handling When a stripe finishes construction, we write the stripe to raid in ops_run_io normally. With log, we do a bunch of other operations before the stripe is written to raid. Mainly write the stripe to log disk, flush disk cache and so on. The operations are still driven by raid5d and run in the stripe state machine. We introduce a new state for such stripe (trapped into log). The stripe is in this state from the time it first enters ops_run_io (finish construction) to the time it is written to raid. Since we know the state is only for log, we bypass other check/operation in handle_stripe. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	6d036f7d52	raid5: export some functions Next several patches use some raid5 functions, rename them with raid5 prefix and export out. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Shaohua Li	3069aa8def	md: override md superblock recovery_offset for journal device Journal device stores data in a log structure. We need record the log start. Here we override md superblock recovery_offset for this purpose. This field of a journal device is meaningless otherwise. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Song Liu	bac624f3f8	MD: add a new disk role to present write journal device Next patches will use a disk as raid5/6 journaling. We need a new disk role to present the journal device and add MD_FEATURE_JOURNAL to feature_map for backward compability. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Song Liu	c4d4c91b44	MD: replace special disk roles with macros Add the following two macros for special roles: spare and faulty MD_DISK_ROLE_SPARE 0xffff MD_DISK_ROLE_FAULTY 0xfffe Add MD_DISK_ROLE_MAX 0xff00 as the maximal possible regular role, and minimal value of special role. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Goldwyn Rodrigues	28c1b9fdf4	md-cluster: Call update_raid_disks() if another node --grow's raid_disks To incorporate --grow feature executed on one node, other nodes need to acknowledge the change in number of disks. Call update_raid_disks() to update internal data structures. This leads to call check_reshape() -> md_allow_write() -> md_update_sb(), this results in a deadlock. This is done so it can safely allocate memory (which might trigger writeback which might write to raid1). This is not required for md with a bitmap. In the clustered case, we don't perform md_update_sb() in md_allow_write(), but in do_md_run(). Also we disable safemode for clustered mode. mddev->recovery_cp need not be set in check_sb_changes() because this is required only when a node reads another node's bitmap. mddev->recovery_cp (which is read from sb->resync_offset), is set only if mddev is in_sync. Since we disabled safemode, in_sync is set to zero. In a clustered environment, the MD may not be in sync because another node could be writing to it. So make sure that in_sync is not set in case of clustered node in __md_stop_writes(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
NeilBrown	30661b49be	md-cluster: remove mddev arg from add_resync_info() The arg isn't used, so its presence is only confusing. Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
NeilBrown	2e2a7cd96f	md-cluster: don't cast void pointers when assigning them. It is common practice in the kernel to leave out this case. It isn't needed and adds little if any value. Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
NeilBrown	823815238f	md-cluster: discard unused sb_mutex. Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Guoqing Jiang	cf97a348c8	md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__ This patches fixes sparse warnings like incorrect type in assignment (different base types), cast to restricted __le64. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
NeilBrown	8bce6d35b3	md/raid10: fix the 'new' raid10 layout to work correctly. In Linux 3.9 we introduce a new 'far' layout for RAID10 which was supposed to rotate the replicas differently and so provide better resilience. In particular it could survive more combinations of 2 drive failures. Unfortunately. due to a coding error, this some did what was wanted, sometimes improved less than we hoped, and sometimes - in very unlikely circumstances - put multiple replicas on the same device so the redundancy was harmed. No public user-space tool has created arrays using this layout so it is very unlikely that zero-redundancy arrays actually exist. Probably no arrays using any form of the new layout exist. But we cannot be certain. So use another bit in the 'layout' number and introduce a bug-fixed version of the layout. Also when assembling an array, if it has a zero-redundancy layout, give a warning. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 16:24:25 +11:00
NeilBrown	c340702ca2	md/raid10: don't clear bitmap bit when bad-block-list write fails. When a write fails and a bad-block-list is present, we can update the bad-block-list instead of writing the data. If this succeeds then it is OK clear the relevant bitmap-bit as no further 'sync' of the block is needed. However if writing the bad-block-list fails then we need to treat the write as failed and particularly must not clear the bitmap bit. Otherwise the device can be re-added (after any hardware connection issues are resolved) and because the relevant bit in the bitmap is clear, that block will not be resynced. This leads to data corruption. We already delay the final bio_endio() on the write until the bad-block-list is written so that when the write returns: either that data is safe, the bad-block record is safe, or the fact that the device is faulty is safe. However we don't delay the clearing of the bitmap, so the bitmap bit can be recorded as cleared before we know if the bad-block-list was written safely. So: delay that until the write really is safe. i.e. move the call to close_write() until just before calling bio_endio(), and recheck the 'is array degraded' status before making that call. This bug goes back to v3.1 when bad-block-lists were introduced, though it only affects arrays created with mdadm-3.3 or later as only those have bad-block lists. Backports will require at least Commit: `95af587e95` ("md/raid10: ensure device failure recorded before write request returns.") as well. I'll send that to 'stable' separately. Note that of the two tests of R10BIO_WriteError that this patch adds, the first is certain to fail and the second is certain to succeed. However doing it this way makes the patch more obviously correct. I will tidy the code up in a future merge window. Reported-by: Nate Dailey <nate.dailey@stratus.com> Fixes: `bd870a16c5` ("md/raid10: Handle write errors by updating badblock log.") Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 16:24:23 +11:00
NeilBrown	bd8688a199	md/raid1: don't clear bitmap bit when bad-block-list write fails. When a write fails and a bad-block-list is present, we can update the bad-block-list instead of writing the data. If this succeeds then it is OK clear the relevant bitmap-bit as no further 'sync' of the block is needed. However if writing the bad-block-list fails then we need to treat the write as failed and particularly must not clear the bitmap bit. Otherwise the device can be re-added (after any hardware connection issues are resolved) and because the relevant bit in the bitmap is clear, that block will not be resynced. This leads to data corruption. We already delay the final bio_endio() on the write until the bad-block-list is written so that when the write returns: either that data is safe, the bad-block record is safe, or the fact that the device is faulty is safe. However we don't delay the clearing of the bitmap, so the bitmap bit can be recorded as cleared before we know if the bad-block-list was written safely. So: delay that until the write really is safe. i.e. move the call to close_write() until just before calling bio_endio(), and recheck the 'is array degraded' status before making that call. This bug goes back to v3.1 when bad-block-lists were introduced, though it only affects arrays created with mdadm-3.3 or later as only those have bad-block lists. Backports will require at least Commit: `55ce74d4bf` ("md/raid1: ensure device failure recorded before write request returns.") as well. I'll send that to 'stable' separately. Note that of the two tests of R1BIO_WriteError that this patch adds, the first is certain to fail and the second is certain to succeed. However doing it this way makes the patch more obviously correct. I will tidy the code up in a future merge window. Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com> Cc: Jes Sorensen <Jes.Sorensen@redhat.com> Fixes: `cd5ff9a16f` ("md/raid1: Handle write errors by updating badblock log.") Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 16:24:22 +11:00
Joe Thornber	3201ac452e	dm cache: the CLEAN_SHUTDOWN flag was not being set If the CLEAN_SHUTDOWN flag is not set when a cache is loaded then all cache blocks are marked as dirty and a full writeback occurs. __commit_transaction() is responsible for setting/clearing CLEAN_SHUTDOWN (based the flags_mutator that is passed in). Fix this issue, of the cache's on-disk flags being wrong, by making sure __commit_transaction() does not reset the flags after the mutator has altered the flags in preparation for them being serialized to disk. before: sb_flags = mutator(le32_to_cpu(disk_super->flags)); disk_super->flags = cpu_to_le32(sb_flags); disk_super->flags = cpu_to_le32(cmd->flags); after: disk_super->flags = cpu_to_le32(cmd->flags); sb_flags = mutator(le32_to_cpu(disk_super->flags)); disk_super->flags = cpu_to_le32(sb_flags); Reported-by: Bogdan Vasiliev <bogdan.vasiliev@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-23 14:02:56 -04:00
Mike Snitzer	4dcb8b57df	dm btree: fix leak of bufio-backed block in btree_split_beneath error path btree_split_beneath()'s error path had an outstanding FIXME that speaks directly to the potential for _not_ cleaning up a previously allocated bufio-backed block. Fix this by releasing the previously allocated bufio block using unlock_block(). Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com> Cc: stable@vger.kernel.org	2015-10-23 14:02:55 -04:00
Joe Thornber	2871c69e02	dm btree remove: fix a bug when rebalancing nodes after removal Commit `4c7e309340` ("dm btree remove: fix bug in redistribute3") wasn't a complete fix for redistribute3(). The redistribute3 function takes 3 btree nodes and shares out the entries evenly between them. If the three nodes in total contained (MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in the center. Fix this issue by being more careful about calculating the target number of entries for the left and right nodes. Unit tested in userspace using this program: https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-23 14:02:55 -04:00
Dan Williams	c7bfced9a6	md: suspend i/o during runtime blk_integrity_unregister Synchronize pending i/o against a change in the integrity profile to avoid the possibility of spurious integrity errors. Given linear_add() is suspending the mddev before manipulating the mddev, do the same for the other personalities. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:38 -06:00
Dan Williams	9609b9942b	md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown Now that the integrity profile is statically allocated there is no work to do when shutting down an integrity enabled block device. Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <JBottomley@Odin.com> Acked-by: NeilBrown <neilb@suse.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:37 -06:00
Martin K. Petersen	25520d55cd	block: Inline blk_integrity in struct gendisk Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:42:42 -06:00
Jes Sorensen	681ab46960	md/raid10: submit_bio_wait() returns 0 on success This was introduced with `9e882242c6` which changed the return value of submit_bio_wait() to return != 0 on error, but didn't update the caller accordingly. Fixes: `9e882242c6` ("block: Add submit_bio_wait(), remove from md") Cc: stable@vger.kernel.org (v3.10) Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-21 07:24:29 +11:00
Jes Sorensen	203d27b022	md/raid1: submit_bio_wait() returns 0 on success This was introduced with `9e882242c6` which changed the return value of submit_bio_wait() to return != 0 on error, but didn't update the caller accordingly. Fixes: `9e882242c6` ("block: Add submit_bio_wait(), remove from md") Cc: stable@vger.kernel.org (v3.10) Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-21 07:20:15 +11:00
NeilBrown	ba2746b0fa	md-cluster: metadata_update_finish: consistently use cmsg.raid_slot as le32 As cmsg.raid_slot is le32, comparing for >0 is not meaningful. So introduce cpu-endian 'raid_slot' and only assign to cmsg.raid_slot when we know value is valid. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-16 13:48:35 +11:00
NeilBrown	c2a06c38d9	Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next md-cluster: A better way for METADATA_UPDATED processing The processing of METADATA_UPDATED message is too simple and prone to errors. Besides, it would not update the internal data structures as required. This set of patches reads the superblock from one of the device of the MD and checks for changes in the in-memory data structures. If there is a change, it performs the necessary actions to keep the internal data structures as it would be in the primary node. An example is if a devices turns faulty. The algorithm is: 1. The initiator node marks the device as faulty and updates the superblock 2. The initiator node sends METADATA_UPDATED with an advisory device number to the rest of the nodes. 3. The receiving node on receiving the METADATA_UPDATED message 3.1 Reads the superblock 3.2 Detects a device has failed by comparing with memory structure 3.3 Calls the necessary functions to record the failure and get the device out of the active array. 3.4 Acknowledges the message. The patch series also fixes adding the disk which was impacted because of the changes. Patches can also be found at https://github.com/goldwynr/linux branch md-next Changes since V2: - Fix status synchrnoization after --add and --re-add operations - Included Guoqing's patches on endian correctness, zeroing cmsg etc - Restructure add_new_disk() and cancel()	2015-10-14 07:09:52 +11:00
Mike Snitzer	ba30670f4d	dm thin: fix missing pool reference count decrement in pool_ctr error path Fixes: `ac8c3f3df` ("dm thin: generate event when metadata threshold passed") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.10+	2015-10-13 12:20:55 -04:00
Sudip Mukherjee	a2a678ed4d	dm snapshot persistent: fix missing cleanup in persistent_ctr error path If an unsupported option is given then the early return from persistent_ctr() leaked memory allocated for the 'pstore' and never destroyed the 'metadata_wq'. Fixes: `b0d3cc011e` ("dm snapshot: add new persistent store option to support overflow") Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-13 12:20:54 -04:00
Guoqing Jiang	23b63f9fa8	md: check the return value for metadata_update_start We shouldn't run related funs of md_cluster_ops in case metadata_update_start returned failure. Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	a9720903d1	md-cluster: only call kick_rdev_from_array after remove disk successfully For cluster raid, we should not kick it from array if the disk can't be remove from array successfully. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	86b572770e	md-cluster: Add 'SUSE' as author for md-cluster.c Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	aee177ac5a	md-cluster: zero cmsg before it was sent Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	256f5b245a	md-cluster: make sure the node do not receive it's own msg During the past test, the node occasionally received the msg which is sent from itself, this case should not happen in theory, but it is better to avoid it in case something wrong happened. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 11:58:14 -05:00
Guoqing Jiang	487cf9142c	md-cluster: remove unnecessary setting for slot Since slot will be set within _sendmsg, we can remove the redundant code in resync_info_update. Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:14 -05:00
Guoqing Jiang	faeff83fa4	md-cluster: make other members of cluster_msg is handled by little endian funcs Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:14 -05:00
Goldwyn Rodrigues	d216711bed	md-cluster: Do not printk() every received message The receive daemon prints kernel messages for every network message received. This would fill the kernel message log with unnecessary messages. Remove the pr_info() messages. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 11:58:00 -05:00
Goldwyn Rodrigues	dbb64f8635	md-cluster: Fix adding of new disk with new reload code Adding the disk worked incorrectly with the new reload code. Fix it: - No operation should be performed on rdev marked as Candidate - After a metadata update operation, kick disk if role is 0xfffe else clear Candidate bit and continue with the regular change check. - Saving the mode of the lock resource to check if token lock is already locked, because it can be called twice while adding a disk. However, unlock_comm() must be called only once. - add_new_disk() is called by the node initiating the --add operation. If it needs to be canceled, call add_new_disk_cancel(). The operation is completed by md_update_sb() which will write and unlock the communication. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 03:35:30 -05:00
Goldwyn Rodrigues	c186b128cd	md-cluster: Perform resync/recovery under a DLM lock Resync or recovery must be performed by only one node at a time. A DLM lock resource, resync_lockres provides the mutual exclusion so that only one node performs the recovery/resync at a time. If a node is unable to get the resync_lockres, because recovery is being performed by another node, it set MD_RECOVER_NEEDED so as to schedule recovery in the future. Remove the debug message in resync_info_update() used during development. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 03:32:44 -05:00
Goldwyn Rodrigues	2aa82191ac	md-cluster: Perform a lazy update In a clustered environment, a change such as marking a device faulty, can be recorded by any of the nodes. This is communicated to all the nodes and re-recording such a change is unnecessary, and quite often pretty disruptive. With this patch, just before the update, we detect for the changes and if the changes are already in superblock, we abort the update after clearing all the flags Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 01:37:17 -05:00
Goldwyn Rodrigues	70bcecdb15	md-cluster: Improve md_reload_sb to be less error prone md_reload_sb is too simplistic and it explicitly needs to determine the changes made by the writing node. However, there are multiple areas where a simple reload could fail. Instead, read the superblock of one of the "good" rdevs and update the necessary information: - read the superblock into a newly allocated page, by temporarily swapping out rdev->sb_page and calling ->load_super. - if that fails return - if it succeeds, call check_sb_changes 1. iterates over list of active devices and checks the matching dev_roles[] value. If that is 'faulty', the device must be marked as faulty - call md_error to mark the device as faulty. Make sure not to set CHANGE_DEVS and wakeup mddev->thread or else it would initiate a resync process, which is the responsibility of the "primary" node. - clear the Blocked bit - Call remove_and_add_spares() to hot remove the device. If the device is 'spare': - call remove_and_add_spares() to get the number of spares added in this operation. - Reduce mddev->degraded to mark the array as not degraded. 2. reset recovery_cp - read the rest of the rdevs to update recovery_offset. If recovery_offset is equal to MaxSector, call spare_active() to set it In_sync This required that recovery_offset be initialized to MaxSector, as opposed to zero so as to communicate the end of sync for a rdev. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 01:34:48 -05:00
Goldwyn Rodrigues	2910ff17d1	md: remove_and_add_spares() to activate specific rdev remove_and_add_spares() checks for all devices to activate spare. Change it to activate a specific device if a non-null rdev argument is passed. remove_and_add_spares() can be used to activate spares in slot_store() as well. For hot_remove_disk(), check if rdev->raid_disk == -1 before calling remove_and_add_spares() Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 01:33:58 -05:00
Goldwyn Rodrigues	b8ca846e45	md-cluster: Wake up suspended process When the suspended_area is deleted, the suspended processes must be woken up in order to complete their I/O. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 01:33:35 -05:00
Guoqing Jiang	099954119d	md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster Previously, BITMAP_NEEDS_SYNC message is sent when the resyc aborts, but it could abort for different reasons, and not all of reasons require another node to take over the resync ownship. It is better make BITMAP_NEEDS_SYNC message only be sent when the node is leaving cluster with dirty bitmap. And we also need to ensure dlm connection is ok. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-12 01:32:27 -05:00
Goldwyn Rodrigues	c40f341f1e	md-cluster: Use a small window for resync Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window (32M) is maintained in r1conf as cluster_sync_low and cluster_sync_high and processed in raid1's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Check if the sync will fit in the new window, if not issue a wait_barrier() and set cluster_sync_low to sector_nr. 3. Set cluster_sync_high to cluster_sync_low + resync_window. 4. Send a message to all nodes so they may add it in their suspension list. bitmap_cond_end_sync is modified to allow to force a sync inorder to get the curr_resync_completed uptodate with the sector passed. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-10-12 01:32:05 -05:00
Goldwyn Rodrigues	3c462c880b	md: Increment version for clustered bitmaps Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels to assemble a clustered device. In order to maximize compatibility, the major version is set to BITMAP_MAJOR_CLUSTERED only if the bitmap is clustered. Added MD_FEATURE_CLUSTERED in order to return error for older kernels which would assemble MD even if the bitmap is corrupted. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-12 01:31:33 -05:00
Goldwyn Rodrigues	9ed38ff530	md-cluster: complete all write requests before adding suspend_info process_suspend_info - which handles the RESYNCING request - must not reply until all writes which were initiated before the request arrived, have completed. As a by-product, all process_* functions now take mddev as their first arguement making it uniform. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-12 01:29:59 -05:00
Linus Torvalds	f24fe98df8	One bug fix for raid1/raid10. Very careless bug earler in 4.3-rc, now fixed :-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWGYrdAAoJEDnsnt1WYoG56vcP/03E7DoycDQ7uH46i2szf3M/ isy0dk+2S9D87iBz/xf4RXvAY7FTHy9/vIG/o6UKYkhSRzm6T6xCbrwd+duS2TSc yQM33BJ1VdM+trGj5ywrdF8guwRMjW4NFPnez16moVSVZDbNK2pUdZiw8kGSi39n hpjftyefojISG6rbDGBGK2JiVTNOqDjMH2Ny8MhX2J5ryQQOsd6+9ojgri3nfTbP 6PmP08QyVxdYA3ZUlTZaKUNZ8AQHgoydhiEyGbdCewcE8pYaeEUqvcBi4DrDOil8 9BGHnf755Wl3k26P8uBsvri9zp+SZl0LEZLhSpyFpRmCTaFGn0pnSKJ0intnRTPc JZ7gTY6q1Bt5DXToZw7hHVWgxjos8aweS2JLzSJloB6FFlDCckypkvSR4GQL7R9N jIYntfwaQaJIgUSzVo/Aw6vjBTWbqyLHf3DP8ImsPSe/z0gjtRiyPkjoZgthuYp2 ErLoVe/JgKstR0gmobbdRhShIfXMFVAIwasXOXfq4Ye4LRwvfAwP2UDHrC25mb0O IJi6fMqf3bWxmLIzFUcTe8Z2nzuKolAgP2rcd6kb0bbLxE4Y5xtzCV8fgnhk2obw HvP4zZnacLKx8Nvet+YGUKjVJU3wx4RTgyGLU4WqC13fwZREeJLWwxgK859ZJ8yl k+TQud5fKgfkX20+eTA0 =qdcM -----END PGP SIGNATURE----- Merge tag 'md/4.3-rc4-fix' of git://neil.brown.name/md Pull md bugfix from Neil Brown: "One bug fix for raid1/raid10. Very careless bug earler in 4.3-rc, now fixed :-)" * tag 'md/4.3-rc4-fix' of git://neil.brown.name/md: crash in md-raid1 and md-raid10 due to incorrect list manipulation	2015-10-11 09:35:51 -07:00
Linus Torvalds	0444555670	3 stable@ fixes: - DM core AB-BA deadlock fix in the device destruction path (vs device creation's DM table swap). - DM raid fix to properly round up the region_size to the next power-of-2. - DM cache fix for a NULL pointer seen while switching from the "cleaner" cache policy. 2 fixes for regressions introduced during the 4.3 merge: - request-based DM error propagation regressed due to incorrect changes introduced when adding the bi_error field to bio. - DM snapshot fix to only support snapshots that overflow if the client (e.g. lvm2) is prepared to deal with the associated snapshot status interface change. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJWGC/jAAoJEMUj8QotnQNaTgYIAJz1AG5IcHz8D3zi8+MBWXFL WAYrXfXSxexsymVKFsqi6z9fYiW5fRZ41/+Kl8/dYnhBIS8uUzWlad2qw/JFg+zC l/EzdHWjakzuGm9/quK2h/CBC/3pmRH9UeKgzOPODOpAzkJfrKoO4/J7JPIi3JyP esE/2F2TBwERL4oC74UB7/nuM/xckS/DRjbd3B82/IsfM5n+MARvuSSrqWcPEu8h Hh5k42KyA+Tq7uElLnXF8phFOCJCn9IyI+QLdxj33PfDxwrtXMvV6Sxw7FS8b7oF /gw3Dod4sEv+EJZ1A+O9mxGBk3ajCpMvUYbcY6owIHyB1mKWiSKyvyBPyIY6RiQ= =2z9t -----END PGP SIGNATURE----- Merge tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull dm fixes from Mike Snitzer: "Three stable fixes: - DM core AB-BA deadlock fix in the device destruction path (vs device creation's DM table swap). - DM raid fix to properly round up the region_size to the next power-of-2. - DM cache fix for a NULL pointer seen while switching from the "cleaner" cache policy. Two fixes for regressions introduced during the 4.3 merge: - request-based DM error propagation regressed due to incorrect changes introduced when adding the bi_error field to bio. - DM snapshot fix to only support snapshots that overflow if the client (e.g. lvm2) is prepared to deal with the associated snapshot status interface change" * tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm snapshot: add new persistent store option to support overflow dm cache: fix NULL pointer when switching from cleaner policy dm: fix request-based dm error reporting dm raid: fix round up of default region size dm: fix AB-BA deadlock in __dm_destroy()	2015-10-09 16:58:11 -07:00
Mike Snitzer	b0d3cc011e	dm snapshot: add new persistent store option to support overflow Commit `76c44f6d80` introduced the possibly for "Overflow" to be reported by the snapshot device's status. Older userspace (e.g. lvm2) does not handle the "Overflow" status response. Fix this incompatibility by requiring newer userspace code, that can cope with "Overflow", request the persistent store with overflow support by using "PO" (Persistent with Overflow) for the snapshot store type. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Fixes: `76c44f6d80` ("dm snapshot: don't invalidate on-disk image on snapshot write overflow") Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-09 16:57:03 -04:00
Joe Thornber	2bffa1503c	dm cache: fix NULL pointer when switching from cleaner policy The cleaner policy doesn't make use of the per cache block hint space in the metadata (unlike the other policies). When switching from the cleaner policy to mq or smq a NULL pointer crash (in dm_tm_new_block) was observed. The crash was caused by bugs in dm-cache-metadata.c when trying to skip creation of the hint btree. The minimal fix is to change hint size for the cleaner policy to 4 bytes (only hint size supported). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-09 09:16:29 -04:00
Mikulas Patocka	a452744bcb	crash in md-raid1 and md-raid10 due to incorrect list manipulation The commit `55ce74d4bf` (md/raid1: ensure device failure recorded before write request returns) is causing crash in the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100% reproducible. The reason for the crash is that the newly added code in raid1d moves the list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and then incorrectly pops the bio from conf->bio_end_io_list (which is empty because the list was alrady moved). Raid-10 has a similar bug. Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000) CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35 task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001001111111000001111 Not tainted r00-03 000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38 r04-07 000000001059d000 000000007f16be08 0000000000200200 0000000000000001 r08-11 000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000 r12-15 000000004056f320 0000000000000000 0000000000013dd0 0000000000000000 r16-19 00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001 r20-23 000000000800000f 0000000042200390 0000000000000000 0000000000000000 r24-27 0000000000000001 000000000800000f 000000007f16be08 000000001059d000 r28-31 0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000 sr00-03 0000000000249800 0000000000000000 0000000000000000 0000000000249800 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620 IIR: 0f8010c6 ISR: 0000000000000000 IOR: 0000000100000000 CPU: 3 CR30: 000000006ccb8000 CR31: 0000000000000000 ORIG_R28: 000000001059d000 IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1] IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1] RP(r2): raid_end_bio_io+0x88/0x168 [raid1] Backtrace: [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1] [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1] [<000000004017fd5c>] kthread+0x144/0x160 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `55ce74d4bf` ("md/raid1: ensure device failure recorded before write request returns.") Fixes: `95af587e95` ("md/raid10: ensure device failure recorded before write request returns.") Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-09 08:33:46 +11:00
Junichi Nomura	50887bd139	dm: fix request-based dm error reporting end_clone_bio() is a endio callback for clone bio and should check and save the clone's bi_error for error reporting. However, `4246a0b63b` ("block: add a bi_error field to struct bio") changed the function to check the original bio's bi_error, which is 0. Without this fix, clone's error is ignored and reported to the original request as success. Thus data corruption will be observed. Fixes: `4246a0b63b` ("block: add a bi_error field to struct bio") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-06 10:08:16 -04:00
Linus Torvalds	15ecf9a986	Assorted fixes for md in 4.3-rc Two tagged for -stable One is really a cleanup to match and improve kmemcache interface. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWDjJNAAoJEDnsnt1WYoG5bOkP/ioJ8DZWkobOWSpnjbNCKIyg xrX3FlTq8MJHPfeqGDzfznjYTZ7vb9ZYkZNkn1HUIOXKCkG0hqr1GL1eVZmKAbgZ B3nuyIuArZe+IXQ5mMoMXn5qpp7/2mO/JPaqBBrUmxHMx+c+Xx0LC0QUdL7GXzY5 oQ8SahoLrl7Xl4/i9dSuhVD9rDhzuC7ZmykLkYrtquxFC69tH4PRUWak0RXXvHsE mzADdqCwATLUu2FvEudoaCecXHxRmcn47CuALcqdaZF+VVPe8WsjIySmeVDRCixZ k9njCdNiqtoKzb87MJECclYbCdHUVcKMNqaOoBkLaZnJumNFABwrPP3LnMtdaNpy TrjYh3x5/xrdOgmWBML2gK/suEtaN2hgT6KyI38rAwlYQlEppxd94ZbIH0Q0wY+L Unhcn28h56janKYVzyumA0Z5p6fbpxkI2OLEws4HzSqq6Ajpuc7yxDSCbUmE2vXL WIoVAgH6PEr5sUCMH7xxqWejoXDi1KinPPVELKuMTWCiwRFr3CnZZzPXGJX5DXSG nS9HCR35WmXuQx9pqC4/YOk7HBmllnNMHUrFlOYCzAn2qbjsCZ0whNlKe78qvN2z +OYiVRF8KmSNAkP+S47sxeyEEYMi4aKVNe1ur1jVjYmA5keIdmjbnIRjGXfSNzff PdvMqZcGouq4jsz2fqQf =yqg5 -----END PGP SIGNATURE----- Merge tag 'md/4.3-fixes' of git://neil.brown.name/md Pull md fixes from Neil Brown: "Assorted fixes for md in 4.3-rc. Two tagged for -stable, and one is really a cleanup to match and improve kmemcache interface. * tag 'md/4.3-fixes' of git://neil.brown.name/md: md/bitmap: don't pass -1 to bitmap_storage_alloc. md/raid1: Avoid raid1 resync getting stuck md: drop null test before destroy functions md: clear CHANGE_PENDING in readonly array md/raid0: apply base queue limits before disk_stack_limits md/raid5: don't index beyond end of array in need_this_block(). raid5: update analysis state for failed stripe md: wait for pending superblock updates before switching to read-only	2015-10-04 11:47:28 +01:00
Mikulas Patocka	042745ee53	dm raid: fix round up of default region size Commit `3a0f9aaee0` ("dm raid: round region_size to power of two") intended to make sure that the default region size is a power of two. However, the logic in that commit is incorrect and sets the variable region_size to 0 or 1, depending on whether min_region_size is a power of two. Fix this logic, using roundup_pow_of_two(), so that region_size is properly rounded up to the next power of two. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `3a0f9aaee0` ("dm raid: round region_size to power of two") Cc: stable@vger.kernel.org # v3.8+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-02 12:02:31 -04:00
NeilBrown	da6fb7a9e5	md/bitmap: don't pass -1 to bitmap_storage_alloc. Passing -1 to bitmap_storage_alloc() causes page->index to be set to -1, which is quite problematic. So only pass ->cluster_slot if mddev_is_clustered(). Fixes: `b97e92574c` ("Use separate bitmaps for each nodes in the cluster") Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:24:13 +10:00
Jes Sorensen	e8ff8bf09f	md/raid1: Avoid raid1 resync getting stuck close_sync() needs to set conf->next_resync to a large, but safe value below MaxSector and use it to determine whether or not to set start_next_window in wait_barrier() Solution suggested by Neil Brown. Reported-by: Nate Dailey <nate.dailey@stratus.com> Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
Julia Lawall	644df1a85f	md: drop null test before destroy functions Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) $kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy$(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
Shaohua Li	d4929add83	md: clear CHANGE_PENDING in readonly array If faulty disks of an array are more than allowed degraded number, the array enters error handling. It will be marked as read-only with MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't clear CHANGE_PENDING bit for read-only array. If MD_CHANGE_PENDING is set for a raid5 array, all returned IO will be hold on a list till the bit is clear. But recovery nevery clears this bit, the IO is always in pending state and nevery finish. This has bad effects like upper layer can't get an IO error and the array can't be stopped. Fixes: `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns.") Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
NeilBrown	66eefe5de1	md/raid0: apply base queue limits before disk_stack_limits Calling e.g. blk_queue_max_hw_sectors() after calls to disk_stack_limits() discards the settings determined by disk_stack_limits(). So we need to make those calls first. Fixes: `199dc6ed51` ("md/raid0: update queue parameter in a safer location.") Cc: stable@vger.kernel.org (v2.6.35+ - please apply with `199dc6ed51`). Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
NeilBrown	36707bb2e7	md/raid5: don't index beyond end of array in need_this_block(). When need_this_block probably shouldn't be called when there are more than 2 failed devices, we really don't want it to try indexing beyond the end of the failed_num[] of fdev[] arrays. So limit the loops to at most 2 iterations. Reported-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-10-02 17:23:43 +10:00
Shaohua Li	ebda780bce	raid5: update analysis state for failed stripe handle_failed_stripe() makes the stripe fail, eg, all IO will return with a failure, but it doesn't update stripe_head_state. Later handle_stripe() has special handling for raid6 for handle_stripe_fill(). That check before handle_stripe_fill() doesn't skip the failed stripe and we get a kernel crash in need_this_block. This patch clear the analysis state to make sure no functions wrongly called after handle_failed_stripe() Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:43 +10:00
NeilBrown	88724bfa68	md: wait for pending superblock updates before switching to read-only If a superblock update is pending, wait for it to complete before letting md_set_readonly() switch to readonly. Otherwise we might lose important information about a device having failed. For external arrays, waiting for superblock updates can wait on user-space, so in that case, just return an error. Reported-and-tested-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:43 +10:00
Junichi Nomura	2a708cff93	dm: fix AB-BA deadlock in __dm_destroy() __dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and suspend_lock in reverse order. Doing so can cause AB-BA deadlock: __dm_destroy dm_swap_table --------------------------------------------------- mutex_lock(suspend_lock) dm_get_live_table() srcu_read_lock(io_barrier) dm_sync_table() synchronize_srcu(io_barrier) .. waiting for dm_put_live_table() mutex_lock(suspend_lock) .. waiting for suspend_lock Fix this by taking the locks in proper order. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Fixes: `ab7c7bb6f4` ("dm: hold suspend_lock while suspending device during device deletion") Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-01 10:40:20 -04:00
Mike Snitzer	586b286b11	dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE Setting the dm-crypt device's max_segment_size to PAGE_SIZE is an unfortunate constraint that is required to avoid the potential for exceeding dm-crypt's underlying device's max_segments limits -- due to crypt_alloc_buffer() possibly allocating pages for the encryption bio that are not as physically contiguous as the original bio. It is interesting to note that this problem was already fixed back in 2007 via commit `91e106259` ("dm crypt: use bio_add_page"). But Linux 4.0 commit `cf2f1abfb` ("dm crypt: don't allocate pages for a partial request") regressed dm-crypt back to _not_ using bio_add_page(). But given dm-crypt's cpu parallelization changes all depend on commit cf2f1abfb's abandoning of the more complex io fragments processing that dm-crypt previously had we cannot easily go back to using bio_add_page(). So all said the cleanest way to resolve this issue is to fix dm-crypt to properly constrain the original bios entering dm-crypt so the encryption bios that dm-crypt generates from the original bios are always compatible with the underlying device's max_segments queue limits. It should be noted that technically Linux 4.3 does _not_ need this fix because of the block core's new late bio-splitting capability. But, it is reasoned, there is little to be gained by having the block core split the encrypted bio that is composed of PAGE_SIZE segments. That said, in the future we may revert this change. Fixes: `cf2f1abfb` ("dm crypt: don't allocate pages for a partial request") Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=104421 Suggested-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.0+	2015-09-14 12:04:24 -04:00
Mike Snitzer	216076705d	dm thin: disable discard support for thin devices if pool's is disabled If the pool is configured with 'ignore_discard' its discard support is disabled. The pool's thin devices should also have queue_limits that reflect discards are disabled. Fixes: `34fbcf62` ("dm thin: range discard support") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-09-13 21:32:10 -04:00
Linus Torvalds	8e78b7dc93	SCSI misc on 20150911 The major pieces of this patch are a set patches facilitating better integration between scsi and scsi_dh (the device handling layer used by multi-path; all the dm parts are acked by Mike Snitzer). It also includes driver updates for mp3sas, scsi_debug and an assortment of bug fixes. Signed-off-by: James Bottomley <JBottomley@Odin.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABAgAGBQJV8yt5AAoJEDeqqVYsXL0MBsQH+wXvlx3o0BGuz5ZXfIs/RxzI MwGnu1J0LSA9FPakkMUVOBtsxIG+pCV+4eKorQMkfGCKAZ8daaYsyYvSEM2mcqIX 1Y/srEnbzfE94JHbsI2pbiMPkB7QdtW27WjTSjQGgD9igAyVmmITiQJrXbpAlSLF F6n++9avng+GhjXQ5TF8/y13OYgabIoAPM1j4B/ut/Ok8ReruBvMBnOla5w5RMKR rBZKTZfUwvX5S0cuREwj8tFsRVUgdBNSrcGswFJrZo5x9WAsSHLC6+SOLZuUy1vC ua0tNtEiyXiuR0/jSP9qv7hJ/j0BW+EGdnW6GZEzKpeMK5PxfVspOsbNunUDRsY= =Y9G1 -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull second round of SCSI updates from James Bottomley: "There's one late arriving patch here (added today), fixing a build issue which the scsi_dh patch set in here uncovered. Other than that, everything has been incubated in -next and the checkers for a week. The major pieces of this patch are a set patches facilitating better integration between scsi and scsi_dh (the device handling layer used by multi-path; all the dm parts are acked by Mike Snitzer). This also includes driver updates for mp3sas, scsi_debug and an assortment of bug fixes" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (50 commits) scsi_dh: fix randconfig build error scsi: fix scsi_error_handler vs. scsi_host_dev_release race fcoe: Convert use of __constant_htons to htons mpt2sas: setpci reset kernel oops fix pm80xx: Don't override ts->stat on IO_OPEN_CNX_ERROR_HW_RESOURCE_BUSY lpfc: Fix possible use-after-free and double free in lpfc_mbx_cmpl_rdp_page_a2() bfa: Fix incorrect de-reference of pointer bfa: Fix indentation scsi_transport_sas: Remove check for SAS expander when querying bay/enclosure IDs. scsi_debug: resp_request: remove unused variable scsi_debug: fix REPORT LUNS Well Known LU scsi_debug: schedule_resp fix input variable check scsi_debug: make dump_sector static scsi_debug: vfree is null safe so drop the check scsi_debug: use SCSI_W_LUN_REPORT_LUNS instead of SAM2_WLUN_REPORT_LUNS; scsi_debug: define pr_fmt() for consistent logging mpt2sas: Refcount fw_events and fix unsafe list usage mpt2sas: Refcount sas_device objects and fix unsafe list usage scsi_dh: return SCSI_DH_NOTCONN in scsi_dh_activate() scsi_dh: don't allow to detach device handlers at runtime ...	2015-09-11 18:15:18 -07:00
Christoph Hellwig	294ab783ad	scsi_dh: fix randconfig build error It looks like the Kconfig check that was meant to fix this (commit `fe9233fb69` [SCSI] scsi_dh: fix kconfig related build errors) was actually reversed, but no-one noticed until the new set of patches which separated DM and SCSI_DH). Fixes: `fe9233fb69` Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Bottomley <JBottomley@Odin.com>	2015-09-11 11:54:37 -07:00
Linus Torvalds	2a013e37ce	md updates for 4.3 - An assortment of little fixes, several for minor races only likely to be hit during testing - further cluster-md-raid1 development, not ready for real use yet. - new RAID6 syndrome code for ARM NEON - fix a race where a write can return before failure of one device is properly recorded in metadata, so an immediate crash might result in that write being lost. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJV6rSXAAoJEDnsnt1WYoG5eJIQAJs62+kB3+p87/VEu4hiBgYv yyaCBlTDn3xxy3WFLtvSIc+cZOamvGe/u/+9/aTA5zq30VpS0fwZlLUxwyR3vB7H aXh5y0JL8fViCUp6o+SplOpNDMAv4ntcW5NMv7uWhPLxtQxF/IJu6YLsDRcFaJqL LFCpvKSPgXOQ88ZXHa54xgFgEy+aAh1lxaWQmeqCLtgVc6YhwIsazG00R/vow8Pb 91u3jFioWjBpovTJiRxQO+NGemfOnKrm2EWkR4jzo8taHOouBWOH0RZjh/67dh4p QX4GjMINhFvYSr1UMGXfPm+Fjp2PRgx1qKyR/XhPeXNuE2xZf7T4aCnmKA8DVUA4 vyEl/l0lAZClExNA+bgE/wCrMpvtb9E4NnklzIffDqsDY79m9JzLwznYqDQcXP7m 0zPlRmf8KQoSOVV960N2O6siwQMwvTyPecG0raAv9BKjwZ+7/M8HLOplZuuMsbzT BZ6+FnAIDtc0Id0wwJoARUkghG7Nr4IWi4Q8MtyYLgH9KLnYkomjf/I2B5sEooCF JFIXeg+XX/xKSFHV4TycYdAFMtMEJMJ/pEnbKJ/W7CyAmrHJv+0/U+/gOkA8Mg76 iqYVWqRJHP9ZyWpmaWaaOeGIgFoJqrjM65qFNRcOnzMd/aAi8W63oyM99Lxi+1pm i8StqQBNtiwzds/w32SI =n+/k -----END PGP SIGNATURE----- Merge tag 'md/4.3' of git://neil.brown.name/md Pull md updates from Neil Brown: - an assortment of little fixes, several for minor races only likely to be hit during testing - further cluster-md-raid1 development, not ready for real use yet. - new RAID6 syndrome code for ARM NEON - fix a race where a write can return before failure of one device is properly recorded in metadata, so an immediate crash might result in that write being lost. * tag 'md/4.3' of git://neil.brown.name/md: (33 commits) md/raid5: ensure device failure recorded before write request returns. md/raid5: use bio_list for the list of bios to return. md/raid10: ensure device failure recorded before write request returns. md/raid1: ensure device failure recorded before write request returns. md-cluster: remove inappropriate try_module_get from join() md: extend spinlock protection in register_md_cluster_operations md-cluster: Read the disk bitmap sb and check if it needs recovery md-cluster: only call complete(&cinfo->completion) when node join cluster md-cluster: add missed lockres_free md-cluster: remove the unused sb_lock md-cluster: init suspend_list and suspend_lock early in join md-cluster: add the error check if failed to get dlm lock md-cluster: init completion within lockres_init md-cluster: fix deadlock issue on message lock md-cluster: transfer the resync ownership to another node md-cluster: split recover_slot for future code reuse md-cluster: use %pU to print UUIDs md: setup safemode_timer before it's being used md/raid5: handle possible race as reshape completes. md: sync sync_completed has correct value as recovery finishes. ...	2015-09-05 17:52:22 -07:00
NeilBrown	e89c6fdf9e	Merge linux-block/for-4.3/core into md/for-linux There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>	2015-09-05 11:08:32 +02:00
Linus Torvalds	1e1a4e8f43	Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper update from Mike Snitzer: - a couple small cleanups in dm-cache, dm-verity, persistent-data's dm-btree, and DM core. - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio prison cells - a 4.2-stable fix that adds feature reporting for the dm-stats features added in 4.2 - improve DM-snapshot to not invalidate the on-disk snapshot if snapshot device write overflow occurs; but a write overflow triggered through the origin device will still invalidate the snapshot. - optimize DM-thinp's async discard submission a bit now that late bio splitting has been included in block core. - switch DM-cache's SMQ policy lock from using a mutex to a spinlock; improves performance on very low latency devices (eg. NVMe SSD). - document DM RAID 4/5/6's discard support [ I did not pull the slab changes, which weren't appropriate for this tree, and weren't obviously the right thing to do anyway. At the very least they need some discussion and explanation before getting merged. Because not pulling the actual tagged commit but doing a partial pull instead, this merge commit thus also obviously is missing the git signature from the original tag ] * tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: fix use after freeing migrations dm cache: small cleanups related to deferred prison cell cleanup dm cache: fix leaking of deferred bio prison cells dm raid: document RAID 4/5/6 discard support dm stats: report precise_timestamps and histogram in @stats_list output dm thin: optimize async discard submission dm snapshot: don't invalidate on-disk image on snapshot write overflow dm: remove unlikely() before IS_ERR() dm: do not override error code returned from dm_get_device() dm: test return value for DM_MAPIO_SUBMITTED dm verity: remove unused mempool dm cache: move wake_waker() from free_migrations() to where it is needed dm btree remove: remove unused function get_nr_entries() dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling() dm cache policy smq: change the mutex to a spinlock	2015-09-02 16:35:26 -07:00
Linus Torvalds	1081230b74	Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...	2015-09-02 13:10:25 -07:00
Linus Torvalds	089b669506	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk() fixes, Documentation and MAINTAINERS updates)" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) MAINTAINERS: update my e-mail address mod_devicetable: add space before */ scsi: a100u2w: trivial typo in printk i2c: Fix typo in i2c-bfin-twi.c treewide: fix typos in comment blocks Doc: fix trivial typo in SubmittingPatches proportions: Spelling s/consitent/consistent/ dm: Spelling s/consitent/consistent/ aic7xxx: Fix typo in error message pcmcia: Fix typo in locking documentation scsi/arcmsr: Fix typos in error log drm/nouveau/gr: Fix typo in nv10.c [SCSI] Fix printk typos in drivers/scsi staging: comedi: Grammar s/Enable support a/Enable support for a/ Btrfs: Spelling s/consitent/consistent/ README: GTK+ is a acronym ASoC: omap: Fix typo in config option description mm: tlb.c: Fix error message ntfs: super.c: Fix error log fix typo in Documentation/SubmittingPatches ...	2015-09-01 18:46:42 -07:00
Joe Thornber	cc7da0ba9c	dm cache: fix use after freeing migrations Both free_io_migration() and issue_discard() dereference a migration that was just freed. Fix those by saving off the migrations's cache object before freeing the migration. Also cleanup needless mg->cache dereferences now that the cache object is available directly. Fixes: `e44b6a5a3c` ("dm cache: move wake_waker() from free_migrations() to where it is needed") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-09-01 08:56:14 -04:00
Mike Snitzer	dc9cee5db5	dm cache: small cleanups related to deferred prison cell cleanup Eliminate __cell_release() since it only had one caller that always released the cell holder. Switch cell_error_with_code() to using free_prison_cell() for the sake of consistency. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-31 15:50:28 -04:00
Joe Thornber	9153df7405	dm cache: fix leaking of deferred bio prison cells There were two cases where dm_cell_visit_release() was being called, which removes the cell from the prison's rbtree, but the callers didn't also return the cell to the mempool. Fix this by having them call free_prison_cell(). This leak manifested as the 'kmalloc-96' slab growing until OOM. Fixes: `651f5fa2a3` ("dm cache: defer whole cells") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-08-31 15:08:14 -04:00
NeilBrown	c3cce6cda1	md/raid5: ensure device failure recorded before write request returns. When a write to one of the devices of a RAID5/6 fails, the failure is recorded in the metadata of the other devices so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that completed when MD_CHANGE_PENDING is set to only be processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:59 +02:00
NeilBrown	34a6f80e16	md/raid5: use bio_list for the list of bios to return. This will make it easier to splice two lists together which will be needed in future patch. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:50 +02:00
NeilBrown	95af587e95	md/raid10: ensure device failure recorded before write request returns. When a write to one of the legs of a RAID10 fails, the failure is recorded in the metadata of the other legs so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:45 +02:00
NeilBrown	55ce74d4bf	md/raid1: ensure device failure recorded before write request returns. When a write to one of the legs of a RAID1 fails, the failure is recorded in the metadata of the other leg(s) so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:23 +02:00
NeilBrown	18b9f67962	md-cluster: remove inappropriate try_module_get from join() md_setup_cluster already calls try_module_get(), so this try_module_get isn't needed. Also, there is no matching module_put (except in error patch), so this leaves an unbalanced module count. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:17 +02:00
NeilBrown	6022e75bf0	md: extend spinlock protection in register_md_cluster_operations This code looks racy. The only possible race is if two modules try to register at the same time and that won't happen. But make the code look safe anyway. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:59 +02:00
Guoqing Jiang	abb9b22ac9	md-cluster: Read the disk bitmap sb and check if it needs recovery In gather_all_resync_info, we need to read the disk bitmap sb and check if it needs recovery. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:41 +02:00
Guoqing Jiang	eece075cda	md-cluster: only call complete(&cinfo->completion) when node join cluster Introduce MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure complete(&cinfo->completion) is only be invoked when node join cluster. Otherwise node failure could also call the complete, and it doesn't make sense to do it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:31 +02:00
Guoqing Jiang	6e6d9f2cda	md-cluster: add missed lockres_free We also need to free the lock resource before goto out. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:23 +02:00
Guoqing Jiang	b2b9bfff0a	md-cluster: remove the unused sb_lock The sb_lock is not used anywhere, so let's remove it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:14 +02:00
Guoqing Jiang	9e3072e373	md-cluster: init suspend_list and suspend_lock early in join If the node just join the cluster, and receive the msg from other nodes before init suspend_list, it will cause kernel crash due to NULL pointer dereference, so move the initializations early to fix the bug. md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3 BUG: unable to handle kernel NULL pointer dereference at (null) ... ... ... Call Trace: [<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster] [<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster] [<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod] [<ffffffff810768c4>] kthread+0xb4/0xc0 [<ffffffff8151927c>] ret_from_fork+0x7c/0xb0 ... ... ... RIP [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster] Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:05 +02:00
Guoqing Jiang	b5ef56789b	md-cluster: add the error check if failed to get dlm lock In complicated cluster environment, it is possible that the dlm lock couldn't be get/convert on purpose, the related err info is added for better debug potential issue. For lockres_free, if the lock is blocking by a lock request or conversion request, then dlm_unlock just put it back to grant queue, so need to ensure the lock is free finally. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:41:56 +02:00

... 2 3 4 5 6 ...

4183 Commits