Commit Graph

3880 Commits

Author SHA1 Message Date
NeilBrown
c340702ca2 md/raid10: don't clear bitmap bit when bad-block-list write fails.
When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data.  If
this succeeds then it is OK to clear the relevant bitmap-bit as
no further 'sync' of the block is needed.

However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit.  Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced.  This leads to data corruption.

We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.

So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.
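
The reordering described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct wr_model and the model_* functions are hypothetical stand-ins for the r10bio state, close_write() and bio_endio(), not the kernel code itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-write state; not the kernel's r10bio. */
struct wr_model {
    bool write_error;      /* the data write to the device failed */
    bool badblocks_saved;  /* the bad-block-list update reached stable storage */
    bool bitmap_bit_set;   /* block is still marked as needing resync */
    bool completed;        /* the upper layer has been told the write finished */
};

/* Models close_write(): clear the bit only when no resync is needed. */
static void model_close_write(struct wr_model *w)
{
    if (!w->write_error || w->badblocks_saved)
        w->bitmap_bit_set = false;
}

/*
 * Models the end of a write.  The bitmap decision is taken only after the
 * outcome of the bad-block-list write is known, just before completion.
 */
static void model_finish_write(struct wr_model *w, bool badblocks_write_ok)
{
    if (w->write_error)
        w->badblocks_saved = badblocks_write_ok;
    model_close_write(w);   /* moved to just before completion */
    w->completed = true;    /* stands in for bio_endio() */
}
```

In this model a failed write whose bad-block record also fails to reach storage keeps its bitmap bit set, so a later re-add of the device would still resync that block.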

This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.

Backports will require at least
Commit: 95af587e95 ("md/raid10: ensure device failure recorded before write request returns.")
as well.  I'll send that to 'stable' separately.

Note that of the two tests of R10BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed.  However doing it this way makes the
patch more obviously correct.  I will tidy the code up in a
future merge window.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Fixes: bd870a16c5 ("md/raid10:  Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 16:24:23 +11:00
NeilBrown
bd8688a199 md/raid1: don't clear bitmap bit when bad-block-list write fails.
When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data.  If
this succeeds then it is OK to clear the relevant bitmap-bit as
no further 'sync' of the block is needed.

However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit.  Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced.  This leads to data corruption.

We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.

So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.

This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.

Backports will require at least
Commit: 55ce74d4bf ("md/raid1: ensure device failure recorded before write request returns.")
as well.  I'll send that to 'stable' separately.

Note that of the two tests of R1BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed.  However doing it this way makes the
patch more obviously correct.  I will tidy the code up in a
future merge window.

Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
Fixes: cd5ff9a16f ("md/raid1:  Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 16:24:22 +11:00
Jes Sorensen
681ab46960 md/raid10: submit_bio_wait() returns 0 on success
This was introduced with 9e882242c6
which changed the return value of submit_bio_wait() to return != 0 on
error, but didn't update the caller accordingly.
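
A minimal sketch of the return-value convention involved, assuming a hypothetical model_submit_wait() in place of submit_bio_wait(). The exact form of the caller's original check is an assumption here; the point is only that success must be tested as a zero return.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for submit_bio_wait(): 0 on success, nonzero on error. */
static int model_submit_wait(bool device_ok)
{
    return device_ok ? 0 : -EIO;
}

/* Inverted test: treats a nonzero (error) return as success. */
static bool write_succeeded_buggy(bool device_ok)
{
    return model_submit_wait(device_ok) != 0;
}

/* Corrected test: success is exactly a zero return. */
static bool write_succeeded_fixed(bool device_ok)
{
    return model_submit_wait(device_ok) == 0;
}
```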

Fixes: 9e882242c6 ("block: Add submit_bio_wait(), remove from md")
Cc: stable@vger.kernel.org (v3.10)
Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-21 07:24:29 +11:00
Jes Sorensen
203d27b022 md/raid1: submit_bio_wait() returns 0 on success
This was introduced with 9e882242c6
which changed the return value of submit_bio_wait() to return != 0 on
error, but didn't update the caller accordingly.

Fixes: 9e882242c6 ("block: Add submit_bio_wait(), remove from md")
Cc: stable@vger.kernel.org (v3.10)
Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-21 07:20:15 +11:00
Mike Snitzer
ba30670f4d dm thin: fix missing pool reference count decrement in pool_ctr error path
Fixes: ac8c3f3df ("dm thin: generate event when metadata threshold passed")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.10+
2015-10-13 12:20:55 -04:00
Sudip Mukherjee
a2a678ed4d dm snapshot persistent: fix missing cleanup in persistent_ctr error path
If an unsupported option is given then the early return from
persistent_ctr() leaked memory allocated for the 'pstore' and never
destroyed the 'metadata_wq'.
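
The usual shape of such an error path is a chain of cleanup labels. The sketch below is a userspace illustration only: struct model_store, model_ctr() and the "overflow" option string are hypothetical names, and the allocation counter exists purely to make leaks visible.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a constructor that acquires two resources. */
struct model_store {
    void *area;       /* stands in for the 'pstore' allocation */
    void *workqueue;  /* stands in for 'metadata_wq' */
};

static int live_allocations;  /* outstanding allocations, for the demo only */

static void *model_alloc(void)
{
    void *p = malloc(16);
    if (p)
        live_allocations++;
    return p;
}

static void model_free(void *p)
{
    if (p) {
        free(p);
        live_allocations--;
    }
}

/* Returns 0 on success, -1 on failure; every failure path undoes what was acquired. */
static int model_ctr(struct model_store *s, const char *option)
{
    s->area = model_alloc();
    if (!s->area)
        return -1;
    s->workqueue = model_alloc();
    if (!s->workqueue)
        goto err_area;
    if (option && strcmp(option, "overflow") != 0)  /* unsupported option */
        goto err_wq;
    return 0;

err_wq:
    model_free(s->workqueue);
    s->workqueue = NULL;
err_area:
    model_free(s->area);
    s->area = NULL;
    return -1;
}
```

An early "return -1" at the option check, instead of "goto err_wq", would reproduce the kind of leak the commit message describes.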

Fixes: b0d3cc011e ("dm snapshot: add new persistent store option to support overflow")
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-13 12:20:54 -04:00
Linus Torvalds
f24fe98df8 One bug fix for raid1/raid10.
Very careless bug earlier in 4.3-rc, now fixed :-)

Merge tag 'md/4.3-rc4-fix' of git://neil.brown.name/md

Pull md bugfix from Neil Brown:
 "One bug fix for raid1/raid10.

  Very careless bug earlier in 4.3-rc, now fixed :-)"

* tag 'md/4.3-rc4-fix' of git://neil.brown.name/md:
  crash in md-raid1 and md-raid10 due to incorrect list manipulation
2015-10-11 09:35:51 -07:00
Linus Torvalds
0444555670 3 stable@ fixes:
- DM core AB-BA deadlock fix in the device destruction path (vs device
   creation's DM table swap).
 
 - DM raid fix to properly round up the region_size to the next
   power-of-2.
 
 - DM cache fix for a NULL pointer seen while switching from the
   "cleaner" cache policy.
 
 2 fixes for regressions introduced during the 4.3 merge:
 
 - request-based DM error propagation regressed due to incorrect
   changes introduced when adding the bi_error field to bio.
 
 - DM snapshot fix to only support snapshots that overflow if the client
   (e.g. lvm2) is prepared to deal with the associated snapshot status
   interface change.

Merge tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull dm fixes from Mike Snitzer:
 "Three stable fixes:

   - DM core AB-BA deadlock fix in the device destruction path (vs
     device creation's DM table swap).

   - DM raid fix to properly round up the region_size to the next
     power-of-2.

   - DM cache fix for a NULL pointer seen while switching from the
     "cleaner" cache policy.

  Two fixes for regressions introduced during the 4.3 merge:

   - request-based DM error propagation regressed due to incorrect
     changes introduced when adding the bi_error field to bio.

   - DM snapshot fix to only support snapshots that overflow if the
     client (e.g. lvm2) is prepared to deal with the associated
     snapshot status interface change"

* tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm snapshot: add new persistent store option to support overflow
  dm cache: fix NULL pointer when switching from cleaner policy
  dm: fix request-based dm error reporting
  dm raid: fix round up of default region size
  dm: fix AB-BA deadlock in __dm_destroy()
2015-10-09 16:58:11 -07:00
Mike Snitzer
b0d3cc011e dm snapshot: add new persistent store option to support overflow
Commit 76c44f6d80 introduced the possibility for "Overflow" to be reported
by the snapshot device's status.  Older userspace (e.g. lvm2) does not
handle the "Overflow" status response.

Fix this incompatibility by requiring newer userspace code, that can
cope with "Overflow", request the persistent store with overflow support
by using "PO" (Persistent with Overflow) for the snapshot store type.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Fixes: 76c44f6d80 ("dm snapshot: don't invalidate on-disk image on snapshot write overflow")
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-09 16:57:03 -04:00
Joe Thornber
2bffa1503c dm cache: fix NULL pointer when switching from cleaner policy
The cleaner policy doesn't make use of the per cache block hint space in
the metadata (unlike the other policies).  When switching from the
cleaner policy to mq or smq a NULL pointer crash (in dm_tm_new_block)
was observed.  The crash was caused by bugs in dm-cache-metadata.c
when trying to skip creation of the hint btree.

The minimal fix is to change hint size for the cleaner policy to 4 bytes
(only hint size supported).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-09 09:16:29 -04:00
Mikulas Patocka
a452744bcb crash in md-raid1 and md-raid10 due to incorrect list manipulation
Commit 55ce74d4bf ("md/raid1: ensure device failure recorded before
write request returns.") is causing a crash in the LVM2 testsuite test
shell/lvchange-raid.sh.  For me the crash is 100% reproducible.

The reason for the crash is that the newly added code in raid1d moves the
list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and
then incorrectly pops the bio from conf->bio_end_io_list (which is empty
because the list was already moved).

Raid-10 has a similar bug.
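
The pattern at issue can be shown with a tiny singly linked list. This is a sketch under stated assumptions, not the raid1d code: struct node, struct list and drain() are hypothetical, and the kernel uses its own list helpers rather than these.

```c
#include <stddef.h>

struct node {
    int id;
    struct node *next;
};

struct list {
    struct node *head;
};

/* Move every entry from src to dst, leaving src empty (like a list splice). */
static void list_move_all(struct list *dst, struct list *src)
{
    dst->head = src->head;
    src->head = NULL;
}

/* Pop the first entry, or return NULL if the list is empty. */
static struct node *list_pop(struct list *l)
{
    struct node *n = l->head;

    if (n)
        l->head = n->next;
    return n;
}

/* Drain pending entries: splice to a private list, then pop from that list. */
static int drain(struct list *pending)
{
    struct list tmp = { NULL };
    int handled = 0;

    list_move_all(&tmp, pending);
    while (tmp.head) {
        /* Popping from 'pending' here would yield NULL: it was just emptied. */
        struct node *n = list_pop(&tmp);

        n->next = NULL;
        handled++;
    }
    return handled;
}
```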

Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001001111111000001111 Not tainted
r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000

IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
 IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
 CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
 ORIG_R28: 000000001059d000
 IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
 IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
 RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
Backtrace:
 [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
 [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
 [<000000004017fd5c>] kthread+0x144/0x160

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 55ce74d4bf ("md/raid1: ensure device failure recorded before write request returns.")
Fixes: 95af587e95 ("md/raid10: ensure device failure recorded before write request returns.")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-09 08:33:46 +11:00
Junichi Nomura
50887bd139 dm: fix request-based dm error reporting
end_clone_bio() is an endio callback for the clone bio and should check
and save the clone's bi_error for error reporting.  However,
4246a0b63b ("block: add a bi_error field to struct bio") changed
the function to check the original bio's bi_error, which is 0.

Without this fix, the clone's error is ignored and success is reported
to the original request.  Thus data corruption will be observed.
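
A minimal model of the distinction, assuming hypothetical struct cl_bio and struct cl_info types in place of struct bio and the per-clone bookkeeping:

```c
/* Hypothetical stand-in for struct bio; 'error' plays the role of bi_error. */
struct cl_bio {
    int error;
};

/* Hypothetical per-clone bookkeeping. */
struct cl_info {
    struct cl_bio *orig;  /* the original bio this clone was made from */
    int saved_error;      /* error to report for the original request */
};

/* Completion callback for the clone: record the clone's own error status. */
static void model_end_clone(struct cl_bio *clone, struct cl_info *info)
{
    /* Reading info->orig->error here would see 0 and lose the failure. */
    if (clone->error && !info->saved_error)
        info->saved_error = clone->error;
}
```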

Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-06 10:08:16 -04:00
Linus Torvalds
15ecf9a986 Assorted fixes for md in 4.3-rc
Two tagged for -stable
 One is really a cleanup to match and improve kmemcache interface.

Merge tag 'md/4.3-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Assorted fixes for md in 4.3-rc.

  Two tagged for -stable, and one is really a cleanup to match and
  improve kmemcache interface."

* tag 'md/4.3-fixes' of git://neil.brown.name/md:
  md/bitmap: don't pass -1 to bitmap_storage_alloc.
  md/raid1: Avoid raid1 resync getting stuck
  md: drop null test before destroy functions
  md: clear CHANGE_PENDING in readonly array
  md/raid0: apply base queue limits *before* disk_stack_limits
  md/raid5: don't index beyond end of array in need_this_block().
  raid5: update analysis state for failed stripe
  md: wait for pending superblock updates before switching to read-only
2015-10-04 11:47:28 +01:00
Mikulas Patocka
042745ee53 dm raid: fix round up of default region size
Commit 3a0f9aaee0 ("dm raid: round region_size to power of two")
intended to make sure that the default region size is a power of two.
However, the logic in that commit is incorrect and sets the variable
region_size to 0 or 1, depending on whether min_region_size is a power
of two.

Fix this logic, using roundup_pow_of_two(), so that region_size is
properly rounded up to the next power of two.
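
For illustration, rounding up to a power of two can be sketched in userspace as below. round_up_pow2() and is_pow2() are hypothetical helpers written for this example; they are not the kernel's roundup_pow_of_two() macro, which is implemented differently.

```c
#include <stdint.h>

/* True if n is a power of two. */
static int is_pow2(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Smallest power of two >= n.  Valid for 1 <= n <= 2^63 only. */
static uint64_t round_up_pow2(uint64_t n)
{
    uint64_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}
```

A value that is already a power of two is returned unchanged, and any other value moves up to the next one, so the result is never 0 or 1 for inputs above 1.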

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 3a0f9aaee0 ("dm raid: round region_size to power of two")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-02 12:02:31 -04:00
NeilBrown
da6fb7a9e5 md/bitmap: don't pass -1 to bitmap_storage_alloc.
Passing -1 to bitmap_storage_alloc() causes page->index to be set to
-1, which is quite problematic.

So only pass ->cluster_slot if mddev_is_clustered().

Fixes: b97e92574c ("Use separate bitmaps for each nodes in the cluster")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:24:13 +10:00
Jes Sorensen
e8ff8bf09f md/raid1: Avoid raid1 resync getting stuck
close_sync() needs to set conf->next_resync to a large, but safe value
below MaxSector and use it to determine whether or not to set
start_next_window in wait_barrier()

Solution suggested by Neil Brown.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
Julia Lawall
644df1a85f md: drop null test before destroy functions
Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
Shaohua Li
d4929add83 md: clear CHANGE_PENDING in readonly array
If an array has more faulty disks than its redundancy allows, the
array enters error handling.  It will be marked as read-only with
MD_CHANGE_PENDING/RECOVERY_NEEDED set.  But currently recovery doesn't
clear the CHANGE_PENDING bit for a read-only array.  If MD_CHANGE_PENDING
is set for a raid5 array, all returned IO is held on a list until the
bit is cleared.  Because recovery never clears this bit, the IO stays
pending and never finishes.  This has bad effects: the upper layer
can't get an IO error and the array can't be stopped.

Fixes: c3cce6cda1 ("md/raid5: ensure device failure recorded before write request returns.")
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
NeilBrown
66eefe5de1 md/raid0: apply base queue limits *before* disk_stack_limits
Calling e.g. blk_queue_max_hw_sectors() after calls to
disk_stack_limits() discards the settings determined by
disk_stack_limits().
So we need to make those calls first.

Fixes: 199dc6ed51 ("md/raid0: update queue parameter in a safer location.")
Cc: stable@vger.kernel.org (v2.6.35+ - please apply with 199dc6ed51).
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
NeilBrown
36707bb2e7 md/raid5: don't index beyond end of array in need_this_block().
While need_this_block() probably shouldn't be called when there
are more than 2 failed devices, we really don't want it to try
indexing beyond the end of the failed_num[] or fdev[] arrays.

So limit the loops to at most 2 iterations.
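
A sketch of a loop bounded this way, assuming a hypothetical struct model_state that keeps only the first two failed device indices while its failure count may be larger:

```c
#define MAX_TRACKED_FAILURES 2  /* size of the failed_num[] model below */

/* Hypothetical stand-in for the per-stripe analysis state. */
struct model_state {
    int failed;                            /* total failed devices; may exceed 2 */
    int failed_num[MAX_TRACKED_FAILURES];  /* indices of the first two failures */
};

/* Returns 1 if dev is one of the tracked failures, never reading past the array. */
static int is_tracked_failure(const struct model_state *s, int dev)
{
    int i;

    for (i = 0; i < s->failed && i < MAX_TRACKED_FAILURES; i++)
        if (s->failed_num[i] == dev)
            return 1;
    return 0;
}
```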

Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-10-02 17:23:43 +10:00
Shaohua Li
ebda780bce raid5: update analysis state for failed stripe
handle_failed_stripe() makes the stripe fail, i.e. all IO will return
with a failure, but it doesn't update stripe_head_state.  Later
handle_stripe() has special handling for raid6 for handle_stripe_fill().
That check before handle_stripe_fill() doesn't skip the failed stripe
and we get a kernel crash in need_this_block().  This patch clears the
analysis state to make sure no functions are wrongly called after
handle_failed_stripe().

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:43 +10:00
NeilBrown
88724bfa68 md: wait for pending superblock updates before switching to read-only
If a superblock update is pending, wait for it to complete before
letting md_set_readonly() switch to readonly.
Otherwise we might lose important information about a device having
failed.

For external arrays, waiting for superblock updates can wait on
user-space, so in that case, just return an error.

Reported-and-tested-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:43 +10:00
Junichi Nomura
2a708cff93 dm: fix AB-BA deadlock in __dm_destroy()
__dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and
suspend_lock in reverse order.  Doing so can cause AB-BA deadlock:

  __dm_destroy                    dm_swap_table
  ---------------------------------------------------
                                  mutex_lock(suspend_lock)
  dm_get_live_table()
    srcu_read_lock(io_barrier)
                                  dm_sync_table()
                                    synchronize_srcu(io_barrier)
                                      .. waiting for dm_put_live_table()
  mutex_lock(suspend_lock)
    .. waiting for suspend_lock

Fix this by taking the locks in proper order.
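
The ordering rule can be modelled without real threads by a simple lock-rank checker. This is only a sketch: the ranks, struct lock_tracker and the two destroy_* functions are hypothetical, and the real io_barrier is an SRCU read-side section rather than a ranked lock.

```c
/* Agreed order in this model: suspend_lock first, then io_barrier. */
enum {
    RANK_SUSPEND_LOCK = 1,
    RANK_IO_BARRIER = 2
};

#define MAX_HELD 4

struct lock_tracker {
    int held[MAX_HELD];  /* ranks currently held, in acquisition order */
    int depth;
    int violations;      /* acquisitions that broke the agreed order */
};

static void track_acquire(struct lock_tracker *t, int rank)
{
    if (t->depth > 0 && t->held[t->depth - 1] >= rank)
        t->violations++;  /* a higher-ranked lock is already held */
    if (t->depth < MAX_HELD)
        t->held[t->depth++] = rank;
}

static void track_release(struct lock_tracker *t)
{
    if (t->depth > 0)
        t->depth--;
}

/* Order described as deadlock-prone: io_barrier, then suspend_lock. */
static int destroy_old_order(void)
{
    struct lock_tracker t = { { 0 }, 0, 0 };

    track_acquire(&t, RANK_IO_BARRIER);
    track_acquire(&t, RANK_SUSPEND_LOCK);
    track_release(&t);
    track_release(&t);
    return t.violations;
}

/* Same order as the table-swap path: suspend_lock, then io_barrier. */
static int destroy_new_order(void)
{
    struct lock_tracker t = { { 0 }, 0, 0 };

    track_acquire(&t, RANK_SUSPEND_LOCK);
    track_acquire(&t, RANK_IO_BARRIER);
    track_release(&t);
    track_release(&t);
    return t.violations;
}
```

The checker reports one violation for the old order and none for the new one. It only records ordering; it does not reproduce the deadlock itself.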

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Fixes: ab7c7bb6f4 ("dm: hold suspend_lock while suspending device during device deletion")
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-01 10:40:20 -04:00
Mike Snitzer
586b286b11 dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE
Setting the dm-crypt device's max_segment_size to PAGE_SIZE is an
unfortunate constraint that is required to avoid the potential for
exceeding dm-crypt's underlying device's max_segments limits -- due to
crypt_alloc_buffer() possibly allocating pages for the encryption bio
that are not as physically contiguous as the original bio.

It is interesting to note that this problem was already fixed back in
2007 via commit 91e106259 ("dm crypt: use bio_add_page").  But Linux 4.0
commit cf2f1abfb ("dm crypt: don't allocate pages for a partial
request") regressed dm-crypt back to _not_ using bio_add_page().  But
given that dm-crypt's cpu parallelization changes all depend on commit
cf2f1abfb's abandoning of the more complex io fragments processing that
dm-crypt previously had, we cannot easily go back to using
bio_add_page().

So all said the cleanest way to resolve this issue is to fix dm-crypt to
properly constrain the original bios entering dm-crypt so the encryption
bios that dm-crypt generates from the original bios are always
compatible with the underlying device's max_segments queue limits.

It should be noted that technically Linux 4.3 does _not_ need this fix
because of the block core's new late bio-splitting capability.  But, it
is reasoned, there is little to be gained by having the block core split
the encrypted bio that is composed of PAGE_SIZE segments.  That said, in
the future we may revert this change.

Fixes: cf2f1abfb ("dm crypt: don't allocate pages for a partial request")
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=104421
Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
2015-09-14 12:04:24 -04:00
Mike Snitzer
216076705d dm thin: disable discard support for thin devices if pool's is disabled
If the pool is configured with 'ignore_discard' its discard support is
disabled.  The pool's thin devices should also have queue_limits that
reflect discards are disabled.

Fixes: 34fbcf62 ("dm thin: range discard support")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-09-13 21:32:10 -04:00
Linus Torvalds
8e78b7dc93 SCSI misc on 20150911
The major pieces of this patch are a set of patches facilitating better
 integration between scsi and scsi_dh (the device handling layer used by
 multi-path; all the dm parts are acked by Mike Snitzer).  It also includes
 driver updates for mpt3sas, scsi_debug and an assortment of bug fixes.
 
 Signed-off-by: James Bottomley <JBottomley@Odin.com>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull second round of SCSI updates from James Bottomley:
 "There's one late arriving patch here (added today), fixing a build
  issue which the scsi_dh patch set in here uncovered.  Other than that,
  everything has been incubated in -next and the checkers for a week.

  The major pieces of this patch are a set of patches facilitating better
  integration between scsi and scsi_dh (the device handling layer used
  by multi-path; all the dm parts are acked by Mike Snitzer).

  This also includes driver updates for mpt3sas, scsi_debug and an
  assortment of bug fixes"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (50 commits)
  scsi_dh: fix randconfig build error
  scsi: fix scsi_error_handler vs. scsi_host_dev_release race
  fcoe: Convert use of __constant_htons to htons
  mpt2sas: setpci reset kernel oops fix
  pm80xx: Don't override ts->stat on IO_OPEN_CNX_ERROR_HW_RESOURCE_BUSY
  lpfc: Fix possible use-after-free and double free in lpfc_mbx_cmpl_rdp_page_a2()
  bfa: Fix incorrect de-reference of pointer
  bfa: Fix indentation
  scsi_transport_sas: Remove check for SAS expander when querying bay/enclosure IDs.
  scsi_debug: resp_request: remove unused variable
  scsi_debug: fix REPORT LUNS Well Known LU
  scsi_debug: schedule_resp fix input variable check
  scsi_debug: make dump_sector static
  scsi_debug: vfree is null safe so drop the check
  scsi_debug: use SCSI_W_LUN_REPORT_LUNS instead of SAM2_WLUN_REPORT_LUNS;
  scsi_debug: define pr_fmt() for consistent logging
  mpt2sas: Refcount fw_events and fix unsafe list usage
  mpt2sas: Refcount sas_device objects and fix unsafe list usage
  scsi_dh: return SCSI_DH_NOTCONN in scsi_dh_activate()
  scsi_dh: don't allow to detach device handlers at runtime
  ...
2015-09-11 18:15:18 -07:00
Christoph Hellwig
294ab783ad scsi_dh: fix randconfig build error
It looks like the Kconfig check that was meant to fix this (commit
fe9233fb69 "[SCSI] scsi_dh: fix kconfig related build errors") was
actually reversed, but no-one noticed until the new set of patches
which separated DM and SCSI_DH.

Fixes: fe9233fb69
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
2015-09-11 11:54:37 -07:00
Linus Torvalds
2a013e37ce md updates for 4.3
- An assortment of little fixes, several for minor races only likely
   to be hit during testing
 - further cluster-md-raid1 development, not ready for real use yet.
 - new RAID6 syndrome code for ARM NEON
 - fix a race where a write can return before failure of one device
   is properly recorded in metadata, so an immediate crash might result
   in that write being lost.

Merge tag 'md/4.3' of git://neil.brown.name/md

Pull md updates from Neil Brown:

 - an assortment of little fixes, several for minor races only likely to
   be hit during testing

 - further cluster-md-raid1 development, not ready for real use yet.

 - new RAID6 syndrome code for ARM NEON

 - fix a race where a write can return before failure of one device is
   properly recorded in metadata, so an immediate crash might result in
   that write being lost.

* tag 'md/4.3' of git://neil.brown.name/md: (33 commits)
  md/raid5: ensure device failure recorded before write request returns.
  md/raid5: use bio_list for the list of bios to return.
  md/raid10: ensure device failure recorded before write request returns.
  md/raid1: ensure device failure recorded before write request returns.
  md-cluster: remove inappropriate try_module_get from join()
  md: extend spinlock protection in register_md_cluster_operations
  md-cluster: Read the disk bitmap sb and check if it needs recovery
  md-cluster: only call complete(&cinfo->completion) when node join cluster
  md-cluster: add missed lockres_free
  md-cluster: remove the unused sb_lock
  md-cluster: init suspend_list and suspend_lock early in join
  md-cluster: add the error check if failed to get dlm lock
  md-cluster: init completion within lockres_init
  md-cluster: fix deadlock issue on message lock
  md-cluster: transfer the resync ownership to another node
  md-cluster: split recover_slot for future code reuse
  md-cluster: use %pU to print UUIDs
  md: setup safemode_timer before it's being used
  md/raid5: handle possible race as reshape completes.
  md: sync sync_completed has correct value as recovery finishes.
  ...
2015-09-05 17:52:22 -07:00
NeilBrown
e89c6fdf9e Merge linux-block/for-4.3/core into md/for-linux
There were a few conflicts that are fairly easy to resolve.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-09-05 11:08:32 +02:00
Linus Torvalds
1e1a4e8f43 Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper update from Mike Snitzer:

 - a couple small cleanups in dm-cache, dm-verity, persistent-data's
   dm-btree, and DM core.

 - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio
   prison cells

 - a 4.2-stable fix that adds feature reporting for the dm-stats
   features added in 4.2

 - improve DM-snapshot to not invalidate the on-disk snapshot if
   snapshot device write overflow occurs; but a write overflow triggered
   through the origin device will still invalidate the snapshot.

 - optimize DM-thinp's async discard submission a bit now that late bio
   splitting has been included in block core.

 - switch DM-cache's SMQ policy lock from using a mutex to a spinlock;
   improves performance on very low latency devices (eg. NVMe SSD).

 - document DM RAID 4/5/6's discard support

[ I did not pull the slab changes, which weren't appropriate for this
  tree, and weren't obviously the right thing to do anyway.  At the very
  least they need some discussion and explanation before getting merged.

  Because not pulling the actual tagged commit but doing a partial pull
  instead, this merge commit thus also obviously is missing the git
  signature from the original tag ]

* tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: fix use after freeing migrations
  dm cache: small cleanups related to deferred prison cell cleanup
  dm cache: fix leaking of deferred bio prison cells
  dm raid: document RAID 4/5/6 discard support
  dm stats: report precise_timestamps and histogram in @stats_list output
  dm thin: optimize async discard submission
  dm snapshot: don't invalidate on-disk image on snapshot write overflow
  dm: remove unlikely() before IS_ERR()
  dm: do not override error code returned from dm_get_device()
  dm: test return value for DM_MAPIO_SUBMITTED
  dm verity: remove unused mempool
  dm cache: move wake_waker() from free_migrations() to where it is needed
  dm btree remove: remove unused function get_nr_entries()
  dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling()
  dm cache policy smq: change the mutex to a spinlock
2015-09-02 16:35:26 -07:00
Linus Torvalds
1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SG gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kent's above-mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
Linus Torvalds
089b669506 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk()
  fixes, Documentation and MAINTAINERS updates)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  MAINTAINERS: update my e-mail address
  mod_devicetable: add space before */
  scsi: a100u2w: trivial typo in printk
  i2c: Fix typo in i2c-bfin-twi.c
  treewide: fix typos in comment blocks
  Doc: fix trivial typo in SubmittingPatches
  proportions: Spelling s/consitent/consistent/
  dm: Spelling s/consitent/consistent/
  aic7xxx: Fix typo in error message
  pcmcia: Fix typo in locking documentation
  scsi/arcmsr: Fix typos in error log
  drm/nouveau/gr: Fix typo in nv10.c
  [SCSI] Fix printk typos in drivers/scsi
  staging: comedi: Grammar s/Enable support a/Enable support for a/
  Btrfs: Spelling s/consitent/consistent/
  README: GTK+ is a acronym
  ASoC: omap: Fix typo in config option description
  mm: tlb.c: Fix error message
  ntfs: super.c: Fix error log
  fix typo in Documentation/SubmittingPatches
  ...
2015-09-01 18:46:42 -07:00
Joe Thornber
cc7da0ba9c dm cache: fix use after freeing migrations
Both free_io_migration() and issue_discard() dereference a migration
that was just freed.  Fix those by saving off the migration's cache
object before freeing the migration.  Also clean up needless mg->cache
dereferences now that the cache object is available directly.
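
The pattern described above can be sketched outside the kernel. This is a minimal illustration, not the dm-cache code: the struct layouts and the nr_io_migrations counter are hypothetical stand-ins, and only the ordering (copy the cache pointer out, then free the migration) reflects the fix described in this message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the dm-cache structures. */
struct cache {
	int nr_io_migrations;	/* in-flight I/O migrations */
};

struct migration {
	struct cache *cache;	/* owning cache object */
};

static void free_migration(struct migration *mg)
{
	free(mg);		/* mg must not be touched after this point */
}

/*
 * Buggy shape: free_migration(mg) followed by a read of mg->cache,
 * which dereferences freed memory.
 * Fixed shape: save the cache pointer first, then free the migration.
 */
static void free_io_migration(struct migration *mg)
{
	struct cache *cache;

	assert(mg != NULL);
	cache = mg->cache;	/* saved before mg is freed */
	free_migration(mg);
	cache->nr_io_migrations--;
}
```

Once the pointer is held in a local variable, later uses no longer need to go through mg at all, which is the cleanup the last sentence of the message refers to.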

Fixes: e44b6a5a3c ("dm cache: move wake_waker() from free_migrations() to where it is needed")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-09-01 08:56:14 -04:00
Mike Snitzer
dc9cee5db5 dm cache: small cleanups related to deferred prison cell cleanup
Eliminate __cell_release() since it only had one caller that always
released the cell holder.

Switch cell_error_with_code() to using free_prison_cell() for the sake
of consistency.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-31 15:50:28 -04:00
Joe Thornber
9153df7405 dm cache: fix leaking of deferred bio prison cells
There were two cases where dm_cell_visit_release() was being called,
which removes the cell from the prison's rbtree, but the callers didn't
also return the cell to the mempool.  Fix this by having them call
free_prison_cell().

This leak manifested as the 'kmalloc-96' slab growing until OOM.

Fixes: 651f5fa2a3 ("dm cache: defer whole cells")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-08-31 15:08:14 -04:00
NeilBrown
c3cce6cda1 md/raid5: ensure device failure recorded before write request returns.
When a write to one of the devices of a RAID5/6 fails, the failure is
recorded in the metadata of the other devices so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that completed when MD_CHANGE_PENDING is set to
   only be processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.
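
The three steps above can be sketched as a small user-space model. Everything here is an assumption made for illustration: struct request, struct array_state and the function names are hypothetical, and the change_pending flag only stands in for MD_CHANGE_PENDING. The real code works on bios and the md superblock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified request and array state; not the md structures. */
struct request {
	struct request *next;
	bool completed;			/* set once the caller is told "done" */
};

struct array_state {
	bool change_pending;		/* stands in for MD_CHANGE_PENDING */
	struct request *deferred;	/* requests waiting for the metadata write */
};

/* A device failed: a metadata update is requested and is now outstanding. */
static void record_device_failure(struct array_state *st)
{
	st->change_pending = true;
}

/* Finish a request, or park it while a metadata update is still in flight. */
static void end_request(struct array_state *st, struct request *rq)
{
	assert(rq != NULL);
	if (st->change_pending) {
		rq->next = st->deferred;
		st->deferred = rq;
		return;
	}
	rq->completed = true;
}

/* The metadata update is safe on disk: release every parked request. */
static void metadata_update_done(struct array_state *st)
{
	st->change_pending = false;
	while (st->deferred) {
		struct request *rq = st->deferred;

		st->deferred = rq->next;
		rq->next = NULL;
		rq->completed = true;
	}
}
```

In this model a write that finishes while the flag is set is never reported as complete until metadata_update_done() runs, so a crash in between cannot leave an acknowledged write without the matching failure record.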


Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:59 +02:00
NeilBrown
34a6f80e16 md/raid5: use bio_list for the list of bios to return.
This will make it easier to splice two lists together which will
be needed in a future patch.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:50 +02:00
NeilBrown
95af587e95 md/raid10: ensure device failure recorded before write request returns.
When a write to one of the legs of a RAID10 fails, the failure is
recorded in the metadata of the other legs so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:45 +02:00
NeilBrown
55ce74d4bf md/raid1: ensure device failure recorded before write request returns.
When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:23 +02:00
NeilBrown
18b9f67962 md-cluster: remove inappropriate try_module_get from join()
md_setup_cluster already calls try_module_get(), so this
try_module_get isn't needed.
Also, there is no matching module_put (except in error path),
so this leaves an unbalanced module count.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:17 +02:00
NeilBrown
6022e75bf0 md: extend spinlock protection in register_md_cluster_operations
This code looks racy.

The only possible race is if two modules try to register at the same
time and that won't happen.  But make the code look safe anyway.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:59 +02:00
Guoqing Jiang
abb9b22ac9 md-cluster: Read the disk bitmap sb and check if it needs recovery
In gather_all_resync_info, we need to read the disk bitmap sb and
check if it needs recovery.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:41 +02:00
Guoqing Jiang
eece075cda md-cluster: only call complete(&cinfo->completion) when node join cluster
Introduce MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure
complete(&cinfo->completion) is only invoked when a node
joins the cluster. Otherwise a node failure could also call
complete(), which makes no sense.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:31 +02:00
Guoqing Jiang
6e6d9f2cda md-cluster: add missed lockres_free
We also need to free the lock resource before goto out.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:23 +02:00
Guoqing Jiang
b2b9bfff0a md-cluster: remove the unused sb_lock
The sb_lock is not used anywhere, so let's remove it.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:14 +02:00
Guoqing Jiang
9e3072e373 md-cluster: init suspend_list and suspend_lock early in join
If a node has just joined the cluster and receives a message from other
nodes before suspend_list is initialized, the kernel crashes with a NULL
pointer dereference, so move the initializations earlier to fix the bug.

md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3
BUG: unable to handle kernel NULL pointer dereference at           (null)
... ... ...
Call Trace:
[<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster]
[<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster]
[<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod]
[<ffffffff810768c4>] kthread+0xb4/0xc0
[<ffffffff8151927c>] ret_from_fork+0x7c/0xb0
... ... ...
RIP  [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster]

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:05 +02:00
Guoqing Jiang
b5ef56789b md-cluster: add the error check if failed to get dlm lock
In a complicated cluster environment, it is possible that a dlm lock
cannot be acquired or converted as intended, so the related error
info is added to help debug potential issues.

For lockres_free, if the lock is blocked by a lock request or
conversion request, then dlm_unlock just puts it back on the grant
queue, so we need to ensure the lock is finally freed.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:56 +02:00
Guoqing Jiang
b83d51c078 md-cluster: init completion within lockres_init
We should init completion within lockres_init, otherwise
completion could be initialized more than once during
its life cycle.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:50 +02:00
Guoqing Jiang
66099bb0ee md-cluster: fix deadlock issue on message lock
There is a problem with the previous communication mechanism: we hit the
deadlock scenario below on a cluster with 3 nodes.

    Sender                      Receiver                  Receiver

    token(EX)
    message(EX)
    writes message
    downconverts message(CR)
    requests ack(EX)
                                gets message(CR)          gets message(CR)
                                reads message             reads message
                                requests EX on message    requests EX on message

To fix this problem, we make the following changes:

1. the sender downconverts MESSAGE to CW rather than CR.
2. the receiver requests a PR lock, not an EX lock, on message.

And in case we fail to down-convert EX to CW on message, it is better to
unlock message rather than keep holding the lock.
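
Why the mode change removes the deadlock can be checked against the standard DLM lock-mode compatibility matrix. The sketch below is only an illustration under that assumption: the enum, the table name and can_grant() are hypothetical helpers, not md-cluster or DLM API, and the exact lock sequence used by md-cluster is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock modes of the classic DLM model, weakest to strongest. */
enum lock_mode { NL, CR, CW, PR, PW, EX, NR_MODES };

/* compat[held][requested]: may `requested` be granted while `held` is granted? */
static const bool compat[NR_MODES][NR_MODES] = {
	/*          NL     CR     CW     PR     PW     EX   */
	/* NL */ { true,  true,  true,  true,  true,  true  },
	/* CR */ { true,  true,  true,  true,  true,  false },
	/* CW */ { true,  true,  true,  false, false, false },
	/* PR */ { true,  true,  false, true,  false, false },
	/* PW */ { true,  true,  false, false, false, false },
	/* EX */ { true,  false, false, false, false, false },
};

/* A request is grantable only if it is compatible with every other holder. */
static bool can_grant(enum lock_mode requested,
		      const enum lock_mode *held_by_others, int n)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (!compat[held_by_others[i]][requested])
			return false;
	return true;
}
```

With two receivers each holding CR and each asking for EX, can_grant(EX, ...) is false for both because EX conflicts with the other holder's CR, so neither request can ever proceed. PR requests are compatible with each other, so several receivers can hold PR at once. Under this matrix PR conflicts with CW, so a receiver's PR request waits until the sender's CW is released or converted.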

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Lidong Zhong <ldzhong@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:41 +02:00
Guoqing Jiang
dc737d7c3d md-cluster: transfer the resync ownership to another node
When node A stops an array while the array is doing a resync, we need
to let another node B take over the resync task.

To achieve this, node A needs to send an explicit BITMAP_NEEDS_SYNC
message to the cluster. Node B, on receiving that message, will
invoke __recover_slot to do the resync.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:12 +02:00