Some callers assume they can always add at least one page to an empty bio,
so dm_merge_bvec should not return 0 in this case: we'll reject the I/O
later after the bio is submitted.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When two md arrays share some block device (e.g each uses different
partitions on the one device), a resync of one array will wait for
the resync on the other to finish.
This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
the softlockup code notices and complains.
So use TASK_INTERRUPTIBLE instead and make sure to flush signals
before calling schedule.
Signed-off-by: NeilBrown <neilb@suse.de>
A recent patch to protect the rdev list with rcu locking leaves us
with a problem because we can sleep on memalloc while holding the
rcu lock.
The rcu lock is only needed while walking the linked list as
uninteresting devices (failed or spares) can be removed at any time.
So only take the rcu lock while actually walking the linked list.
Take a refcount on the rdev during the time when we drop the lock
and do the memalloc to start IO.
When we return to the locked code, all the interesting devices
on the list will not have moved, so we can simply use
list_for_each_continue_rcu to pick up where we left off.
Signed-off-by: NeilBrown <neilb@suse.de>
When stopping an md array, or just switching to read-only, we
currently call invalidate_partition while holding the mddev lock.
The main reason for this is probably to ensure all dirty buffers
are flushed (invalidate_partition calls fsync_bdev).
However if any dirty buffers are found, it will almost certainly cause
a deadlock as starting writeout will require an update to the
superblock, and performing that updates requires taking the mddev
lock - which is already held.
This deadlock can be demonstrated by running "reboot -f -n" with
a root filesystem on md/raid, and some dirty buffers in memory.
All other calls to stop an array should already happen after a flush.
The normal sequence is to stop using the array (e.g. umount) which
will cause __blkdev_put to call sync_blockdev. Then open the
array and issue the STOP_ARRAY ioctl while the buffers are all still
clean.
So this invalidate_partition is normally a no-op, except for one case
where it will cause a deadlock.
So remove it.
This patch possibly addresses the regression recored in
http://bugzilla.kernel.org/show_bug.cgi?id=11460
and
http://bugzilla.kernel.org/show_bug.cgi?id=11452
though it isn't yet clear how it ever worked.
Signed-off-by: NeilBrown <neilb@suse.de>
If a 'repair' is requested when an array is in a position to 'recover' raid1
will perform the repair while md believes a recovery is happening. Address
this at both ends, i.e. cancel check/repair requests upon detecting a
recover condition and do not call ->spare_active after completing a
check/repair.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The raid10 resync/recovery code currently limits the amount of
in-flight resync IO to 2Meg. This was copied from raid1 where
it seems quite adequate. However for raid10, some layouts require
a bit of seeking to perform a resync, and allowing a larger buffer
size means that the seeking can be significantly reduced.
There is probably no real need to limit the amount of in-flight
IO at all. Any shortage of memory will naturally reduce the
amount of buffer space available down to a set minimum, and any
concurrent normal IO will quickly cause resync IO to back off.
The only problem would be that normal IO has to wait for all resync IO
to finish, so a very large amount of resync IO could cause unpleasant
latency when normal IO starts up.
So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is
available) which seems to be a good amount. Also reduce the amount
of memory reserved as there is no need to keep 2Meg just for resync if
memory is tight.
Thanks to Keld for the suggestion.
Cc: Keld Jørn Simonsen <keld@dkuug.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
Removing faulty devices from an array is a two stage process.
First the device is moved from being a part of the active array
to being similar to a spare device. Then it can be removed
by a request from user space.
The first step is currently not performed for read-only arrays,
so the second step can never succeed.
So allow readonly arrays to remove failed devices (which aren't
blocked).
Signed-off-by: NeilBrown <neilb@suse.de>
When we have externally managed metadata, we need to mark a failed
device as 'Blocked' and not allow any writes until that device
have been marked as faulty in the metadata and the Blocked flag has
been removed.
However it is perfectly OK to allow read requests when there is a
Blocked device, and with a readonly array, there may not be any
metadata-handler watching for blocked devices.
So in raid5/raid6 only allow a Blocked device to interfere with
Write request or resync. Read requests go through untouched.
raid1 and raid10 already differentiate between read and write
properly.
Signed-off-by: NeilBrown <neilb@suse.de>
We cannot currently change the size of a write-intent bitmap.
So if we change the size of an array which has such a bitmap, it
tries to set bits beyond the end of the bitmap.
For now, simply reject any request to change the size of an array
which has a bitmap. mdadm can remove the bitmap and add a new one
after the array has changed size.
Signed-off-by: NeilBrown <neilb@suse.de>
A recent patch allowed do_md_stop to know whether it was being called
via an ioctl or not, and thus where to allow for an extra open file
descriptor when checking if it is in use.
This broke then switch to readonly performed by the shutdown notifier,
which needs to work even when the array is still (apparently) active
(as md doesn't get told when the filesystem becomes readonly).
So restore this feature by pretending that there can be lots of
file descriptors open, but we still want do_md_stop to switch to
readonly.
Signed-off-by: NeilBrown <neilb@suse.de>
If we reduce the 'safe_mode_delay', it could still wait for the old
delay to completely expire before doing anything about safe_mode.
Thus the effect if the change is delayed.
To make the effect more immediate, run the timeout function
immediately if the delay was reduced. This may cause it to run
slightly earlier that required, but that is the safer option.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md:
md: raid10: wake up frozen array
md: do not count blocked devices as spares
md: do not progress the resync process if the stripe was blocked
md: delay notification of 'active_idle' to the recovery thread
md: fix merge error
md: move async_tx_issue_pending_all outside spin_lock_irq
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
md: the bitmap code needs to use blk_plug_device_unlocked()
block: add a blk_plug_device_unlocked() that grabs the queue lock
When rescheduling a bio in raid10, we wake up
the md thread, but if the array is frozen, this
will have no effect. This causes the array to
remain frozen for eternity. We add a wake_up
to allow the array to de-freeze. This code is
nearly identical to the raid1 code, which has
this fix already.
Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
remove_and_add_spares() assumes that failed devices have been hot-removed
from the array. Removal is skipped in the 'blocked' case so do not count a
device in this state as 'spare'.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
handle_stripe will take no action on a stripe when waiting for userspace
to unblock the array, so do not report completed sectors.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
multipath keeps a separate device table which may be
more current than the built-in one.
So we should make sure to always call ->attach whenever
a multipath map with hardware handler is instantiated.
And we should call ->detach on removal, too.
[sekharan: update as per comments from agk]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
The original STRIPE_OP_IO removal patch had the following hunk:
- for (i = conf->raid_disks; i--; ) {
+ for (i = conf->raid_disks; i--; )
set_bit(R5_Wantwrite, &sh->dev[i].flags);
- if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
- sh->ops.count++;
- }
However it appears the hunk became broken after merging:
- for (i = conf->raid_disks; i--; ) {
+ for (i = conf->raid_disks; i--; )
set_bit(R5_Wantwrite, &sh->dev[i].flags);
set_bit(R5_LOCKED, &dev->flags);
s.locked++;
- if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
- sh->ops.count++;
- }
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Some dma drivers need to call spin_lock_bh in their device_issue_pending
routines. This change avoids:
WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3a/0x85()
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* 'for-linus' of git://neil.brown.name/md: (52 commits)
md: Protect access to mddev->disks list using RCU
md: only count actual openers as access which prevent a 'stop'
md: linear: Make array_size sector-based and rename it to array_sectors.
md: Make mddev->array_size sector-based.
md: Make super_type->rdev_size_change() take sector-based sizes.
md: Fix check for overlapping devices.
md: Tidy up rdev_size_store a bit:
md: Remove some unused macros.
md: Turn rdev->sb_offset into a sector-based quantity.
md: Make calc_dev_sboffset() return a sector count.
md: Replace calc_dev_size() by calc_num_sectors().
md: Make update_size() take the number of sectors.
md: Better control of when do_md_stop is allowed to stop the array.
md: get_disk_info(): Don't convert between signed and unsigned and back.
md: Simplify restart_array().
md: alloc_disk_sb(): Return proper error value.
md: Simplify sb_equal().
md: Simplify uuid_equal().
md: sb_equal(): Fix misleading printk.
md: Fix a typo in the comment to cmd_match().
...
This patch implements biovec merge function for crypt target.
If the underlying device has merge function defined, call it.
If not, keep precomputed value.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove max_sector restriction - merge function replaced it.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch implements biovec merge function for linear target.
If the underlying device has merge function defined, call it.
If not, keep precomputed value.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce a bvec merge function for device mapper devices
for dynamic size restrictions.
This code ensures the requested biovec lies within a single
target and then calls a target-specific function to check
against any constraints imposed by underlying devices.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change snapshot per-module mempool to per-device mempool.
Per-module mempools could cause a deadlock if multiple
snapshot devices are stacked above each other.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix a race condition that returns incorrect data when a write causes an
exception to be allocated whilst a read is still in flight.
The race condition happens as follows:
* A read to non-reallocated sector in the snapshot is submitted so that the
read is routed to the original device.
* A write to the original device is submitted. The write causes an exception
that reallocates the block. The write proceeds.
* The original read is dequeued and reads the wrong data.
This race can be triggered with CFQ scheduler and one thread writing and
multiple threads reading simultaneously.
(This patch relies upon the earlier dm-kcopyd-per-device.patch to avoid a
deadlock.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Whenever a snapshot read gets mapped through to the origin, track it in
a per-snapshot hash table indexed by chunk number, using memory allocated
from a new per-snapshot mempool.
We need to track these reads to avoid race conditions which will be fixed
by patches that follow.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Return a specific error message if there are an invalid number of multipath
arguments.
This invalid command returns an "Unknown error" because the ti->error field is
not set
dmsetup create --table '0 2 multipath 0 0 1 1 round-robin 0 1 1 /dev/sdh' mpath0
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm_dirty_log_{init,exit}() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Free path selector if the arguments are invalid.
This command (note that it is invalid) causes reference leak on module
"dm_round_robin" and prevents the module from being removed.
dmsetup create --table '0 2 multipath 0 0 1 1 round-robin /dev/sdh' mpath0
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
All modifications and most access to the mddev->disks list are made
under the reconfig_mutex lock. However there are three places where
the list is walked without any locking. If a reconfig happens at this
time, havoc (and oops) can ensue.
So use RCU to protect these accesses:
- wrap them in rcu_read_{,un}lock()
- use list_for_each_entry_rcu
- add to the list with list_add_rcu
- delete from the list with list_del_rcu
- delay the 'free' with call_rcu rather than schedule_work
Note that export_rdev did a list_del_init on this list. In almost all
cases the entry was not in the list anymore so it was a no-op and so
safe. It is no longer safe as after list_del_rcu we may not touch
the list_head.
An audit shows that export_rdev is called:
- after unbind_rdev_from_array, in which case the delete has
already been done,
- after bind_rdev_to_array fails, in which case the delete isn't needed.
- before the device has been put on a list at all (e.g. in
add_new_disk where reading the superblock fails).
- and in autorun devices after a failure when the device is on a
different list.
So remove the list_del_init call from export_rdev, and add it back
immediately before the called to export_rdev for that last case.
Note also that ->same_set is sometimes used for lists other than
mddev->list (e.g. candidates). In these cases rcu is not needed.
Signed-off-by: NeilBrown <neilb@suse.de>
Open isn't the only thing that increments ->active. e.g. reading
/proc/mdstat will increment it briefly. So to avoid false positives
in testing for concurrent access, introduce a new counter that counts
just the number of times the md device it open.
Signed-off-by: NeilBrown <neilb@suse.de>
This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Also, change the type of the size parameter from unsigned long long to
sector_t and rename it to num_sectors.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
The checks in overlaps() expect all parameters either in block-based
or sector-based quantities. However, its single caller passes two
rdev->data_offset arguments as well as two rdev->size arguments, the
former being sector counts while the latter are measured in 1K blocks.
This could cause rdev_size_store() to accept an invalid size from user
space. Fix it by passing only sector-based quantities to overlaps().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
- used strict_strtoull in place of simple_strtoull
- use my_mddev in place of rdev->mddev (they have the same value)
and more significantly,
- don't adjust mddev->size to fit, rather reject changes which make
rdev->size smaller than mddev->size
Adjusting mddev->size is a hangover from bind_rdev_to_array which
does a similar thing. But it really is a better design to insist that
mddev->size is set as required, then the rdev->sizes are set to allow
for that. The previous way invites confusion.
Signed-off-by: NeilBrown <neilb@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (102 commits)
[SCSI] scsi_dh: fix kconfig related build errors
[SCSI] sym53c8xx: Fix bogus sym_que_entry re-implementation of container_of
[SCSI] scsi_cmnd.h: remove double inclusion of linux/blkdev.h
[SCSI] make struct scsi_{host,target}_type static
[SCSI] fix locking in host use of blk_plug_device()
[SCSI] zfcp: Cleanup external header file
[SCSI] zfcp: Cleanup code in zfcp_erp.c
[SCSI] zfcp: zfcp_fsf cleanup.
[SCSI] zfcp: consolidate sysfs things into one file.
[SCSI] zfcp: Cleanup of code in zfcp_aux.c
[SCSI] zfcp: Cleanup of code in zfcp_scsi.c
[SCSI] zfcp: Move status accessors from zfcp to SCSI include file.
[SCSI] zfcp: Small QDIO cleanups
[SCSI] zfcp: Adapter reopen for large number of unsolicited status
[SCSI] zfcp: Fix error checking for ELS ADISC requests
[SCSI] zfcp: wait until adapter is finished with ERP during auto-port
[SCSI] ibmvfc: IBM Power Virtual Fibre Channel Adapter Client Driver
[SCSI] sg: Add target reset support
[SCSI] lib: Add support for the T10 (SCSI) Data Integrity Field CRC
[SCSI] sd: Move scsi_disk() accessor function to sd.h
...
Do not automatically "select" SCSI_DH for dm-multipath. If SCSI_DH
doesn't exist,just do not allow hardware handlers to be used.
Handle SCSI_DH being a module also. Make sure it doesn't allow DM_MULTIPATH
to be compiled in when SCSI_DH is a module.
[jejb: added comment for Kconfig syntax]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits)
splice: fix generic_file_splice_read() race with page invalidation
ramfs: enable splice write
drivers/block/pktcdvd.c: avoid useless memset
cdrom: revert commit 22a9189 (cdrom: use kmalloced buffers instead of buffers on stack)
scsi: sr avoids useless buffer allocation
block: blk_rq_map_kern uses the bounce buffers for stack buffers
block: add blk_queue_update_dma_pad
DAC960: push down BKL
pktcdvd: push BKL down into driver
paride: push ioctl down into driver
block: use get_unaligned_* helpers
block: extend queue_flag bitops
block: request_module(): use format string
Add bvec_merge_data to handle stacked devices and ->merge_bvec()
block: integrity flags can't use bit ops on unsigned short
cmdfilter: extend default read filter
sg: fix odd style (extra parenthesis) introduced by cmd filter patch
block: add bounce support to blk_rq_map_user_iov
cfq-iosched: get rid of enable_idle being unused warning
allow userspace to modify scsi command filter on per device basis
...
Rename it to sb_start to make sure all users have been converted.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
As BLOCK_SIZE_BITS is 10 and
MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x),
the return value of calc_dev_sboffset() doubles. Fix up all three
callers accordingly.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Number of sectors is the preferred unit for sizes of raid devices,
so change calc_dev_size() so that it returns this unit instead of
the number of 1K blocks.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Changing the internal representations of sizes of raid devices
from 1K blocks to sector counts (512B units) is desirable because
it allows to get rid of many divisions/multiplications and unnecessary
casts that are present in the current code.
This patch is a first step in this direction. It replaces the old
1K-based "size" argument of update_size() by "num_sectors" and
fixes up its two callers.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
do_md_stop check the number of active users before allowing the array
to be stopped.
Two problems:
1/ it assumes the request is coming through an open file descriptor
(via ioctl) so it allows for that. This is not always the case.
2/ it doesn't do the check it the array hasn't been activated.
This is not good for cases when we use an inactive array to hold
some devices in a container.
Signed-off-by: Neil Brown <neilb@suse.de>
The current code copies a signed int from user space, converts it to
unsigned and passes the unsigned value to find_rdev_nr() which expects
a signed value. Simply pass the signed value from user space directly.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
If alloc_page() fails, ENOMEM is a more suitable error value
than EINVAL.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
The only caller of sb_equal() tests the return value against
zero, so it's OK to return the negated return value of memcmp().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Remove the dubious attempt to prefer 'compute' over 'read'. Not only is it
wrong given commit c337869d (md: do not compute parity unless it is on a failed
drive), but it can trigger a BUG_ON in handle_parity_checks5().
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
- Remove superfluous parentheses.
- Make format string match the type of the variable that is printed.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
In case pers->run() succeeds but creating the bitmap fails, we
print an error message stating that pers->run() has failed.
Print this message only if pers->run() really failed.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Neil Brown <neilb@suse.de>
When devices are stacked, one device's merge_bvec_fn may need to perform
the mapping and then call one or more functions for its underlying devices.
The following bio fields are used:
bio->bi_sector
bio->bi_bdev
bio->bi_size
bio->bi_rw using bio_data_dir()
This patch creates a new struct bvec_merge_data holding a copy of those
fields to avoid having to change them directly in the struct bio when
going down the stack only to have to change them back again on the way
back up. (And then when the bio gets mapped for real, the whole
exercise gets repeated, but that's a problem for another day...)
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add cond_resched() to prevent monopolising CPU when processing large bios.
dm-crypt processes encryption of bios in sector units. If the bio request
is big it can spend a long time in the encryption call.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Tested-by: Yan Li <elliot.li.tech@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
md_allow_write() marks the metadata dirty while holding mddev->lock and then
waits for the write to complete. For externally managed metadata this causes a
deadlock as userspace needs to take the lock to communicate that the metadata
update has completed.
Change md_allow_write() in the 'external' case to start the 'mark active'
operation and then return -EAGAIN. The expected side effects while waiting for
userspace to write 'active' to 'array_state' are holding off reshape (code
currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
cause updates to 'raid_disks' to fail. Except for 'stripe_cache_size' changes
these failures can be mitigated by coordinating with mdmon.
md_write_start() still prevents writes from occurring until the metadata
handler has had a chance to take action as it unconditionally waits for
MD_CHANGE_CLEAN to be cleared.
[neilb@suse.de: return -EAGAIN, try GFP_NOIO]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
From: Dan Williams <dan.j.williams@intel.com>
Commit a4456856 refactored some of the deep code paths in raid5.c into separate
functions. The names chosen at the time do not consistently indicate what is
going to happen to the stripe. So, update the names, and since a stripe is a
cache element use cache semantics like fill, dirty, and clean.
(also, fix up the indentation in fetch_block5)
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
Neil said:
> At the end of ops_run_compute5 you have:
> /* ack now if postxor is not set to be run */
> if (tx && !test_bit(STRIPE_OP_POSTXOR, &s->ops_run))
> async_tx_ack(tx);
>
> It looks odd having that test there. Would it fit in raid5_run_ops
> better?
The intended global interpretation is that raid5_run_ops can build a chain
of xor and memcpy operations. When MD registers the compute-xor it tells
async_tx to keep the operation handle around so that another item in the
dependency chain can be submitted. If we are just computing a block to
satisfy a read then we can terminate the chain immediately. raid5_run_ops
gives a better context for this test since it cares about the entire chain.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
Currently ops_run_biodrain and other locations have extra logic to determine
which blocks are processed in the prexor and non-prexor cases. This can be
eliminated if handle_write_operations5 flags the blocks to be processed in all
cases via R5_Wantdrain. The presence of the prexor operation is tracked in
sh->reconstruct_state.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion) Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.
This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
Track the state of compute operations (recalculating a block from all the other
blocks in a stripe) with a state flag. Reduces the scope of the
STRIPE_OP_COMPUTE_BLK flag to only tracking whether a compute operation has
been requested via the ops_request field of struct stripe_head_state.
Note, the compute operation that is performed in the course of doing a 'repair'
operation (check the parity block, recalculate it and write it back if the
check result is not zero) is tracked separately with the 'check_state'
variable. Compute operations are held off while a 'check' is in progress, and
moving this check out to handle_issuing_new_read_requests5 the helper routine
__handle_issuing_new_read_requests5 can be simplified.
This is another step towards the removal of ops.{pending,ack,complete,count},
i.e. STRIPE_OP_COMPUTE_BLK only requests an operation and does not track the
state of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
Track the state of read operations (copying data from the stripe cache to bio
buffers outside the lock) with a state flag. Reduce the scope of the
STRIPE_OP_BIOFILL flag to only tracking whether a biofill operation has been
requested via the ops_request field of struct stripe_head_state.
This is another step towards the removal of ops.{pending,ack,complete,count},
i.e. STRIPE_OP_BIOFILL only requests an operation and does not track the state
of the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
The STRIPE_OP_* flags record the state of stripe operations which are
performed outside the stripe lock. Their use in indicating which
operations need to be run is straightforward; however, interpolating what
the next state of the stripe should be based on a given combination of
these flags is not straightforward, and has led to bugs. An easier to read
implementation with minimal degrees of freedom is needed.
Towards this goal, this patch introduces explicit states to replace what was
previously interpolated from the STRIPE_OP_* flags. For now this only converts
the handle_parity_checks5 path, removing a user of the
ops.{pending,ack,complete,count} fields of struct stripe_operations.
This conversion also found a remaining issue with the current code. There is
a small window for a drive to fail between when we schedule a repair and when
the parity calculation for that repair completes. When this happens we will
writeback to 'failed_num' when we really want to write back to 'pd_idx'.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
Let the raid6 path call ops_run_io to get pending i/o submitted.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
In handle_stripe after taking sh->lock we sample some bits into 's' (struct
stripe_head_state):
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
Use these values from 's' in ops_run_io() rather than re-sampling the bits.
This ensures a consistent snapshot (as seen under sh->lock) is used.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
The R5_Want{Read,Write} flags already gate i/o. So, this flag is
superfluous and we can unconditionally call ops_run_io().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Dan Williams <dan.j.williams@intel.com>
This micro-optimization allowed the raid code to skip a re-read of the
parity block after checking parity. It took advantage of the fact that
xor-offload-engines have their own internal result buffer and can check
parity without writing to memory. Remove it for the following reasons:
1/ It is a layering violation for MD to need to manage the DMA and
non-DMA paths within async_xor_zero_sum
2/ Bad precedent to toggle the 'ops' flags outside the lock
3/ Hard to realize a performance gain as reads will not need an updated
parity block and writes will dirty it anyways.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: Chris Webb <chris@arachsys.com>
Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to zero
to fill all the available space on the underlying device.
Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: Neil Brown <neilb@suse.de>
The important state change happens during an interrupt
in md_error. So just set a flag there and call sysfs_notify
later in process context.
Signed-off-by: Neil Brown <neilb@suse.de>
When a device fails, when a spare is activated, when
an array is reshaped, or when an array is started,
the extent to which the array is degraded can change.
Signed-off-by: Neil Brown <neilb@suse.de>
When the 'resync' thread starts or stops, when we explicitly
set sync_action, or when we determine that there is definitely nothing
to do, we notify sync_action.
To stop "sync_action" from occasionally showing the wrong value,
we introduce a new flags - MD_RECOVERY_RECOVER - to say that a
recovery is probably needed or happening, and we make sure
that we set MD_RECOVERY_RUNNING before clearing MD_RECOVERY_NEEDED.
Signed-off-by: Neil Brown <neilb@suse.de>
Changes in md/array_state could be of interest to a monitoring
program. So make sure all changes trigger a notification.
Exceptions:
changing active_idle to active is not reported because it
is frequent and not interesting.
changing active to active_idle is only reported on arrays
with externally managed metadata, as it is not interesting
otherwise.
Signed-off-by: Neil Brown <neilb@suse.de>
There is really no need for this test here, and there are valid
cases for selectively removing devices from an array that
it not actually active.
Signed-off-by: Neil Brown <neilb@suse.de>
For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.
This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.
So convert all to return 0 for success or -errno on failure
and fix call sites to match.
Signed-off-by: Neil Brown <neilb@suse.de>
i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill an vacant slot in an and md array.
Signed-off-by: Neil Brown <neilb@suse.de>
offset_store and rdev_size_store allow control of the region of a
device which is to be using in an md/raid array.
They only allow these values to be set when an array is being assembled,
as changing them on an active array could be dangerous.
However when adding a spare device to an array, we might need to
set the offset and size before starting recovery. So allow
these values to be set also if "->raid_disk < 0" which indicates that
the device is still a spare.
Signed-off-by: Neil Brown <neilb@suse.de>
Arrays personalities such as 'raid0' and 'linear' have no redundancy,
and so marking them as 'clean' or 'dirty' is not meaningful.
So always allow write requests without requiring a superblock update.
Such arrays types are detected by ->sync_request being NULL. If it is
not possible to send a sync request we don't need a 'dirty' flag because
all a dirty flag does is trigger some sync_requests.
Signed-off-by: Neil Brown <neilb@suse.de>
There is a possible race in md_probe. If two threads call md_probe
for the same device, then one could exit (having checked that
->gendisk exists) before the other has called kobject_init_and_add,
thus returning an incomplete kobj which will cause problems when
we try to add children to it.
So extend the range of protection of disks_mutex slightly to
avoid this possibility.
Signed-off-by: Neil Brown <neilb@suse.de>
This makes it possible to just resync a small part of an array.
e.g. if a drive reports that it has questionable sectors,
a 'repair' of just the region covering those sectors will
cause them to be read and, if there is an error, re-written
with correct data.
Signed-off-by: Neil Brown <neilb@suse.de>
When an array is degraded, bits in the write-intent bitmap are not
cleared, so that if the missing device is re-added, it can be synced
by only updated those parts of the device that have changed since
it was removed.
The enable this a 'events_cleared' value is stored. It is the event
counter for the array the last time that any bits were cleared.
Sometimes - if a device disappears from an array while it is 'clean' -
the events_cleared value gets updated incorrectly (there are subtle
ordering issues between updateing events in the main metadata and the
bitmap metadata) resulting in the missing device appearing to require
a full resync when it is re-added.
With this patch, we update events_cleared precisely when we are about
to clear a bit in the bitmap. We record events_cleared when we clear
the bit internally, and copy that to the superblock which is written
out before the bit on storage. This makes it more "obviously correct".
We also need to update events_cleared when the event_count is going
backwards (as happens on a dirty->clean transition of a non-degraded
array).
Thanks to Mike Snitzer for identifying this problem and testing early
"fixes".
Cc: "Mike Snitzer" <snitzer@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Turn calls to bi->bi_end_io() into bio_endio(). Apparently bio_endio does
exactly the same error processing as is hardcoded at these places.
bio_endio() avoids recursion (or will soon), so it should be used.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
From: "Nikanth Karthikesan" <knikanth@novell.com>
Correct disk numbering problem check.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Neil Brown <neilb@suse.de>
md_probe can fail (e.g. alloc_disk could fail) without
returning an error (as it alway returns NULL).
So when we call mddev_find immediately afterwards, we need
to check that md_probe actually succeeded. This means checking
that mdev->gendisk is non-NULL.
cc: <stable@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
We shouldn't acknowledge that a stripe has been expanded (When
reshaping a raid5 by adding a device) until the moved data has
actually been written out. However we are currently
acknowledging (by calling md_done_sync) when the POST_XOR
is complete and before the write.
So track in s.locked whether there are pending writes, and don't
call md_done_sync yet if there are.
Note: we all set R5_LOCKED on devices which are are about to
read from. This probably isn't technically necessary, but is
usually done when writing a block, and justifies the use of
s.locked here.
This bug can lead to a crash if an array is stopped while an reshape
is in progress.
Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
If, while assembling an array, we find a device which is not fully
in-sync with the array, it is important to set the "fullsync" flags.
This is an exact analog to the setting of this flag in hot_add_disk
methods.
Currently, only v1.x metadata supports having devices in an array
which are not fully in-sync (it keep track of how in sync they are).
The 'fullsync' flag only makes a difference when a write-intent bitmap
is being used. In this case it tells recovery to ignore the bitmap
and recovery all blocks.
This fix is already in place for raid1, but not raid5/6 or raid10.
So without this fix, a raid1 ir raid4/5/6 array with version 1.x
metadata and a write intent bitmaps, that is stopped in the middle
of a recovery, will appear to complete the recovery instantly
after it is reassembled, but the recovery will not be correct.
If you might have an array like that, issueing
echo repair > /sys/block/mdXX/md/sync_action
will make sure recovery completes properly.
Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
If a block is computed (rather than read) then a check/repair operation
may be lead to believe that the data on disk is correct, when infact it
isn't. So only compute blocks for failed devices.
This issue has been around since at least 2.6.12, but has become harder to
hit in recent kernels since most reads bypass the cache.
echo repair > /sys/block/mdN/md/sync_action will set the parity blocks to the
correct state.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an array was created with --assume-clean we will oops when trying to
set ->resync_max.
Fix this by initializing ->recovery_wait in mddev_find.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During the initial array synchronization process there is a window between
when a prexor operation is scheduled to a specific stripe and when it
completes for a sync_request to be scheduled to the same stripe. When
this happens the prexor completes and the stripe is unconditionally marked
"insync", effectively canceling the sync_request for the stripe. Prior to
2.6.23 this was not a problem because the prexor operation was done under
sh->lock. The effect in older kernels being that the prexor would still
erroneously mark the stripe "insync", but sync_request would be held off
and re-mark the stripe as "!in_sync".
Change the write completion logic to not mark the stripe "in_sync" if a
prexor was performed. The effect of the change is to sometimes not set
STRIPE_INSYNC. The worst this can do is cause the resync to stall waiting
for STRIPE_INSYNC to be set. If this were happening, then STRIPE_SYNCING
would be set and handle_issuing_new_read_requests would cause all
available blocks to eventually be read, at which point prexor would never
be used on that stripe any more and STRIPE_INSYNC would eventually be set.
echo repair > /sys/block/mdN/md/sync_action will correct arrays that may
have lost this race.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch just removes infrastructure that provided support for hardware
handlers in the dm layer as it is not needed anymore.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This patch removes the 3 hardware handlers that currently exist
under dm as the functionality is moved to SCSI layer in the earlier
patches.
[jejb: removed more makefile hunks and rejection fixes]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This patch just removes the dm layer's path initialization completion
routine. This is separated from the other patch(scsi_dh: Use SCSI
device handler in dm-multipath) Just to make that patch more readable.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Before this patch set (SCSI hardware handlers), initialization of a
path was done asynchronously. Doing that requires a workqueue in each
device/hardware handler module and leads to unneccessary complication
in the device handler code, making it difficult to read the code and
follow the state diagram.
Moving that workqueue to this level makes the device handler code simpler.
Hence, the workqueue is moved to dm level.
A new workqueue is added instead of adding it to the existing workqueue
(kmpathd) for the following reasons:
1. Device activation has to happen faster, stacking them along
with the other workqueue might lead to unnecessary delay
in the activation of the path.
2. The effect could be felt the other way too. i.e the current
events that are handled by the existing workqueue might get
a delayed response.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This patch converts dm-mpath to use scsi device handlers instead of
dm's hardware handlers.
This patch does not add any new functionality. Old behaviors remain and
userspace tools work as is except that arguments supplied with hardware
handler are ignored.
One behavioral exception is: Activation of a path is synchronous in this
patch, opposed to the older behavior of being asynchronous (changed in
patch 07: scsi_dh: Add a single threaded workqueue for initializing a path)
Note: There is no need to get a reference for the device handler module
(as it was done in the dm hardware handler case) here as the reference
is held when the device was first found. Instead we check and make sure
that support for the specified device is present at table load time.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not be be checkpointed.
- We remove spares and then re-added them which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some configurations, a raid6 resync can be limited by CPU speed
(Calculating P and Q and moving data) rather than by device speed. In
these cases there is nothing to be gained byt serialising resync of arrays
that share a device, and doing the resync in parallel can provide benefit.
So add a sysfs tunable to flag an array as being allowed to resync in
parallel with other arrays that use (a different part of) the same device.
Signed-off-by: Bernd Schubert <bs@q-leap.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This additional notification to 'array_state' is needed to allow the
monitor application to learn about stop events via sysfs. The
sysfs_notify("sync_action") call that comes at the end of do_md_stop()
(via md_new_event) is insufficient since the 'sync_action' attribute has
been removed by this point.
(Seems like a sysfs-notify-on-removal patch is a better fix. Currently
removal updates the event count but does not wake up waiters)
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an array enters write pending, 'array_state' changes, so we must be
sure to sysfs_notify.
Also, when waiting for user-space to acknowledge 'write-pending' by
marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to
be cleared as that might not happen. So explicity test for the bits that
we are really interested in.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When performing a "recovery" or "check" pass on a RAID1 array, we read
from each device and possible, if there is a difference or a read error,
write back to some devices.
We use the same 'bio' for both read and write, resetting various fields
between the two operations.
We forgot to reset bv_offset and bv_len however. These are often left
unchanged, but in the case where there is an IO error one or two sectors
into a page, they are changed.
This results in correctable errors not being corrected properly. It does
not result in any data corruption.
Cc: "Fairbanks, David" <David.Fairbanks@stratus.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Last night we had scsi problems and a hardware raid unit was offlined
during heavy i/o. While this happened we got for about 3 minutes a huge
number messages like these
Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2).
I guess the high error rate is responsible for not scheduling other events
- during this time the system was not pingable and in the end also other
devices run into scsi command timeouts causing problems on these unrelated
devices as well.
Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill the trivial and rather pointless file_path wrapper around d_path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible to add a write-intent bitmap to an active array, or remove
the bitmap that is there.
When we do with the 'quiesce' the array, which causes make_request to
block in "wait_barrier()".
However we are sampling the value of "mddev->bitmap" before the
wait_barrier call, and using it afterwards. This can result in using a
bitmap structure that has been freed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.
For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q->queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock, us
q->__queue_lock. So always initialise that lock when allocated.
With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.
Thanks to Dan Williams for review and fixing the silly bugs.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacek Luczak <difrost.kernel@gmail.com>
Cc: Prakash Punnoor <prakash@punnoor.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit bd2ab67030 "md: close a livelock window
in handle_parity_checks5" introduced a bug in handling 'repair' operations.
After a repair operation completes we clear the state bits tracking this
operation. However, they are cleared too early and this results in the code
deciding to re-run the parity check operation. Since we have done the repair
in memory the second check does not find a mismatch and thus does not do a
writeback.
Test results:
$ echo repair > /sys/block/md0/md/sync_action
$ cat /sys/block/md0/md/mismatch_cnt
51072
$ echo repair > /sys/block/md0/md/sync_action
$ cat /sys/block/md0/md/mismatch_cnt
0
(also fix incorrect indentation)
Cc: <stable@kernel.org>
Tested-by: George Spelvin <linux@horizon.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer
drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer
sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allows a userspace metadata handler to take action upon detecting a device
failure.
Based on an original patch by Neil Brown.
Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev->external
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Found when trying to reassemble an active externally managed array. Without
this check we hit the more noisy "sysfs duplicate" warning in the later call
to kobject_add.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When setting an array to 'readonly' or to 'active' via sysfs, we must make the
appropriate set_disk_ro call too.
Also when switching to "read_auto" (which is like readonly, but blocks on the
first write so that metadata can be marked 'dirty') we need to be more careful
about what state we are changing from.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'safemode' relates to marking an array as 'clean' if there has been no write
traffic for a while (a couple of seconds), to reduce the chance of the array
being found dirty on reboot.
->safemode is set to '1' when there have been no write for a while, and it
gets set to '0' when the superblock is updates with the 'clean' flag set.
This requires a few fixes for 'external' metadata:
- When an array is set to 'clean' via sysfs, 'safemode' must be cleared.
- when we write to an array that has 'safemode' set (there must have been
some delay in updating the metadata), we need to clear safemode.
- Don't try to update external metadata in md_check_recovery for safemode
transitions - it won't work.
Also, don't try to support "immediate safe mode" (safemode==2) for external
metadata, it cannot really work (the safemode timeout can be set very low if
this is really needed).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I keep finding problems where an mddev gets reused and some fields has a value
from a previous usage that confuses the new usage. So clear all fields that
could possible need clearing when calling do_md_stop.
Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All the metadata update processing for external metadata is on in user-space
or through the sysfs interfaces, so make "md_update_sb" a no-op in that case.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rdev->mddev is no longer valid upon return from entry->store() when the
'remove' command is given.
Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Skip I/O merges when disabled
block: add large command support
block: replace sizeof(rq->cmd) with BLK_MAX_CDB
ide: use blk_rq_init() to initialize the request
block: use blk_rq_init() to initialize the request
block: rename and export rq_init()
block: no need to initialize rq->cmd with blk_get_request
block: no need to initialize rq->cmd in prepare_flush_fn hook
block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
block/elevator.c:elv_rq_merge_ok() mustn't be inline
block: make queue flags non-atomic
block: add dma alignment and padding support to blk_rq_map_kern
unexport blk_max_pfn
ps3disk: Remove superfluous cast
block: make rq_init() do a full memset()
relay: fix splice problem
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Neil Brown <neilb@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
blk_get_request initializes rq->cmd (rq_init does) so the users don't
need to do that.
The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd,
as a preparation for large command support, which changes rq->cmd from
the static array to a pointer. sizeof(rq->cmd) will not make sense and
&rq->cmd won't work.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The functions time_before, time_before_eq, time_after, and time_after_eq
are more robust for comparing jiffies against other values.
A simplified version of the semantic patch making this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@ change_compare_np @
expression E;
@@
(
- jiffies <= E
+ time_before_eq(jiffies,E)
|
- jiffies >= E
+ time_after_eq(jiffies,E)
|
- jiffies < E
+ time_before(jiffies,E)
|
- jiffies > E
+ time_after(jiffies,E)
)
@ include depends on change_compare_np @
@@
#include <linux/jiffies.h>
@ no_include depends on !include && change_compare_np @
@@
#include <linux/...>
+ #include <linux/jiffies.h>
// </smpl>
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MD drivers use one printk() call to print 2 log messages and the second line
may be prefixed by a TAB character. It may also output a trailing space
before newline. klogd (I think) turns the TAB character into the 2 characters
'^I' when logging to a file. This looks ugly.
Instead of a leading TAB to indicate continuation, prefix both output lines
with 'raid:' or similar. Also remove any trailing space in the vicinity of
the affected code and consistently end the sentences with a period.
Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
strict_strtoul handles the open-coded sanity checks in
raid5_store_stripe_cache_size and raid5_store_preread_threshold
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Improve write performance by preventing the delayed_list from dumping all its
stripes onto the handle_list in one shot. Delayed stripes are now further
delayed by being held on the 'hold_list'. The 'hold_list' is bypassed when:
* a STRIPE_IO_STARTED stripe is found at the head of 'handle_list'
* 'handle_list' is empty and i/o is being done to satisfy full stripe-width
write requests
* 'bypass_count' is less than 'bypass_threshold'. By default the threshold
is 1, i.e. every other stripe handled is a preread stripe provided the
top two conditions are false.
Benchmark data:
System: 2x Xeon 5150, 4x SATA, mem=1GB
Baseline: 2.6.24-rc7
Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
* patched: +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512)
Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS)
* patched: +13%
* patched + preread_bypass_threshold = 0: +37%
Changes since v1:
* reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and
make it configurable. This defaults to fairness and modest performance
gains out of the box.
Changes since v2:
* [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not
necessary, the important change was clearing STRIPE_DELAYED in
add_stripe_bio and this has been moved out to make_request for the hang
fix.
* [neilb@suse.de]: simplify get_priority_stripe
* [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is
sampled empty (+11%)
* [dan.j.williams@intel.com]: decrement the bypass_count at the detection
of stripes being naturally promoted off of hold_list +2%. Note, resetting
bypass_count instead of decrementing on these events yields +4% but that is
probably too aggressive.
Changes since v3:
* cosmetic fixups
Tested-by: James W. Laferriere <babydr@baby-dragons.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer
drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer
Add some braces to match the else-block as well.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The small patch below:
- Removes the unused md argument from both specific_minor() and next_free_minor()
- Folds kmalloc + memset(0) into a single kzalloc call in alloc_dev()
This has been compile tested on x86.
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm_create_error_table() was added in kernel 2.6.18 and never used...
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
void returning functions returned the return value of another void
returning function...
Spotted by sparse.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.
It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
queue (i.e. block the requests from being dispatched to the physical device).
The queue is unplugged when the caller calls blk_unplug() function. Usually, the
sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
the queue and waits (to be possibly joined with other adjacent bios). Then, when
the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
the disk.
This was happenning:
When doing O_SYNC writes, function fsync_buffers_list() submits a list of
bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
woken up.
fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but
there are no bios on the device queue as they are still in the dm_raid1 queue.
wait_on_buffer() starts waiting until the IO is finished.
kmirrord is scheduled, kmirrord takes bios and submits them to the devices.
The submitted bio plugs the harddisk queue but there is no one to unplug it.
(The process that called wait_on_buffer() is already sleeping.)
So there is a 3ms timeout, after which the queues on the harddisks are
unplugged and requests are processed.
This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
dm-raid1 is 10 times slower than md raid1.
Every time we submit something asynchronously via dm_io, we must unplug the
queue actually to send the request to the device.
This patch adds an unplug call to kmirrord - while processing requests, it keeps
the queue plugged (so that adjacent bios can be merged); when it finishes
processing all the bios, it unplugs the queue to submit the bios.
It also fixes kcopyd which has the same potential problem. All kcopyd requests
are submitted with BIO_RW_SYNC.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
This patch replaces the schedule() in the main kmirrord thread with a timer.
The schedule() could introduce an unwanted delay when work is ready to be
processed.
The code instead calls wake() when there's work to be done immediately, and
delayed_wake() after a failure to give a short delay before retrying.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove client counting code that is no longer needed.
Initialization and destruction is made globally from dm_init and dm_exit and is
not based on client counts. Initialization allocates only one empty slab cache,
so there is no negative impact from performing the initialization always,
regardless of whether some client uses kcopyd or not.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change the global mempool in kcopyd into a per-device mempool to avoid
deadlock possibilities.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove internal module reference fields from the interface.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the dm-log interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the kcopyd interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the dm-io interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Save pointer to dm_target in dm_snapshot structure.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the dirty region log code into a separate module so
other targets can share the code.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change dm-log.c messages from "mirror log" to "dirty region log" as
a new dm target wants to share this code.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Limit the amount of memory allocated per snapshot on systems
with a large page size. (The larger default chunk size on
these systems compensates for the smaller number of pages reserved.)
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If a snapshot has a smaller chunksize than the page size the
conversion to pages currently returns 0 instead of 1, causing:
kernel BUG in mempool_resize.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
drivers/md/*.[ch] contains only one more printk line with a trailing space.
Remove it.
Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
If a failure is detected after a parity check operation has been initiated,
but before it completes handle_parity_checks5 will never quiesce operations on
the stripe.
Explicitly handle this case by "canceling" the parity check, i.e. clear the
STRIPE_OP_CHECK flags and queue the stripe on the handle list again to refresh
any non-uptodate blocks.
Kernel versions >= 2.6.23 are susceptible.
Cc: <stable@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
write_err is an unsigned long used with set_bit() so should not be passed
around as unsigned int.
http://bugzilla.kernel.org/show_bug.cgi?id=10271
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix regression in dm-crypt introduced in commit
3a7f6c990a ("dm crypt: use async crypto").
If write requests need to be split into pieces, the code must not process them
in parallel because the crypto context cannot be shared. So there can be
parallel crypto operations on one part of the write, but only one write bio
can be processed at a time.
This is not optimal and the workqueue code needs to be optimized for parallel
processing, but for now it solves the problem without affecting the
performance of synchronous crypto operation (most of current dm-crypt users).
http://bugzilla.kernel.org/show_bug.cgi?id=10242http://bugzilla.kernel.org/show_bug.cgi?id=10207
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc-3.4.5 on sparc64:
drivers/md/raid5.c: In function `raid5_end_read_request':
drivers/md/raid5.c:1147: warning: long long unsigned int format, long unsigned int arg (arg 4)
drivers/md/raid5.c:1164: warning: long long unsigned int format, long unsigned int arg (arg 3)
drivers/md/raid5.c:1170: warning: long long unsigned int format, long unsigned int arg (arg 3)
sector_t is u64, and we don't know what type the architecture uses to
implement u64 (on some it is unsigned long).
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Exposing the binary blob which is the md 'super-block' via sysfs doesn't
really fit with the whole sysfs model, and ever since commit
8118a859dc ("sysfs: fix off-by-one error
in fill_read_buffer()") it doesn't actually work at all (as the size of
the blob is often one page).
(akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG)
So just remove it altogether. It isn't really useful.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recent patch titled
Reduce CPU wastage on idle md array with a write-intent bitmap.
would sometimes leave the array with dirty bitmap bits that stay dirty. A
subsequent write would sort things out so it isn't a big problem, but should
be fixed nonetheless.
We need to make sure that when the bitmap becomes not "allclean", the
daemon_sleep really does get set to a sensible value.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an md array is "auto-read-only", then this appears in /proc/mdstat as
/dev/md0: active(auto-read-only)
whereas if it is truely readonly, it appears as
/dev/md0: active (read-only)
The difference being a space.
One program known to parse this file expects the space and gets badly
confused. It will be fixed, but it would be best if what the kernel generates
is more consistent too.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This message describes another issue about md RAID10 found by testing the
2.6.24 md RAID10 using new scsi fault injection framework.
Abstract:
When a scsi error results in disabling a disk during RAID10 recovery, the
resync threads of md RAID10 could stall.
This case, the raid array has already been broken and it may not matter. But
I think stall is not preferable. If it occurs, even shutdown or reboot will
fail because of resource busy.
The deadlock mechanism:
The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
be handled when recovering. The "remaining" counter is incremented when
building a BIO in sync_request() and is decremented when finish a BIO in
end_sync_write().
If building a BIO fails for some reasons in sync_request(), the "remaining"
should be decremented if it has already been incremented. I found a case
where this decrement is forgotten. This causes a md_do_sync() deadlock
because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
end_sync_write() never calls md_done_sync() because of the "remaining" counter
mismatch.
For example, this problem would be reproduced in the following case:
Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
[>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec
This case, sdf1 is recovering, sdb1 and sde1 are disabled.
An additional error with detaching sdd will cause a deadlock.
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
[=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec
2739 ? S< 0:17 [md0_raid10]
28608 ? D< 0:00 [md0_resync]
28629 pts/1 Ss 0:00 bash
28830 pts/1 R+ 0:00 ps ax
31819 ? D< 0:00 [kjournald]
The resync thread keeps working, but actually it is deadlocked.
Patch:
By this patch, the remaining counter will be decremented if needed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
another possible deadlock in raid1/raid10 error handing.
If a read request returns an error while a resync is happening and a resync
request is pending, the attempt to fix the error will block until the resync
progresses, and the resync will block until the read request completes. Thus
a deadlock.
This patch fixes the problem.
Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes the disk to be read for layout "far > 1" to always be the
disk with the lowest block address.
Thus the chunks to be read will always be (for a fully functioning array) from
the first band of stripes, and the raid will then work as a raid0 consisting
of the first band of stripes.
Some advantages:
The fastest part which is the outer sectors of the disks involved will be
used. The outer blocks of a disk may be as much as 100 % faster than the
inner blocks.
Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.
Mixed disks with different performance characteristics will work better, as
they will work as raid0, the sequential read rate will be number of disks
involved times the IO rate of the slowest disk.
If a disk is malfunctioning, the first disk which is working, and has the
lowest block address for the logical block will be used.
Signed-off-by: Keld Simonsen <keld@dkuug.dk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we access attributes of an rdev (component device on an md array) through
sysfs, we really need to lock the array against concurrent changes. We
currently do that when we change an attribute, but not when we read an
attribute. We need to lock when reading as well else rdev->mddev could become
NULL while we are accessing it.
So add appropriate locking (mddev_lock) to rdev_attr_show.
rdev_size_store requires some extra care as well as it needs to unlock the
mddev while scanning other mddevs for overlapping regions. We currently
assume that rdev->mddev will still be unchanged after the scan, but that
cannot be certain. So take a copy of rdev->mddev for use at the end of the
function.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A resync/reshape/recovery thread will refuse to progress when the array is
marked read-only. So whenever it mark it not read-only, it is important to
wake up thread resync thread. There is one place we didn't do this.
The problem manifests if the start_ro module parameters is set, and a raid5
array that is in the middle of a reshape (restripe) is started. The array
will initially be semi-read-only (meaning it acts like it is readonly until
the first write). So the reshape will not proceed.
On the first write, the array will become read-write, but the reshape will not
be started, and there is no event which will ever restart that thread.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a raid1 array is stopped, all components currently get added to the list
for auto-detection. However we should really only add components that were
found by autodetection in the first place. So add a flag to record that
information, and use it.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure the data doesn't start before the end of the superblock when the
superblock is at the start of the device.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On an md array with a write-intent bitmap, a thread wakes up every few seconds
and scans the bitmap looking for work to do. If the array is idle, there will
be no work to do, but a lot of scanning is done to discover this.
So cache the fact that the bitmap is completely clean, and avoid scanning the
whole bitmap when the cache is known to be clean.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.
This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).
However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.
So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.
Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
problem.
Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes two NULL dereferences introduced by commit
06386bbfd2 and spotted by the Coverity
checker.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d_path() is used on a <dentry,vfsmount> pair. Lets use a struct path to
reflect this.
[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Bryan Wu <bryan.wu@analog.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_path() is always called with a dentry and a vfsmount from a struct path.
Make seq_path() take it directly as an argument.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order
* Switch from path_release(nd) to path_put(&nd->path)
* Rename dput_path() to path_put_conditional()
[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.
Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:
without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux
with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux
This patch:
Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
test_and_set_bit() on address of uint32_t is a Bad Idea(tm)...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds extra information to the mirror status output, so that
it can be determined which device(s) have failed. For each mirror device,
a character is printed indicating the most severe error encountered. The
characters are:
* A => Alive - No failures
* D => Dead - A write failure occurred leaving mirror out-of-sync
* S => Sync - A sychronization failure occurred, mirror out-of-sync
* R => Read - A read failure occurred, mirror data unaffected
This allows userspace to properly reconfigure the mirror set.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch gives the ability to respond-to/record device failures
that happen during read operations. It also adds the ability to
read from mirror devices that are not the primary if they are
in-sync.
There are essentially two read paths in mirroring; the direct path
and the queued path. When a read request is mapped, if the region
is 'in-sync' the direct path is taken; otherwise the queued path
is taken.
If the direct path is taken, we must record bio information so that
if the read fails we can retry it. We then discover the status of
a direct read through mirror_end_io. If the read has failed, we will
mark the device from which the read was attempted as failed (so we
don't try to read from it again), restore the bio and try again.
If the queued path is taken, we discover the results of the read
from 'read_callback'. If the device failed, we will mark the device
as failed and attempt the read again if there is another device
where this region is known to be 'in-sync'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds the ability to requeue write I/O to
core device-mapper when there is a log device failure.
If a write to the log produces and error, the pending writes are
put on the "failures" list. Since the log is marked as failed,
they will stay on the failures list until a suspend happens.
Suspends come in two phases, presuspend and postsuspend. We must
make sure that all the writes on the failures list are requeued
in the presuspend phase (a requirement of dm core). This means
that recovery must be complete (because writes may be delayed
behind it) and the failures list must be requeued before we
return from presuspend.
The mechanisms to ensure recovery is complete (or stopped) was
already in place, but needed to be moved from postsuspend to
presuspend. We rely on 'flush_workqueue' to ensure that the
mirror thread is complete and therefore, has requeued all writes
in the failures list.
Because we are using flush_workqueue, we must ensure that no
additional 'queue_work' calls will produce additional I/O
that we need to requeue (because once we return from
presuspend, we are unable to do anything about it). 'queue_work'
is called in response to the following functions:
- complete_resync_work = NA, recovery is stopped
- rh_dec (mirror_end_io) = NA, only calls 'queue_work' if it
is ready to recover the region
(recovery is stopped) or it needs
to clear the region in the log*
**this doesn't get called while
suspending**
- rh_recovery_end = NA, recovery is stopped
- rh_recovery_start = NA, recovery is stopped
- write_callback = 1) Writes w/o failures simply call
bio_endio -> mirror_end_io -> rh_dec
(see rh_dec above)
2) Writes with failures are put on
the failures list and queue_work is
called**
** write_callbacks don't happen
during suspend **
- do_failures = NA, 'queue_work' not called if suspending
- add_mirror (initialization) = NA, only done on mirror creation
- queue_bio = NA, 1) delayed I/O scheduled before flush_workqueue
is called. 2) No more I/Os are being issued.
3) Re-attempted READs can still be handled.
(Write completions are handled through rh_dec/
write_callback - mention above - and do not
use queue_bio.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds the calls to 'fail_mirror' if an error occurs during
mirror recovery (aka resynchronization). 'fail_mirror' is responsible
for recording the type of error by mirror device and ensuring an event
gets raised for the purpose of notifying userspace.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch gives mirror the ability to handle device failures
during normal write operations.
The 'write_callback' function is called when a write completes.
If all the writes failed or succeeded, we report failure or
success respectively. If some of the writes failed, we call
fail_mirror; which increments the error count for the device, notes
the type of error encountered (DM_RAID1_WRITE_ERROR), and
selects a new primary (if necessary). Note that the primary
device can never change while the mirror is not in-sync (IOW,
while recovery is happening.) This means that the scenario
where a failed write changes the primary and gives
recovery_complete a chance to misread the primary never happens.
The fact that the primary can change has necessitated the change
to the default_mirror field. We need to protect against reading
garbage while the primary changes. We then add the bio to a new
list in the mirror set, 'failures'. For every bio in the 'failures'
list, we call a new function, '__bio_mark_nosync', where we mark
the region 'not-in-sync' in the log and properly set the region
state as, RH_NOSYNC. Userspace must also be notified of the
failure. This is done by 'raising an event' (dm_table_event()).
If fail_mirror is called in process context the event can be raised
right away. If in interrupt context, the event is deferred to the
kmirrord thread - which raises the event if 'event_waiting' is set.
Backwards compatibility is maintained by ignoring errors if
the DM_FEATURES_HANDLE_ERRORS flag is not present.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Provided sector_t is 64 bits, reduce the in-memory footprint of the
snapshot exception table by the simple method of using unused bits of
the chunk number to combine consecutive entries.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds additional information to the status line. It is added at the
end of the returned text so it will not interfere with existing
implementations using this data. The addition of this information will allow
for a common return interface to match that returned with the dm-raid1.c
status line (with Jonathan Brassow's patches).
Here is a sample of what is returned with a mirror "status" call:
isw_eeaaabgfg_mirror: 0 488390920 mirror 2 8:16 8:32 3727/3727 1 AA 1 core
Here's what's returned with this patch for a stripe "status" call:
isw_dheeijjdej_stripe: 0 976783872 striped 2 8:16 8:32 1 AA
Signed-off-by: Brian Wood <brian.j.wood@intel.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds the stripe_end_io function to process errors that might
occur after an IO operation. As part of this there are a number of
enhancements made to record and trigger events:
- New atomic variable in struct stripe to record the number of
errors each stripe volume device has experienced (could be used
later with uevents to report back directly to userspace)
- New workqueue/work struct setup to process the trigger_event function
- New end_io function. It is here that testing for BIO error conditions
take place. It determines the exact stripe that cause the error,
records this in the new atomic variable, and calls the queue_work() function
- New trigger_event function to process failure events. This
calls dm_table_event()
Signed-off-by: Brian Wood <brian.j.wood@intel.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If the log type is not recognised, attempt to load the module
'dm-log-<type>.ko'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a single-thread workqueue for each mapped device
and move flushing of the lists of pushback and deferred bios
to this new workqueue.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Move encrypt/decrypt core to async crypto call.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Prepare callback function for async crypto operation.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Prepare completion for async crypto request.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Introduce mempool for async crypto requests.
cc->req is used mainly during synchronous operations
(to prevent allocation and deallocation of the same object).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Move scatterlists to separate dm_crypt_struct and
pick out block processing from crypt_convert.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>