.. and other arrays with components larger than 2 terabytes.
We use a "long" rather than a "sector_t" in part of the bitmap
size calculations, which is sad.
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>
There are circumstances when a user-space process might need to
"oversee" a resync/reshape process. For example when doing an
in-place reshape of a raid5, it is prudent to take a backup of each
section before reshaping it as this is the only way to provide
safety against an unplanned shutdown (i.e. crash/power failure).
The sync_max sysfs value can be used to stop the resync from
advancing beyond a particular point.
So user-space can:
suspend IO to the first section and back it up
set 'sync_max' to the end of the section
wait for 'sync_completed' to reach that point
resume IO on the first section and move on to the next section.
However this process requires the kernel and user-space to run in
lock-step which could introduce unnecessary delays.
It would be better if a 'double buffered' approach could be used with
userspace and kernel space working on different sections with the
'next' section always ready when the 'current' section is finished.
One problem with implementing this is that sync_completed is only
guaranteed to be updated when the sync process reaches sync_max.
(it is updated on a time basis at other times, but it is hard to rely
on that). This defeats some of the double buffering.
With this patch, sync_completed (and reshape_position) get updated as
the current position approaches sync_max, so there is room for
userspace to advance sync_max early without losing updates.
To be precise, sync_completed is updated when the current sync
position reaches half way between the current value of sync_completed
and the value of sync_max. This will usually be a good time for user
space to update sync_max.
If sync_max does not get updated, the updates to sync_completed
(together with associated metadata updates) will occur at an
exponentially increasing frequency which will get unreasonably fast
(one update every page) immediately before the process hits sync_max
and stops. So the update rate will be unreasonably fast only for an
insignificant period of time.
Signed-off-by: NeilBrown <neilb@suse.de>
The sync_completed file reports how much of a resync (or recovery or
reshape) has been completed.
However due to the possibility of out-of-order completion of writes,
it is not certain to be accurate.
We have an internal value - mddev->curr_resync_completed - which is an
accurate value (though it might not always be quite so uptodate).
So:
- make curr_resync_completed be uptodate a little more often,
particularly when raid5 reshape updates status in the metadata
- report curr_resync_completed in the sysfs file
- allow poll/select to report all updates to md/sync_completed.
This makes sync_completed completed usable by any external metadata
handler that wants to record this status information in its metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
When adding devices to an active array via sysfs, there is currently
no way to mark a device as 'in-sync' which is useful when
incrementally assembling an array.
So add that option.
Signed-off-by: NeilBrown <neilb@suse.de>
- update inclusion guard and make sure it covers the whole file
- remove superflous #ifdef CONFIG_BLOCK
- make sure all required headers are included so that new users aren't
required to include others before
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
If the thread calling dm_kcopyd_copy is delayed due to scheduling inside
split_job/segment_complete and the subjobs complete before the loop in
split_job completes, the kcopyd callback could be invoked from the
thread that called dm_kcopyd_copy instead of the kcopyd workqueue.
dm_kcopyd_copy -> split_job -> segment_complete -> job->fn()
Snapshots depend on the fact that callbacks are called from the singlethreaded
kcopyd workqueue and expect that there is no racing between individual
callbacks. The racing between callbacks can lead to corruption of exception
store and it can also mean that exception store callbacks are called twice
for the same exception - a likely reason for crashes reported inside
pending_complete() / remove_exception().
This patch fixes two problems:
1. job->fn being called from the thread that submitted the job (see above).
- Fix: hand over the completion callback to the kcopyd thread.
2. job->fn(read_err, write_err, job->context); in segment_complete
reports the error of the last subjob, not the union of all errors.
- Fix: pass job->write_err to the callback to report all error bits
(it is done already in run_complete_job)
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use a variable in segment_complete() to point to the dm_kcopyd_client
struct and only release job->pages in run_complete_job() if any are
defined. These changes are needed by the next patch.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Barriers are submitted to a worker thread that issues them in-order.
The thread is modified so that when it sees a barrier request it waits
for all pending IO before the request then submits the barrier and
waits for it. (We must wait, otherwise it could be intermixed with
following requests.)
Errors from the barrier request are recorded in a per-device barrier_error
variable. There may be only one barrier request in progress at once.
For now, the barrier request is converted to a non-barrier request when
sending it to the underlying device.
This patch guarantees correct barrier behavior if the underlying device
doesn't perform write-back caching. The same requirement existed before
barriers were supported in dm.
Bottom layer barrier support (sending barriers by target drivers) and
handling devices with write-back caches will be done in further patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove queue_io return value and a loop in dm_request.
IO may be submitted to a worker thread with queue_io(). queue_io() sets
DMF_QUEUE_IO_TO_THREAD so that all further IO is queued for the thread. When
the thread finishes its work, it clears DMF_QUEUE_IO_TO_THREAD and from this
point on, requests are submitted from dm_request again. This will be used
for processing barriers.
Remove the loop in dm_request. queue_io() can submit I/Os to the worker thread
even if DMF_QUEUE_IO_TO_THREAD was not set.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rework shutting down on suspend and document the associated rules.
Drop write lock in __split_and_process_bio to allow more processing
concurrency.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Refactor the code in dm_request().
Require the new DMF_BLOCK_FOR_SUSPEND flag on readahead bios we will
discard so we don't drop such bios while processing a barrier.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Split the DMF_BLOCK_IO flag into two.
DMF_BLOCK_IO_FOR_SUSPEND is set when I/O must be blocked while suspending a
device. DMF_QUEUE_IO_TO_THREAD is set when I/O must be queued to a
worker thread for later processing.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Prepare for full barrier implementation: first remove the restricted support.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch provides support for data integrity passthrough in the device
mapper.
- If one or more component devices support integrity an integrity
profile is preallocated for the DM device.
- If all component devices have compatible profiles the DM device is
flagged as capable.
- Handle integrity metadata when splitting and cloning bios.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix this build error:
drivers/md/raid1.c: In function 'raid1_congested':
drivers/md/raid1.c:589: error: 'BDI_write_congested' undeclared
BDI_write_congested was changed in commit 1faa16d228 ("block: change the
request allocation/congestion logic to be sync/async based")
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit d3f761104b
newly allocated bvecs aren't initialised to NULL, so we have
to be more careful about freeing a bio which only managed
to get a few pages allocated to it. Otherwise the resync
process crashes.
This patch is appropriate for 2.6.29-stable.
Cc: stable@kernel.org
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Reported-by: Gabriele Tozzi <gabriele@tozzi.eu>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md: (53 commits)
md/raid5 revise rules for when to update metadata during reshape
md/raid5: minor code cleanups in make_request.
md: remove CONFIG_MD_RAID_RESHAPE config option.
md/raid5: be more careful about write ordering when reshaping.
md: don't display meaningless values in sysfs files resync_start and sync_speed
md/raid5: allow layout and chunksize to be changed on active array.
md/raid5: reshape using largest of old and new chunk size
md/raid5: prepare for allowing reshape to change layout
md/raid5: prepare for allowing reshape to change chunksize.
md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
Documentation/md.txt update
md: allow number of drives in raid5 to be reduced
md/raid5: change reshape-progress measurement to cope with reshaping backwards.
md: add explicit method to signal the end of a reshape.
md/raid5: enhance raid5_size to work correctly with negative delta_disks
md/raid5: drop qd_idx from r6_state
md/raid6: move raid6 data processing to raid6_pq.ko
md: raid5 run(): Fix max_degraded for raid level 4.
md: 'array_size' sysfs attribute
md: centralize ->array_sectors modifications
...
Set queue ordered mode. It doesn't really matter what we set here
because we don't ever put any requests on the queue. But we need to set
something other than QUEUE_ORDERED_NONE so that __generic_make_request
passes barrier requests to us.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move wait queue declaration and unplug to dm_wait_for_completion.
The purpose is to minimize duplicate code in the further patches.
The patch reorders functions a little bit. It doesn't change any
functionality. For proper non-deadlock operation, add_wait_queue must
happen before set_current_state(interruptible) and before the test for
!atomic_read(&md->pending).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Merge pushback and deferred lists into one list - use deferred list
for both deferred and pushed-back bios.
This will be needed for proper support of barrier bios: it is impossible to
support ordering correctly with two lists because the requests on both lists
will be mixed up.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allow uninterruptible wait for pending IOs.
Add argument "interruptible" to dm_wait_for_completion that specifies
either interruptible or uninterruptible waiting.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Merge __flush_deferred_io() into the only caller, dm_wq_work().
There's no need to have a function that has only one caller.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the bio_io_error() calls directly into __split_and_process_bio().
This avoids some code duplication in later patches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename __split_bio() to __split_and_process_bio() because it not only splits
the bio to serveral parts, but also submits them to target drivers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove struct dm_wq_req and move "work" directly into struct mapped_device.
In the revised implementation, the thread will do just one type of work
(processing the queue).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the context field from struct dm_wq_req because we will no longer
need it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove "type" field from struct dm_wq_req because we no longer need it
to have more than one value.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce a function that adds a bio to the head of the list for
use by the patch that will support barriers.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The persistent exception store destructor does not properly
account for all conditions in which it can be called. If it
is called after 'ctr' but before 'read_metadata' (e.g. if
something else in 'snapshot_ctr' fails) then it will attempt
to free areas of memory that haven't been allocated yet.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Let the exception store types print out their status through
the new API, rather than having the snapshot code do it.
Adjust the buffer position to allow for the preceding DMEMIT in the
arguments to type->status().
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
First step of having the exception stores parse their own arguments -
generalizing the interface.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use DMEMIT in place of snprintf. This makes it easier later when
other modules are helping to populate our status output.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move some of the last bits from dm-snap.h into dm-snap.c where they
belong and remove dm-snap.h.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move useful functions out of dm-snap.h and stop using dm-snap.h.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move COW device from snapshot to exception store.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move chunk fields from snapshot to exception store.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move target pointer from snapshot to exception store.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The logging API needs an extra function to make cluster mirroring
possible. This new function allows us to check whether a mirror
region is being recovered on another machine in the cluster. This
helps us prevent simultaneous recovery I/O and process I/O to the
same locations on disk.
Cluster-aware log modules will implement this function. Single
machine log modules will not. So, there is no performance
penalty for single machine mirrors.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the 'dm_dirty_log_internal' structure. The resulting cleanup
eliminates extra memory allocations. Therefore exposing the internal
list_head to the external 'dm_dirty_log_type' structure is a worthwhile
compromise.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Avoid private module usage accounting by removing 'use' from
dm_dirty_log_internal. The standard module reference counting is
sufficient.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use kzfree() instead of memset() + kfree().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The tt_internal is really just a list_head to manage registered target_type
in a double linked list,
Here embed the list_head into target_type directly,
1. to avoid kmalloc/kfree;
2. then tt_internal is really unneeded;
Cc: stable@kernel.org
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
upgrade_mode() sets bdev to NULL temporarily, and does not have any
locking to exclude anything from seeing that NULL.
In dm_table_any_congested() bdev_get_queue() can dereference that NULL and
cause a reported oops.
Fix this by not changing that field during the mode upgrade.
Cc: stable@kernel.org
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The tt_internal's 'use' field is superfluous: the module's refcount can do
the work properly. An acceptable side-effect is that this increases the
reference counts reported by 'lsmod'.
Remove the superfluous test when removing a target module.
[Crash possible without this on SMP - agk]
Cc: stable@kernel.org
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
We need to check if the exception was completed after dropping the lock.
After regaining the lock, __find_pending_exception checks if the exception
was already placed into &s->pending hash.
But we don't check if the exception was already completed and placed into
&s->complete hash. If the process waiting in alloc_pending_exception was
delayed at this point because of a scheduling latency and the exception
was meanwhile completed, we'd miss that and allocate another pending
exception for already completed chunk.
It would lead to a situation where two records for the same chunk exist
and potential data corruption because multiple snapshot I/Os to the
affected chunk could be redirected to different locations in the
snapshot.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>