Maintain two flows, one for pow2 chunk sizes (which uses masks and
shift), and a flow for the general case (which uses sector_div).
This is for the sake of performance.
- introduce map_sector and is_io_in_chunk_boundary to encapsulate
those two flows better for raid0_make_request
- fix blk_mergeable to support the two flows.
Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
Remove chunk size check from md as this is now performed in the run
function in each personality.
Replace chunk size power 2 code calculations by a regular division.
Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
have raid0 check chunk size in run method instead of in md.
This is part of a series moving the checks from common code to
the personalities where they belong.
hardsect is short and chunksize is an int, so it is safe to use %.
Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
Replace the linear search with binary search in which_dev.
Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Remove num_sectors from dev_info and replace start_sector with
end_sector. This makes a lot of comparisons much simpler.
Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Get rid of sector_div and hash table for linear raid and replace
with a linear search in which_dev.
The hash table adds a lot of complexity for little if any gain.
Ultimately a binary search will be used which will have smaller
cache foot print, a similar number of memory access, and no
divisions.
Signed-off-by: Sandeep K Sinha <sandeepksinha@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Having a macro just to cast a void* isn't really helpful.
I would must rather see that we are simply de-referencing ->private,
than have to know what the macro does.
So open code the macro everywhere and remove the pointless cast.
Signed-off-by: NeilBrown <neilb@suse.de>
This setting doesn't seem to make sense (half the chunk size??) and
shouldn't be needed.
The segment boundary exported by raid0 should simply be the minimum
of the segment boundary of all component devices. And we already
get that right.
Signed-off-by: NeilBrown <neilb@suse.de>
If we treat conf->devlist more like a 2 dimensional array,
we can get the devlist for a particular zone simply by indexing
that array, so we don't need to store the pointers to subarrays
in strip_zone. This makes strip_zone smaller and so (hopefully)
searches faster.
Signed-of-by: NeilBrown <neilb@suse.de>
storing ->sectors is redundant as is can be computed from the
difference z->zone_end - (z-1)->zone_end
The one place where it is used, it is just as efficient to use
a zone_end value instead.
And removing it makes strip_zone smaller, so they array of these that
is searched on every request has a better chance to say in cache.
So discard the field and get the value from elsewhere.
Signed-off-by: NeilBrown <neilb@suse.de>
raid0_stop() removes all references to the raid0 configuration but
misses to free the ->devlist buffer.
This patch closes this leak, removes a pointless initialization and
fixes a coding style issue in raid0_stop().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the raid0 configuration is allocated in raid0_run() while
the buffers for the strip_zone and the dev_list arrays are allocated
in create_strip_zones(). On errors, all three buffers are freed
in raid0_run().
It's easier and more readable to do the allocation and cleanup within
a single function. So move that code into create_strip_zones().
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently raid0_run() always returns -ENOMEM on errors. This is
incorrect as running the array might fail for other reasons, for
example because not all component devices were available.
This patch changes create_strip_zones() so that it returns a proper
error code (either -ENOMEM or -EINVAL) rather than 1 on errors and
makes raid0_run(), its single caller, return that value instead
of -ENOMEM.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
The "sector_shift" and "spacing" fields of struct raid0_private_data
were only used for the hash table lookups. So the removal of the
hash table allows get rid of these fields as well which simplifies
create_strip_zones() and raid0_run() quite a bit.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
The raid0 hash table has become unused due to the changes in the
previous patch. This patch removes the hash table allocation and
setup code and kills the hash_table field of struct raid0_private_data.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
1/ remove current_start. The same value is available in
zone->dev_start and storing it separately doesn't gain anything.
2/ rename curr_zone_start to curr_zone_end as we are now more
focused on the 'end' of each zone. We end up storing the
same number though - the old name was a little confusing
(and what does 'current' mean in this context anyway).
Signed-off-by: NeilBrown <neilb@suse.de>
The number of strip_zones of a raid0 array is bounded by the number of
drives in the array and is in fact much smaller for typical setups. For
example, any raid0 array containing identical disks will have only
a single strip_zone.
Therefore, the hash tables which are used for quickly finding the
strip_zone that holds a particular sector are of questionable value
and add quite a bit of unnecessary complexity.
This patch replaces the hash table lookup by equivalent code which
simply loops over all strip zones to find the zone that holds the
given sector.
In order to make this loop as fast as possible, the zone->start field
of struct strip_zone has been renamed to zone_end, and it now stores
the beginning of the next zone in sectors. This allows to save one
addition in the loop.
Subsequent cleanup patches will remove the hash table structure.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
block: add request clone interface (v2)
floppy: fix hibernation
ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
fs/bio.c: add missing __user annotation
block: prevent possible io_context->refcount overflow
Add serial number support for virtio_blk, V4a
block: Add missing bounce_pfn stacking and fix comments
Revert "block: Fix bounce limit setting in DM"
cciss: decode unit attention in SCSI error handling code
cciss: Remove no longer needed sendcmd reject processing code
cciss: change SCSI error handling routines to work with interrupts enabled.
cciss: separate error processing and command retrying code in sendcmd_withirq_core()
cciss: factor out fix target status processing code from sendcmd functions
cciss: simplify interface of sendcmd() and sendcmd_withirq()
cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
block: needs to set the residual length of a bidi request
Revert "block: implement blkdev_readpages"
block: Fix bounce limit setting in DM
Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
...
Manually fix conflicts with tracing updates in:
block/blk-sysfs.c
drivers/ide/ide-atapi.c
drivers/ide/ide-cd.c
drivers/ide/ide-floppy.c
drivers/ide/ide-tape.c
include/trace/events/block.h
kernel/trace/blktrace.c
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
Revert "x86, bts: reenable ptrace branch trace support"
tracing: do not translate event helper macros in print format
ftrace/documentation: fix typo in function grapher name
tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
tracing: add protection around module events unload
tracing: add trace_seq_vprint interface
tracing: fix the block trace points print size
tracing/events: convert block trace points to TRACE_EVENT()
ring-buffer: fix ret in rb_add_time_stamp
ring-buffer: pass in lockdep class key for reader_lock
tracing: add annotation to what type of stack trace is recorded
tracing: fix multiple use of __print_flags and __print_symbolic
tracing/events: fix output format of user stack
tracing/events: fix output format of kernel stack
tracing/trace_stack: fix the number of entries in the header
ring-buffer: discard timestamps that are at the start of the buffer
ring-buffer: try to discard unneeded timestamps
ring-buffer: fix bug in ring_buffer_discard_commit
ftrace: do not profile functions when disabled
tracing: make trace pipe recognize latency format flag
...
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:
- zero-copy and per-cpu splice() tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions
...
Cons:
- no dev_t info for the output of plug, unplug_timer and unplug_io events.
no dev_t info for getrq and sleeprq events if bio == NULL.
no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
This is mainly because we can't get the deivce from a request queue.
But this may change in the future.
- A packet command is converted to a string in TP_assign, not TP_print.
While blktrace do the convertion just before output.
Since pc requests should be rather rare, this is not a big issue.
- In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
has a unique format, which means we have some unused data in a trace entry.
The overhead is minimized by using __dynamic_array() instead of __array().
I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
dd dd + ioctl blktrace dd + TRACE_EVENT (splice)
1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s
2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s
3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s
So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.
And the binary output of TRACE_EVENT is much smaller than blktrace:
# ls -l -h
-rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
-rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
-rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
Following are some comparisons between TRACE_EVENT and blktrace:
plug:
kjournald-480 [000] 303.084981: block_plug: [kjournald]
kjournald-480 [000] 303.084981: 8,0 P N [kjournald]
unplug_io:
kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1
remap:
kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384
bio_backmerge:
kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald]
getrq:
kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald]
bash-2066 [001] 1072.953770: 8,0 G N [bash]
bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
rq_complete:
konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0]
ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0]
ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
rq_insert:
kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald]
Changelog from v2 -> v3:
- use the newly introduced __dynamic_array().
Changelog from v1 -> v2:
- use __string() instead of __array() to minimize the memory required
to store hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that we support changing the chunksize, we calculate
"reshape_sectors" to be the max of number of sectors in old
and new chunk size.
However there is one please where we still use 'chunksize'
rather than 'reshape_sectors'.
This causes a reshape that reduces the size of chunks to freeze.
Signed-off-by: NeilBrown <neilb@suse.de>
md has functionality to 'quiesce' and array so that all pending
IO completed and no new IO starts. This is used to achieve a
stable state before making internal changes.
Currently this quiescing applies equally to normal IO, resync
IO, and reshape IO.
However there is a problem with applying it to reshape IO.
Reshape can have multiple 'stripe_heads' that must be active together.
If the quiesce come between allocating the first and the last of
such a collection, then we deadlock, as the last will not be allocated
until the quiesce is lifted, the quiesce will not be lifted until the
first (which has been allocated) gets used, and that first cannot be
used until the last is allocated.
It is not necessary to inhibit reshape IO when a quiesce is
requested. Those places in the code that require a full quiesce will
ensure the reshape thread is not running at all.
So allow reshape requests to get access to new stripe_heads without
being blocked by a 'quiesce'.
This only affects in-place reshapes (i.e. where the array does not
grow or shrink) and these are only newly supported. So this patch is
not needed in earlier kernels.
Signed-off-by: NeilBrown <neilb@suse.de>
mddev->raid_disks can be changed and any time by a request from
user-space. It is a suggestion as to what number of raid_disks is
desired.
conf->raid_disks can only be changed by the raid5 module with suitable
locks in place. It is a statement as to the current number of
raid_disks.
There are two places where the latter should be used, but the former
is used. This can lead to a crash when reshaping an array.
This patch changes to mddev-> to conf->
Signed-off-by: NeilBrown <neilb@suse.de>
blk_queue_bounce_limit() is more than a wrapper about the request queue
limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn() which can
be called by stacking drivers that wish to set the bounce limit
explicitly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
A recent patch to raid5.c use min on an int and a sector_t.
This isn't allowed.
So change it to min_t(sector_t,x,y).
Signed-off-by: NeilBrown <neilb@suse.de>
In order for the metadata to always be consistent, we mustn't updated
curr_resync_completed without also updating reshape_position.
The reshape code updates both at the same time. However since
commit 97e4f42d62
the common md_do_sync will sometimes update curr_resync_completed
but is not in a position to update reshape_position.
So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is
happening, so reshape_position might change), don't update
curr_resync_completed in md_do_sync, leave it to the per-personality
reshape code.
Signed-off-by: NeilBrown <neilb@suse.de>
As sector_t in unsigned, we cannot afford to let 'safepos' etc go
negative.
So replace
a -= b;
by
a -= min(b,a);
Signed-off-by: NeilBrown <neilb@suse.de>
The md resync engine has a 'frozen' state which ensures that
no resync/recovery. This is used to avoid races.
Export this state through the 'sync_action' sysfs attribute
so that user-space can benefit and also avoid some races.
Signed-off-by: NeilBrown <neilb@suse.de>
The code for checking which bits in the bitmap can be cleared
has 2 problems:
1/ it repeatedly takes and drops a spinlock, where it would make
more sense to just hold on to it most of the time.
2/ it doesn't make use of some opportunities to skip large sections
of the bitmap
This patch fixes those. It will only affect CPU consumption, not
correctness.
Signed-off-by: NeilBrown <neilb@suse.de>
Instead of always returns EINVAL if anything goes wrong
when setting the array size, add the option of
E2BIG
if the size requested is too large. This makes it easier
for user-space to be sure what went wrong.
Signed-off-by: NeilBrown <neilb@suse.de>
We previously didn't update these fields when writing the metadata
because they could never change. They can now, so we better write
them.
v0.90 metadata always updated these fields.
Signed-off-by: NeilBrown <neilb@suse.de>
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.
This patch renames hardsect_size to logical_block_size.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
on a handful of tracing fixes present in .30-rc5-almost.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
md maintains link in sys/mdXX/md/ to identify which device has
which role in the array. e.g.
rd2 -> dev-sda
indicates that the device with role '2' in the array is sda.
These links are only present when the array is active. They are
created immediately after ->run is called, and so should be removed
immediately after ->stop is called.
However they are currently removed a little bit later, and it is
possible for ->run to be called again, thus adding these links, before
they are removed.
So move the removal earlier so they are consistently only present when
the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>
Being able to write 'clean' to an 'array_state' of an inactive array
to activate it in 'clean' mode is both unnecessary and inconvenient.
It is unnecessary because the same can be achieved by writing
'active'. This activates and array, but it still remains 'clean'
until the first write.
It is inconvenient because writing 'clean' is more often used to
cause an 'active' array to revert to 'clean' mode (thus blocking
any writes until a 'write-pending' is promoted to 'active').
Allowing 'clean' to both activate an array and mark an active array as
clean can lead to races: One program writes 'clean' to mark the
active array as clean at the same time as another program writes
'inactive' to deactivate (stop) and active array. Depending on which
writes first, the array could be deactivated and immediately
reactivated which isn't what was desired.
So just disable the use of 'clean' to activate an array.
This avoids a race that can be triggered with mdadm-3.0 and external
metadata, so it suitable for -stable.
Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Two problems in status_resync.
1/ It still used Kilobytes as the basic block unit, while most code
now uses sectors uniformly.
2/ It doesn't allow for the possibility that max_sectors exceeds
the range of "unsigned long".
So
- change "max_blocks" to "max_sectors", and store sector numbers
in there and in 'resync'
- Make 'rt' a 'sector_t' so it can temporarily hold the number of
remaining sectors.
- use sector_div rather than normal division.
- change the magic '100' used to preserve precision to '32'.
+ making it a power of 2 makes division easier
+ it doesn't need to be as large as it was chosen when we averaged
speed over the entire run. Now we average speed over the last 30
seconds or so.
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>
If a write intent bitmap covers more than 2TB, we sometimes work with
values beyond 32bit, so these need to be sector_t. This patches
add the required casts to some unsigned longs that are being shifted
up.
This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with
member devices that are larger than 2TB.
Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Cc: stable@kernel.org
If we have a raid10 with multiple missing devices, and we recover just
one of these to a spare, then we risk (depending on the bitmap and
array chunk size) clearing bits of the bitmap for which recovery isn't
complete (because a device is still missing).
This can lead to a subsequent "re-add" being recovered without
any IO happening, which would result in loss of data.
This patch takes the safe approach of not clearing bitmap bits
if the array will still be degraded.
This patch is suitable for all active -stable kernels.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
When md is loading a bitmap which it knows is out of date, it fills
each page with 1s and writes it back out again. However the
write_page call makes used of bitmap->file_pages and
bitmap->last_page_size which haven't been set correctly yet. So this
can sometimes fail.
Move the setting of file_pages and last_page_size to before the call
to write_page.
This bug can cause the assembly on an array to fail, thus making the
data inaccessible. Hence I think it is a suitable candidate for
-stable.
Cc: stable@kernel.org
Reported-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md:
md: support bitmaps on RAID10 arrays larger then 2 terabytes
md: update sync_completed and reshape_position even more often.
md: improve usefulness and accuracy of sysfs file md/sync_completed.
md: allow setting newly added device to 'in_sync' via sysfs.
md: tiny md.h cleanups