2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-03 11:13:56 +08:00
Commit Graph

3007 Commits

Author SHA1 Message Date
Kent Overstreet
8aee122071 bcache: Kill sequential_merge option
It never really made sense to expose this, so just kill it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:39 -08:00
Kent Overstreet
50310164bc bcache: Kill bch_next_recurse_key()
This dates from before the btree iterator, and now it's finally gone

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:39 -08:00
Kent Overstreet
bc9389eefe bcache: Avoid deadlocking in garbage collection
Not a complete fix - we could still deadlock if btree_insert_node() has
to split...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:38 -08:00
Kent Overstreet
a1f0358b2b bcache: Incremental gc
Big garbage collection rewrite; now, garbage collection uses the same
mechanisms as used elsewhere for inserting/updating btree node pointers,
instead of rewriting interior btree nodes in place.

This makes the code significantly cleaner and less fragile, and means we
can now make garbage collection incremental - it doesn't have to hold a
write lock on the root of the btree for the entire duration of garbage
collection.

This means that there's less of a latency hit for doing garbage
collection, which means we can gc more frequently (and do a better job
of reclaiming from the cache), and we can coalesce across more btree
nodes (improving our space efficiency).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:37 -08:00
Kent Overstreet
8835c1234d bcache: Add make_btree_freeing_key()
Refactoring, prep work for incremental garbage collection.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:37 -08:00
Kent Overstreet
f269af5a07 bcache: Add btree_node_write_sync()
More refactoring - mostly making the interfaces more explicit about what
we actually want to do.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:36 -08:00
Kent Overstreet
0eacac2203 bcache: PRECEDING_KEY()
btree_insert_key() was open coding this, this is just refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:36 -08:00
Kent Overstreet
d5cc66e957 bcache: bch_(btree|extent)_ptr_invalid()
Trying to treat btree pointers and leaf node pointers the same way was a
mistake - going to start being more explicit about the type of
key/pointer we're dealing with. This is the first part of that
refactoring; this patch shouldn't change any actual behaviour.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:35 -08:00
Kent Overstreet
3a3b6a4e07 bcache: Don't bother with bucket refcount for btree node allocations
The bucket refcount (dropped with bkey_put()) is only needed to prevent
the newly allocated bucket from being garbage collected until we've
added a pointer to it somewhere. But for btree node allocations, the
fact that we have btree nodes locked is enough to guard against races
with garbage collection.

Eventually the per bucket refcount is going to be replaced with
something specific to bch_alloc_sectors().

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:34 -08:00
Kent Overstreet
280481d06c bcache: Debug code improvements
Couple changes:
 * Consolidate bch_check_keys() and bch_check_key_order(), and move the
   checks that only check_key_order() could do to bch_btree_iter_next().

 * Get rid of CONFIG_BCACHE_EDEBUG - now, all that code is compiled in
   when CONFIG_BCACHE_DEBUG is enabled, and there's now a sysfs file to
   flip on the EDEBUG checks at runtime.

 * Dropped an old not terribly useful check in rw_unlock(), and
   refactored/improved a some of the other debug code.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:34 -08:00
Kent Overstreet
e58ff15503 bcache: Fix bch_ptr_bad()
Previously, bch_ptr_bad() could return false when there was a pointer to
a nonexistant device... it only filtered out keys with PTR_CHECK_DEV
pointers.

This behaviour was intended for multiple cache device support; for that,
just because the device for one of the pointers has gone away doesn't
mean we want to filter out the rest of the pointers.

But we don't yet explicitly filter/check individual pointers, so without
that this behaviour was wrong - a corrupt bkey with a bad device pointer
could cause us to deref a bad pointer. Doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:33 -08:00
Kent Overstreet
81ab4190ac bcache: Pull on disk data structures out into a separate header
Now, the on disk data structures are in a header that can be exported to
userspace - and having them all centralized is nice too.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:33 -08:00
Kent Overstreet
2599b53b7b bcache: Move sector allocator to alloc.c
Just reorganizing things a bit.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:32 -08:00
Kent Overstreet
220bb38c21 bcache: Break up struct search
With all the recent refactoring around struct btree op struct search has
gotten rather large.

But we can now easily break it up in a different way - we break out
struct btree_insert_op which is for inserting data into the cache, and
that's now what the copying gc code uses - struct search is now specific
to request.c

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:32 -08:00
Kent Overstreet
cc7b881921 bcache: Convert bch_btree_insert() to bch_btree_map_leaf_nodes()
Last of the btree_map() conversions. Main visible effect is
bch_btree_insert() is no longer taking a struct btree_op as an argument
anymore - there's no fancy state machine stuff going on, it's just a
normal function.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:31 -08:00
Kent Overstreet
6054c6d4da bcache: Don't use op->insert_collision
When we convert bch_btree_insert() to bch_btree_map_leaf_nodes(), we
won't be passing struct btree_op to bch_btree_insert() anymore - so we
need a different way of returning whether there was a collision (really,
a replace collision).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:30 -08:00
Kent Overstreet
1b207d80d5 bcache: Kill op->replace
This is prep work for converting bch_btree_insert to
bch_btree_map_leaf_nodes() - we have to convert all its arguments to
actual arguments. Bunch of churn, but should be straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:29 -08:00
Kent Overstreet
faadf0c965 bcache: Drop some closure stuff
With a the recent bcache refactoring, some of the closure code isn't
needed anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:10 -08:00
Kent Overstreet
b54d6934da bcache: Kill op->cl
This isn't used for waiting asynchronously anymore - so this is a fairly
trivial refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:09 -08:00
Kent Overstreet
c18536a72d bcache: Prune struct btree_op
Eventual goal is for struct btree_op to contain only what is necessary
for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:08 -08:00
Kent Overstreet
cc23196631 bcache: Clean up cache_lookup_fn
There was some looping in submit_partial_cache_hit() and
submit_partial_cache_hit() that isn't needed anymore - originally, we
wouldn't necessarily process the full hit or miss all at once because
when splitting the bio, we took into account the restrictions of the
device we were sending it to.

But, device bio size restrictions are now handled elsewhere, with a
wrapper around generic_make_request() - so that looping has been
unnecessary for awhile now and we can now do quite a bit of cleanup.

And if we trim the key we're reading from to match the subset we're
actually reading, we don't have to explicitly calculate bi_sector
anymore. Neat.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:08 -08:00
Kent Overstreet
2c1953e201 bcache: Convert bch_btree_read_async() to bch_btree_map_keys()
This is a fairly straightforward conversion, mostly reshuffling -
op->lookup_done goes away, replaced by MAP_DONE/MAP_CONTINUE. And the
code for handling cache hits and misses wasn't really btree code, so it
gets moved to request.c.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:07 -08:00
Kent Overstreet
df8e89701f bcache: Move some stuff to btree.c
With the new btree_map() functions, we don't need to export the stuff
needed for traversing the btree anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:07 -08:00
Kent Overstreet
48dad8baf9 bcache: Add btree_map() functions
Lots of stuff has been open coding its own btree traversal - which is
generally pretty simple code, but there are a few subtleties.

This adds new new functions, bch_btree_map_nodes() and
bch_btree_map_keys(), which do the traversal for you. Everything that's
open coding btree traversal now (with the exception of garbage
collection) is slowly going to be converted to these two functions;
being able to write other code at a higher level of abstraction  is a
big improvement w.r.t. overall code quality.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:06 -08:00
Kent Overstreet
5e6926daac bcache: Convert writeback to a kthread
This simplifies the writeback flow control quite a bit - previously, it
was conceptually two coroutines, refill_dirty() and read_dirty(). This
makes the code quite a bit more straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:05 -08:00
Kent Overstreet
72a44517f3 bcache: Convert gc to a kthread
We needed a dedicated rescuer workqueue for gc anyways... and gc was
conceptually a dedicated thread, just one that wasn't running all the
time. Switch it to a dedicated thread to make the code a bit more
straightforward.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:04 -08:00
Kent Overstreet
35fcd848d7 bcache: Convert bucket_wait to wait_queue_head_t
At one point we did do fancy asynchronous waiting stuff with
bucket_wait, but that's all gone (and bucket_wait is used a lot less
than it used to be). So use the standard primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:04 -08:00
Kent Overstreet
e8e1d4682c bcache: Convert try_wait to wait_queue_head_t
We never waited on c->try_wait asynchronously, so just use the standard
primitives.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:03 -08:00
Kent Overstreet
0b93207abb bcache: Move keylist out of btree_op
Slowly working on pruning struct btree_op - the aim is for it to only
contain things that are actually necessary for traversing the btree.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:02 -08:00
Kent Overstreet
a34a8bfd4e bcache: Refactor journalling flow control
Making things less asynchronous that don't need to be - bch_journal()
only has to block when the journal or journal entry is full, which is
emphatically not a fast path. So make it a normal function that just
returns when it finishes, to make the code and control flow easier to
follow.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:02 -08:00
Kent Overstreet
cdd972b164 bcache: Refactor read request code a bit
More refactoring, and renaming.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:01 -08:00
Kent Overstreet
84f0db03ea bcache: Refactor request_write()
Try to improve some of the naming a bit to be more consistent, and also
improve the flow of control in request_write() a bit.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:00 -08:00
Kent Overstreet
c2f95ae2eb bcache: Clean up keylist code
More random refactoring.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:56:00 -08:00
Kent Overstreet
4f3d40147b bcache: Add explicit keylist arg to btree_insert()
Some refactoring - better to explicitly pass stuff around instead of
having it all in the "big bag of state", struct btree_op. Going to prune
struct btree_op quite a bit over time.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:59 -08:00
Kent Overstreet
e7c590eb63 bcache: Convert btree_insert_check_key() to btree_insert_node()
This was the main point of all this refactoring - now,
btree_insert_check_key() won't fail just because the leaf node happened
to be full.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:59 -08:00
Kent Overstreet
403b6cdeb1 bcache: Insert multiple keys at a time
We'll often end up with a list of adjacent keys to insert -
because bch_data_insert() may have to fragment the data it writes.

Originally, to simplify things and avoid having to deal with corner
cases bch_btree_insert() would pass keys from this list one at a time to
btree_insert_recurse() - mainly because the list of keys might span leaf
nodes, so it was easier this way.

With the btree_insert_node() refactoring, it's now a lot easier to just
pass down the whole list and have btree_insert_recurse() iterate over
leaf nodes until it's done.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:58 -08:00
Kent Overstreet
26c949f806 bcache: Add btree_insert_node()
The flow of control in the old btree insertion code was rather -
backwards; we'd recurse down the btree (in btree_insert_recurse()), and
then if we needed to split the keys to be inserted into the parent node
would be effectively returned up to btree_insert_recurse(), which would
notice there was more work to do and finish the insertion.

The main problem with this was that the full logic for btree insertion
could only be used by calling btree_insert_recurse; if you'd gotten to a
btree leaf some other way and had a key to insert, if it turned out that
node needed to be split you were SOL.

This inverts the flow of control so btree_insert_node() does _full_
btree insertion, including splitting - and takes a (leaf) btree node to
insert into as a parameter.

This means we can now _correctly_ handle cache misses - for cache
misses, we need to insert a fake "check" key into the btree when we
discover we have a cache miss - while we still have the btree locked.
Previously, if the btree node was full inserting a cache miss would just
fail.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:57 -08:00
Kent Overstreet
d6fd3b11ce bcache: Explicitly track btree node's parent
This is prep work for the reworked btree insertion code.

The way we set b->parent is ugly and hacky... the problem is, when
btree_split() or garbage collection splits or rewrites a btree node, the
parent changes for all its (potentially already cached) children.

I may change this later and add some code to look through the btree node
cache and find all our cached child nodes and change the parent pointer
then...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:57 -08:00
Kent Overstreet
8304ad4dc8 bcache: Remove unnecessary check in should_split()
Checking i->seq was redundant, because since ages ago we always
initialize the new bset when advancing b->written

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:56 -08:00
Kent Overstreet
2d679fc756 bcache: Stripe size isn't necessarily a power of two
Originally I got this right... except that the divides didn't use
do_div(), which broke 32 bit kernels. When I went to fix that, I forgot
that the raid stripe size usually isn't a power of two... doh

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:55 -08:00
Kent Overstreet
77c320eb46 bcache: Add on error panic/unregister setting
Works kind of like the ext4 setting, to panic or remount read only on
errors.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:55 -08:00
Kent Overstreet
49b1212dfa bcache: Use blkdev_issue_discard()
The old asynchronous discard code was really a relic from when all the
allocation code was asynchronous - now that allocation runs out of a
dedicated thread there's no point in keeping around all that complicated
machinery.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:54 -08:00
Kent Overstreet
dd9ec84da5 bcache: Fix a lockdep splat
bch_keybuf_del() takes a spinlock that can't be taken in interrupt context -
whoops. Fortunately, this code isn't enabled by default (you have to toggle a
sysfs thing).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2013-11-10 21:55:54 -08:00
Kent Overstreet
7857d5d470 bcache: Fix a journalling performance bug 2013-11-10 21:55:53 -08:00
Kent Overstreet
1fa8455deb bcache: Fix dirty_data accounting
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
2013-11-10 21:55:27 -08:00
Joe Thornber
c9d28d5d09 dm cache: promotion optimisation for writes
If a write block triggers promotion and covers a whole block we can
avoid a copy.

Introduce dm_{hook,unhook}_bio to simplify saving and restoring bio
fields (bi_private is now used by overwrite).  Switch writethrough
support over to using these helpers too.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:26 -05:00
Joe Thornber
c86c30706c dm cache: be much more aggressive about promoting writes to discarded blocks
Previously these promotions only got priority if there were unused cache
blocks.  Now we give them priority if there are any clean blocks in the
cache.

The fio_soak_test in the device-mapper-test-suite now gives uniform
performance across subvolumes (~16 seconds).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:25 -05:00
Joe Thornber
01911c19be dm cache policy mq: implement writeback_work() and mq_{set,clear}_dirty()
There are now two multiqueues for in cache blocks.  A clean one and a
dirty one.

writeback_work comes from the dirty one.  Demotions come from the clean
one.

There are two benefits:
- Performance improvement, since demoting a clean block is a noop.
- The cache cleans itself when io load is light.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:25 -05:00
Heinz Mauelshagen
ffcbcb6720 dm cache: optimize commit_if_needed
Check commit_requested flag _before_ calling
dm_cache_changed_this_transaction() superfluously.

Also, be sure to set last_commit_jiffies _after_ dm_cache_commit()
completes.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:24 -05:00
Joe Thornber
40c57f475f dm space map disk: optimise sm_disk_dec_block
Don't waste time spotting blocks that have been allocated and then freed
in the same transaction.

The extra lookup is expensive, and I don't think it really gives us much.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:24 -05:00
Mikulas Patocka
5442851edb dm: fix Kconfig menu indentation
The option DM_LOG_USERSPACE is sub-option of DM_MIRROR, so place it
right after DM_MIRROR.  Doing so fixes various other Device mapper
targets/features to be properly nested under "Device mapper support".

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:22 -05:00
Mikulas Patocka
2c140a246d dm: allow remove to be deferred
This patch allows the removal of an open device to be deferred until
it is closed.  (Previously such a removal attempt would fail.)

The deferred remove functionality is enabled by setting the flag
DM_DEFERRED_REMOVE in the ioctl structure on DM_DEV_REMOVE or
DM_REMOVE_ALL ioctl.

On return from DM_DEV_REMOVE, the flag DM_DEFERRED_REMOVE indicates if
the device was removed immediately or flagged to be removed on close -
if the flag is clear, the device was removed.

On return from DM_DEV_STATUS and other ioctls, the flag
DM_DEFERRED_REMOVE is set if the device is scheduled to be removed on
closure.

A device that is scheduled to be deleted can be revived using the
message "@cancel_deferred_remove". This message clears the
DMF_DEFERRED_REMOVE flag so that the device won't be deleted on close.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:22 -05:00
Mike Snitzer
7833b08e18 dm table: print error on preresume failure
If preresume fails it is worth logging an error given that a device is
left suspended due to the failure.

This change was motivated by local preresume error logging that was
added to the cache target ("preresume failed").  Elevating this
target-agnostic context for the where the target-specific error occurred
relative to the DM core's callouts makes sense.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
2013-11-09 18:20:21 -05:00
Milan Broz
ed04d98169 dm crypt: add TCW IV mode for old CBC TCRYPT containers
dm-crypt can already activate TCRYPT (TrueCrypt compatible) containers
in LRW or XTS block encryption mode.

TCRYPT containers prior to version 4.1 use CBC mode with some additional
tweaks, this patch adds support for these containers.

This new mode is implemented using special IV generator named TCW
(TrueCrypt IV with whitening).  TCW IV only supports containers that are
encrypted with one cipher (Tested with AES, Twofish, Serpent, CAST5 and
TripleDES).

While this mode is legacy and is known to be vulnerable to some
watermarking attacks (e.g. revealing of hidden disk existence) it can
still be useful to activate old containers without using 3rd party
software or for independent forensic analysis of such containers.

(Both the userspace and kernel code is an independent implementation
based on the format documentation and it completely avoids use of
original source code.)

The TCW IV generator uses two additional keys: Kw (whitening seed, size
is always 16 bytes - TCW_WHITENING_SIZE) and Kiv (IV seed, size is
always the IV size of the selected cipher).  These keys are concatenated
at the end of the main encryption key provided in mapping table.

While whitening is completely independent from IV, it is implemented
inside IV generator for simplification.

The whitening value is always 16 bytes long and is calculated per sector
from provided Kw as initial seed, xored with sector number and mixed
with CRC32 algorithm.  Resulting value is xored with ciphertext sector
content.

IV is calculated from the provided Kiv as initial IV seed and xored with
sector number.

Detailed calculation can be found in the Truecrypt documentation for
version < 4.1 and will also be described on dm-crypt site, see:
http://code.google.com/p/cryptsetup/wiki/DMCrypt

The experimental support for activation of these containers is already
present in git devel brach of cryptsetup.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:20 -05:00
Milan Broz
da31a0787a dm crypt: properly handle extra key string in initialization
Some encryption modes use extra keys (e.g. loopAES has IV seed) which
are not used in block cipher initialization but are part of key string
in table constructor.

This patch adds an additional field which describes the length of the
extra key(s) and substracts it before real key encryption setting.

The key_size always includes the size, in bytes, of the key provided
in mapping table.

The key_parts describes how many parts (usually keys) are contained in
the whole key buffer.  And key_extra_size contains size in bytes of
additional keys part (this number of bytes must be subtracted because it
is processed by the IV generator).

| K1 | K2 | .... | K64 |      Kiv       |
|----------- key_size ----------------- |
|                      |-key_extra_size-|
|     [64 keys]        |  [1 key]       | => key_parts = 65

Example where key string contains main key K, whitening key
Kw and IV seed Kiv:

|     K       |   Kiv   |       Kw      |
|--------------- key_size --------------|
|             |-----key_extra_size------|
|  [1 key]    | [1 key] |     [1 key]   | => key_parts = 3

Because key_extra_size is calculated during IV mode setting, key
initialization is moved after this step.

For now, this change has no effect to supported modes (thanks to ilog2
rounding) but it is required by the following patch.

Also, fix a sparse warning in crypt_iv_lmk_one().

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:20 -05:00
Heinz Mauelshagen
2c2263c93f dm cache: log error message if dm_kcopyd_copy() fails
A migration failure should be logged (albeit limited).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:19 -05:00
Heinz Mauelshagen
80f659f3f5 dm cache: use cell_defer() boolean argument consistently
Fix a few cell_defer() calls that weren't passing a bool.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:19 -05:00
Mikulas Patocka
4cb3e1db21 dm cache: return -EINVAL if the user specifies unknown cache policy
Return -EINVAL when the specified cache policy is unknown rather than
returning -ENOMEM.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:18 -05:00
Joe Thornber
dd8b0c2096 dm cache metadata: return bool from __superblock_all_zeroes
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:17 -05:00
Joe Thornber
0184b44e32 dm cache policy mq: a few small fixes
Rename takeout_queue to concat_queue.

Fix a harmless bug in mq policies pop() function.  Currently pop()
always succeeds, with up coming changes this wont be the case.

Fix typo in comment above pre_cache_to_cache prototype.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:17 -05:00
Joe Thornber
3351937e4a dm cache policy: remove return from void policy_remove_mapping
No need to return from a void function.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:20:16 -05:00
Joe Thornber
238f8363b6 dm cache: improve efficiency of quiescing flag management
Make the quiescing flag an atomic_t and stop protecting it with a spin
lock.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 18:19:59 -05:00
Joe Thornber
66cb1910df dm cache: fix a race condition between queuing new migrations and quiescing for a shutdown
The code that was trying to do this was inadequate.  The postsuspend
method (in ioctl context), needs to wait for the worker thread to
acknowledge the request to quiesce.  Otherwise the migration count may
drop to zero temporarily before the worker thread realises we're
quiescing.  In this case the target will be taken down, but the worker
thread may have issued a new migration, which will cause an oops when
it completes.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
2013-11-09 17:55:50 -05:00
Joe Thornber
f8e5f01a32 dm cache: io destined for the cache device can now serve as tick bios
Previously only origin bios could trigger ticks, which meant if all
the io was destined for the cache no ticks were generated.  If no ticks
are generated then multiple hits, and movements in general, are
attributed to the same tick.

Only a stop gap fix, we need a better solution.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 17:55:49 -05:00
Joe Thornber
99ba2ae4cd dm cache policy mq: protect residency method with existing mutex
It is safe to use a mutex in mq_residency() at this point since it is
only called from ioctl context.  But future-proof mq_residency() by
using might_sleep() to catch new contexts that cannot sleep.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-09 17:54:34 -05:00
Kent Overstreet
6678d83f18 block: Consolidate duplicated bio_trim() implementations
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:02:31 -07:00
Linus Torvalds
0324e74534 Driver Core / sysfs patches for 3.13-rc1
Here's the big driver core / sysfs update for 3.13-rc1.
 
 There's lots of dev_groups updates for different subsystems, as they all
 get slowly migrated over to the safe versions of the attribute groups
 (removing userspace races with the creation of the sysfs files.)  Also
 in here are some kobject updates, devres expansions, and the first round
 of Tejun's sysfs reworking to enable it to be used by other subsystems
 as a backend for an in-kernel filesystem.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlJ6xAMACgkQMUfUDdst+yk1kQCfcHXhfnrvFZ5J/mDP509IzhNS
 ddEAoLEWoivtBppNsgrWqXpD1vi4UMsE
 =JmVW
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core / sysfs patches from Greg KH:
 "Here's the big driver core / sysfs update for 3.13-rc1.

  There's lots of dev_groups updates for different subsystems, as they
  all get slowly migrated over to the safe versions of the attribute
  groups (removing userspace races with the creation of the sysfs
  files.) Also in here are some kobject updates, devres expansions, and
  the first round of Tejun's sysfs reworking to enable it to be used by
  other subsystems as a backend for an in-kernel filesystem.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits)
  sysfs: rename sysfs_assoc_lock and explain what it's about
  sysfs: use generic_file_llseek() for sysfs_file_operations
  sysfs: return correct error code on unimplemented mmap()
  mdio_bus: convert bus code to use dev_groups
  device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name
  sysfs: separate out dup filename warning into a separate function
  sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c
  sysfs: remove unused sysfs_get_dentry() prototype
  sysfs: honor bin_attr.attr.ignore_lockdep
  sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr
  devres: restore zeroing behavior of devres_alloc()
  sysfs: fix sysfs_write_file for bin file
  input: gameport: convert bus code to use dev_groups
  input: serio: remove bus usage of dev_attrs
  input: serio: use DEVICE_ATTR_RO()
  i2o: convert bus code to use dev_groups
  memstick: convert bus code to use dev_groups
  tifm: convert bus code to use dev_groups
  virtio: convert bus code to use dev_groups
  ipack: convert bus code to use dev_groups
  ...
2013-11-07 11:42:15 +09:00
Joe Thornber
9c1d4de560 dm array: fix bug in growing array
Entries would be lost if the old tail block was partially filled.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
2013-11-05 11:20:50 -05:00
Hannes Reinecke
b63349a7a5 dm mpath: requeue I/O during pg_init
When pg_init is running no I/O can be submitted to the underlying
devices, as the path priority etc might change.  When using queue_io for
this, requests will be piling up within multipath as the block I/O
scheduler just sees a _very fast_ device.  All of this queued I/O has to
be resubmitted from within multipathing once pg_init is done.

This approach has the problem that it's virtually impossible to
abort I/O when pg_init is running, and we're adding heavy load
to the devices after pg_init since all of the queued I/O needs to be
resubmitted _before_ any requests can be pulled off of the request queue
and normal operation continues.

This patch will requeue the I/O that triggers the pg_init call, and
return 'busy' when pg_init is in progress.  With these changes the block
I/O scheduler will stop submitting I/O during pg_init, resulting in a
quicker path switch and less I/O pressure (and memory consumption) after
pg_init.

Signed-off-by: Hannes Reinecke <hare@suse.de>
[patch header edited for clarity and typos by Mike Snitzer]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-05 11:20:34 -05:00
Shiva Krishna Merla
954a73d5d3 dm mpath: fix race condition between multipath_dtr and pg_init_done
Whenever multipath_dtr() is happening we must prevent queueing any
further path activation work.  Implement this by adding a new
'pg_init_disabled' flag to the multipath structure that denotes future
path activation work should be skipped if it is set.  By disabling
pg_init and then re-enabling in flush_multipath_work() we also avoid the
potential for pg_init to be initiated while suspending an mpath device.

Without this patch a race condition exists that may result in a kernel
panic:

1) If after pg_init_done() decrements pg_init_in_progress to 0, a call
   to wait_for_pg_init_completion() assumes there are no more pending path
   management commands.
2) If pg_init_required is set by pg_init_done(), due to retryable
   mode_select errors, then process_queued_ios() will again queue the
   path activation work.
3) If free_multipath() completes before activate_path() work is called a
   NULL pointer dereference like the following can be seen when
   accessing members of the recently destructed multipath:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
RIP: 0010:[<ffffffffa003db1b>]  [<ffffffffa003db1b>] activate_path+0x1b/0x30 [dm_multipath]
[<ffffffff81090ac0>] worker_thread+0x170/0x2a0
[<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40

[switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer]
Signed-off-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Reviewed-by: Krishnasamy Somasundaram <somasundaram.krishnasamy@netapp.com>
Tested-by: Speagle Andy <Andy.Speagle@netapp.com>
Acked-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-10-31 21:39:47 -04:00
Mikulas Patocka
f36afb3957 dm: allocate buffer for messages with small number of arguments using GFP_NOIO
dm-mpath and dm-thin must process messages even if some device is
suspended, so we allocate argv buffer with GFP_NOIO. These messages have
a small fixed number of arguments.

On the other hand, dm-switch needs to process bulk data using messages
so excessive use of GFP_NOIO could cause trouble.

The patch also lowers the default number of arguments from 64 to 8, so
that there is smaller load on GFP_NOIO allocations.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-10-31 13:55:45 -04:00
Shaohua Li
d47648fcf0 raid5: avoid finding "discard" stripe
SCSI discard will damage discard stripe bio setting, eg, some fields are
changed. If the stripe is reused very soon, we have wrong bios setting. We
remove discard stripe from hash list, so next time the strip will be fully
initialized.

Suitable for backport to 3.7+.

Cc: <stable@vger.kernel.org> (3.7+)
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-10-24 13:00:24 +11:00
Shaohua Li
37c61ff31e raid5: set bio bi_vcnt 0 for discard request
SCSI layer will add new payload for discard request. If two bios are merged
to one, the second bio has bi_vcnt 1 which is set in raid5. This will confuse
SCSI and cause oops.

Suitable for backport to 3.7+

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
2013-10-24 12:57:36 +11:00
Bian Yu
905b0297a9 md: avoid deadlock when md_set_badblocks.
When operate harddisk and hit errors, md_set_badblocks is called after
scsi_restart_operations which already disabled the irq. but md_set_badblocks
will call write_sequnlock_irq and enable irq. so softirq can preempt the
current thread and that may cause a deadlock. I think this situation should
use write_sequnlock_irqsave/irqrestore instead.

I met the situation and the call trace is below:
[  638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010
[  638.921923]  lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0
[  638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37
[  638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013
[  638.927816]  ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007
[  638.929829]  ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8
[  638.931848]  ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8
[  638.933884] Call Trace:
[  638.935867]  <IRQ>  [<ffffffff8172ff85>] dump_stack+0x55/0x76
[  638.937878]  [<ffffffff81730030>] spin_dump+0x8a/0x8f
[  638.939861]  [<ffffffff81730056>] spin_bug+0x21/0x26
[  638.941836]  [<ffffffff81336de4>] do_raw_spin_lock+0xa4/0xc0
[  638.943801]  [<ffffffff8173f036>] _raw_spin_lock+0x66/0x80
[  638.945747]  [<ffffffff814a73ed>] ? scsi_device_unbusy+0x9d/0xd0
[  638.947672]  [<ffffffff8173fb1b>] ? _raw_spin_unlock+0x2b/0x50
[  638.949595]  [<ffffffff814a73ed>] scsi_device_unbusy+0x9d/0xd0
[  638.951504]  [<ffffffff8149ec47>] scsi_finish_command+0x37/0xe0
[  638.953388]  [<ffffffff814a75e8>] scsi_softirq_done+0xa8/0x140
[  638.955248]  [<ffffffff8130e32b>] blk_done_softirq+0x7b/0x90
[  638.957116]  [<ffffffff8104fddd>] __do_softirq+0xfd/0x330
[  638.958987]  [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[  638.960861]  [<ffffffff8174a5cc>] call_softirq+0x1c/0x30
[  638.962724]  [<ffffffff81004c7d>] do_softirq+0x8d/0xc0
[  638.964565]  [<ffffffff8105024e>] irq_exit+0x10e/0x150
[  638.966390]  [<ffffffff8174ad4a>] smp_apic_timer_interrupt+0x4a/0x60
[  638.968223]  [<ffffffff817499af>] apic_timer_interrupt+0x6f/0x80
[  638.970079]  <EOI>  [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[  638.971899]  [<ffffffff8173fa6a>] ? _raw_spin_unlock_irq+0x3a/0x50
[  638.973691]  [<ffffffff8173fa60>] ? _raw_spin_unlock_irq+0x30/0x50
[  638.975475]  [<ffffffff81562393>] md_set_badblocks+0x1f3/0x4a0
[  638.977243]  [<ffffffff81566e07>] rdev_set_badblocks+0x27/0x80
[  638.978988]  [<ffffffffa00d97bb>] raid5_end_read_request+0x36b/0x4e0 [raid456]
[  638.980723]  [<ffffffff811b5a1d>] bio_endio+0x1d/0x40
[  638.982463]  [<ffffffff81304ff3>] req_bio_endio.isra.65+0x83/0xa0
[  638.984214]  [<ffffffff81306b9f>] blk_update_request+0x7f/0x350
[  638.985967]  [<ffffffff81306ea1>] blk_update_bidi_request+0x31/0x90
[  638.987710]  [<ffffffff813085e0>] __blk_end_bidi_request+0x20/0x50
[  638.989439]  [<ffffffff8130862f>] __blk_end_request_all+0x1f/0x30
[  638.991149]  [<ffffffff81308746>] blk_peek_request+0x106/0x250
[  638.992861]  [<ffffffff814a62a9>] ? scsi_kill_request.isra.32+0xe9/0x130
[  638.994561]  [<ffffffff814a633a>] scsi_request_fn+0x4a/0x3d0
[  638.996251]  [<ffffffff813040a7>] __blk_run_queue+0x37/0x50
[  638.997900]  [<ffffffff813045af>] blk_run_queue+0x2f/0x50
[  638.999553]  [<ffffffff814a5750>] scsi_run_queue+0xe0/0x1c0
[  639.001185]  [<ffffffff814a7721>] scsi_run_host_queues+0x21/0x40
[  639.002798]  [<ffffffff814a2e87>] scsi_restart_operations+0x177/0x200
[  639.004391]  [<ffffffff814a4fe9>] scsi_error_handler+0xc9/0xe0
[  639.005996]  [<ffffffff814a4f20>] ? scsi_unjam_host+0xd0/0xd0
[  639.007600]  [<ffffffff81072f6b>] kthread+0xdb/0xe0
[  639.009205]  [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170
[  639.010821]  [<ffffffff81748cac>] ret_from_fork+0x7c/0xb0
[  639.012437]  [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170

This bug was introduce in commit  2e8ac30312
(the first time rdev_set_badblock was call from interrupt context),
so this patch is appropriate for 3.5 and subsequent kernels.

Cc: <stable@vger.kernel.org> (3.5+)
Signed-off-by: Bian Yu <bianyu@kedacom.com>
Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-10-24 12:57:11 +11:00
Lukasz Dorau
61e4947c99 md: Fix skipping recovery for read-only arrays.
Since:
        commit 7ceb17e87b
        md: Allow devices to be re-added to a read-only array.

spares are activated on a read-only array. In case of raid1 and raid10
personalities it causes that not-in-sync devices are marked in-sync
without checking if recovery has been finished.

If a read-only array is degraded and one of its devices is not in-sync
(because the array has been only partially recovered) recovery will be skipped.

This patch adds checking if recovery has been finished before marking a device
in-sync for raid1 and raid10 personalities. In case of raid5 personality
such condition is already present (at raid5.c:6029).

Bug was introduced in 3.10 and causes data corruption.

Cc: stable@vger.kernel.org
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-10-24 12:55:17 +11:00
Kent Overstreet
d4eddd42f5 bcache: Fixed incorrect order of arguments to bio_alloc_bioset()
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-23 07:55:36 +01:00
Greg Kroah-Hartman
a7204d72db Merge 3.12-rc6 into driver-core-next
We want these fixes here too.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-19 13:05:38 -07:00
Mikulas Patocka
e9c6a18264 dm snapshot: fix data corruption
This patch fixes a particular type of data corruption that has been
encountered when loading a snapshot's metadata from disk.

When we allocate a new chunk in persistent_prepare, we increment
ps->next_free and we make sure that it doesn't point to a metadata area
by further incrementing it if necessary.

When we load metadata from disk on device activation, ps->next_free is
positioned after the last used data chunk. However, if this last used
data chunk is followed by a metadata area, ps->next_free is positioned
erroneously to the metadata area. A newly-allocated chunk is placed at
the same location as the metadata area, resulting in data or metadata
corruption.

This patch changes the code so that ps->next_free skips the metadata
area when metadata are loaded in function read_exceptions.

The patch also moves a piece of code from persistent_prepare_exception
to a separate function skip_metadata to avoid code duplication.

CVE-2013-4299

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-10-16 03:17:47 +01:00
Michael Opdenacker
aa5e5dc2a8 treewide: fix "distingush" typo
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-10-14 15:38:33 +02:00
Kent Overstreet
2fe80d3bbf bcache: Fix a null ptr deref regression
Commit c0f04d88e4 ("bcache: Fix flushes in writeback mode") was fixing
a reported data corruption bug, but it seems some last minute
refactoring or rebasing introduced a null pointer deref.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-10 18:17:39 -07:00
Greg Kroah-Hartman
88502b9c0a Merge 3.12-rc3 into driver-core-next
We want the driver core and sysfs fixes in here to make merges and
development easier.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-29 18:29:23 -07:00
Tejun Heo
388975ccca sysfs: clean up sysfs_get_dirent()
The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention.  For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.

This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns().  This makes confusions a lot less
likely.

There are other interfaces which take @ns before @name.  They'll be
updated by following patches.

This patch doesn't introduce any functional changes.

v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
    error on module builds.  Reported by build test robot.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 15:33:18 -07:00
Linus Torvalds
e93dd910b9 A set of device-mapper fixes for 3.12.
A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error
 handling fixes for dm-multipath.  A fix for the thin provisioning target
 to not expose non-zero discard limits if discards are disabled.
 
 Lastly, add two DM module parameters which allow users to tune the
 emergency memory reserves that DM mainatins per device -- this helps fix
 a long-standing issue for dm-multipath.  The conservative default
 reserve for request-based dm-multipath devices (256) has proven
 problematic for users with many multipathed SCSI devices but relatively
 little memory.  To responsibly select a smaller value users should use
 the new nr_bios tracepoint info (via commit 75afb352 "block: Add nr_bios
 to block_rq_remap tracepoint") to determine the peak number of bios
 their workloads create.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSQMVHAAoJEMUj8QotnQNaOXgIAJS6/XJKMoHfiDJ9M+XD34rZ
 Uyr9TEnubX3DKCRBiY23MUcCQn3fx6BjCGv5/c8L4jQFIuLyDi2yatqpwXcbGSJh
 G/S/y6u0Axek+ew7TS80OFop4nblW6MoKnoh9/4N55Ofa+1WvKM4ERUGjHGbauyS
 TxmLQPToCFPLYRIOZ+imd6hQuIZ1+FFdJFvi7kY9O6Llx2sLD6fWi1iruBd/Da2H
 ByMX3biGN45mSpcBzRbSC/FkJ9CRIvT9n82BDPS0o3Tllt8NaVlEDaovB7h4ncc0
 bFuT2Z3Q38B9uZ8Lj0bqdGzv3kXMLCkLo6WhWjyUt84hmDPAzRpBwt60jUqWyZs=
 =bjVp
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper fixes from Mike Snitzer:
 "A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error
  handling fixes for dm-multipath.  A fix for the thin provisioning
  target to not expose non-zero discard limits if discards are disabled.

  Lastly, add two DM module parameters which allow users to tune the
  emergency memory reserves that DM mainatins per device -- this helps
  fix a long-standing issue for dm-multipath.  The conservative default
  reserve for request-based dm-multipath devices (256) has proven
  problematic for users with many multipathed SCSI devices but
  relatively little memory.  To responsibly select a smaller value users
  should use the new nr_bios tracepoint info (via commit 75afb352
  "block: Add nr_bios to block_rq_remap tracepoint") to determine the
  peak number of bios their workloads create"

* tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: add reserved_bio_based_ios module parameter
  dm: add reserved_rq_based_ios module parameter
  dm: lower bio-based mempool reservation
  dm thin: do not expose non-zero discard limits if discards disabled
  dm mpath: disable WRITE SAME if it fails
  dm-snapshot: fix performance degradation due to small hash size
  dm snapshot: workaround for a false positive lockdep warning
  dm stats: fix possible counter corruption on 32-bit systems
  dm mpath: do not fail path on -ENOSPC
2013-09-25 15:12:46 -07:00
Kent Overstreet
c0f04d88e4 bcache: Fix flushes in writeback mode
In writeback mode, when we get a cache flush we need to make sure we
issue a flush to the backing device.

The code for sending down an extra flush was wrong - by cloning the bio
we were probably getting flags that didn't make sense for a bare flush,
and also the old code was firing for FUA bios, for which we don't need
to send a flush to the backing device.

This was causing data corruption somehow - the mechanism was never
determined, but this patch fixes it for the users that were seeing it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
84786438ed bcache: Fix for handling overlapping extents when reading in a btree node
btree_sort_fixup() was overly clever, because it was trying to avoid
pulling a key off the btree iterator in more than one place.

This led to a really obscure bug where we'd break early from the loop in
btree_sort_fixup() if the current key overlapped with keys in more than
one older set, and the next key it overlapped with was zero size.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
a698e08c82 bcache: Fix a shrinker deadlock
GFP_NOIO means we could be getting called recursively - mca_alloc() ->
mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then.
Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
79e3dab90d bcache: Fix a dumb CPU spinning bug in writeback
schedule_timeout() != schedule_timeout_uninterruptible()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
1394d6761b bcache: Fix a flush/fua performance bug
bch_journal_meta() was missing the flush to make the journal write
actually go down (instead of waiting up to journal_delay_ms)...

Whoops

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
c2a4f3183a bcache: Fix a writeback performance regression
Background writeback works by scanning the btree for dirty data and
adding those keys into a fixed size buffer, then for each dirty key in
the keybuf writing it to the backing device.

When read_dirty() finishes and it's time to scan for more dirty data, we
need to wait for the outstanding writeback IO to finish - they still
take up slots in the keybuf (so that foreground writes can check for
them to avoid races) - without that wait, we'll continually rescan when
we'll be able to add at most a key or two to the keybuf, and that takes
locks that starves foreground IO.  Doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Geert Uytterhoeven
61cbd250f8 bcache: Correct printf()-style format length modifier
Fix

  drivers/md/bcache/btree.c: In function ‘bch_btree_node_read’:
  drivers/md/bcache/btree.c:259: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘size_t’

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
c426c4fd46 bcache: Fix for when no journal entries are found
The journal replay code didn't handle this case, causing it to go into
an infinite loop...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Gabriel de Perthuis
aee6f1cfff bcache: Strip endline when writing the label through sysfs
sysfs attributes with unusual characters have crappy failure modes
in Squeeze (udev 164); later versions of udev are unaffected.

This should make these characters more unusual.

Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Kent Overstreet
6d9d21e35f bcache: Fix a dumb journal discard bug
That switch statement was obviously wrong, leading to some sort of weird
spinning on rare occasion with discards enabled...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 14:41:43 -07:00
Mike Snitzer
e8603136cb dm: add reserved_bio_based_ios module parameter
Allow user to change the number of IOs that are reserved by
bio-based DM's mempools by writing to this file:
/sys/module/dm_mod/parameters/reserved_bio_based_ios

The default value is RESERVED_BIO_BASED_IOS (16).  The maximum allowed
value is RESERVED_MAX_IOS (1024).

Export dm_get_reserved_bio_based_ios() for use by DM targets and core
code.  Switch to sizing dm-io's mempool and bioset using DM core's
configurable 'reserved_bio_based_ios'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
2013-09-23 10:42:24 -04:00
Mike Snitzer
f47908269f dm: add reserved_rq_based_ios module parameter
Allow user to change the number of IOs that are reserved by
request-based DM's mempools by writing to this file:
/sys/module/dm_mod/parameters/reserved_rq_based_ios

The default value is RESERVED_REQUEST_BASED_IOS (256).  The maximum
allowed value is RESERVED_MAX_IOS (1024).

Export dm_get_reserved_rq_based_ios() for use by DM targets and core
code.  Switch to sizing dm-mpath's mempool using DM core's configurable
'reserved_rq_based_ios'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
2013-09-23 10:42:24 -04:00
Mike Snitzer
6cfa58573f dm: lower bio-based mempool reservation
Bio-based device mapper processing doesn't need larger mempools (like
request-based DM does), so lower the number of reserved entries for
bio-based operation.  16 was already used for bio-based DM's bioset
but mistakenly wasn't used for it's _io_cache.

Formalize difference between bio-based and request-based defaults by
introducing RESERVED_BIO_BASED_IOS and RESERVED_REQUEST_BASED_IOS.

(based on older code from Mikulas Patocka)

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
2013-09-23 10:42:23 -04:00
Mike Snitzer
b60ab990cc dm thin: do not expose non-zero discard limits if discards disabled
Fix issue where the block layer would stack the discard limits of the
pool's data device even if the "ignore_discard" pool feature was
specified.

The pool and thin device(s) still had discards disabled because the
QUEUE_FLAG_DISCARD request_queue flag wasn't set.  But to avoid user
confusion when "ignore_discard" is used: both the pool device and the
thin device(s) have zeroes for all discard limits.

Also, always set discard_zeroes_data_unsupported in targets because they
should never advertise the 'discard_zeroes_data' capability (even if the
pool's data device supports it).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2013-09-23 10:42:06 -04:00
Mike Snitzer
f84cb8a46a dm mpath: disable WRITE SAME if it fails
Workaround the SCSI layer's problematic WRITE SAME heuristics by
disabling WRITE SAME in the DM multipath device's queue_limits if an
underlying device disabled it.

The WRITE SAME heuristics, with both the original commit 5db44863b6
("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
WRITE SAME(10) even without successfully determining it is supported.
After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
for the device (by setting sdkp->device->no_write_same which results in
'max_write_same_sectors' in device's queue_limits to be set to 0).

When a device is stacked ontop of such a SCSI device any changes to that
SCSI device's queue_limits do not automatically propagate up the stack.
As such, a DM multipath device will not have its WRITE SAME support
disabled.  This causes the block layer to continue to issue WRITE SAME
requests to the mpath device which causes paths to fail and (if mpath IO
isn't configured to queue when no paths are available) it will result in
actual IO errors to the upper layers.

This fix doesn't help configurations that have additional devices
stacked ontop of the mpath device (e.g. LVM created linear DM devices
ontop).  A proper fix that restacks all the queue_limits from the bottom
of the device stack up will need to be explored if SCSI will continue to
use this model of optimistically allowing op codes and then disabling
them after they fail for the first time.

Before this patch:

EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
device-mapper: multipath: Failing path 8:112.
end_request: I/O error, dev dm-6, sector 4616
dm-6: WRITE SAME failed. Manually zeroing.
end_request: I/O error, dev dm-6, sector 4616
end_request: I/O error, dev dm-6, sector 5640
end_request: I/O error, dev dm-6, sector 6664
end_request: I/O error, dev dm-6, sector 7688
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
end_request: I/O error, dev dm-6, sector 524296
Aborting journal on device dm-6-8.
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.

# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
33553920

After this patch:

EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.

# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
0

It should be noted that WRITE SAME support wasn't enabled in DM
multipath until v3.10.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: stable@vger.kernel.org # 3.10+
2013-09-20 10:36:34 -04:00
Mikulas Patocka
60e356f381 dm-snapshot: fix performance degradation due to small hash size
LVM2, since version 2.02.96, creates origin with zero size, then loads
the snapshot driver and then loads the origin.  Consequently, the
snapshot driver sees the origin size zero and sets the hash size to the
lower bound 64.  Such small hash table causes performance degradation.

This patch changes it so that the hash size is determined by the size of
snapshot volume, not minimum of origin and snapshot size.  It doesn't
make sense to set the snapshot size significantly larger than the origin
size, so we do not need to take origin size into account when
calculating the hash size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2013-09-20 10:36:34 -04:00
Mikulas Patocka
5ea330a75b dm snapshot: workaround for a false positive lockdep warning
The kernel reports a lockdep warning if a snapshot is invalidated because
it runs out of space.

The lockdep warning was triggered by commit 0976dfc1d0
("workqueue: Catch more locking problems with flush_work()") in v3.5.

The warning is false positive.  The real cause for the warning is that
the lockdep engine treats different instances of md->lock as a single
lock.

This patch is a workaround - we use flush_workqueue instead of flush_work.
This code path is not performance sensitive (it is called only on
initialization or invalidation), thus it doesn't matter that we flush the
whole workqueue.

The real fix for the problem would be to teach the lockdep engine to treat
different instances of md->lock as separate locks.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.5+
2013-09-20 10:36:34 -04:00