2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-23 04:34:11 +08:00
Commit Graph

4570 Commits

Author SHA1 Message Date
Byungchul Park
eae8263fb1 md/raid5: Don't reinvent the wheel but use existing llist API
Although llist provides proper APIs, they are not used. Make them used.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-16 14:49:05 -08:00
Ming Lei
d7a1030839 md: fast clone bio in bio_clone_mddev()
Firstly bio_clone_mddev() is used in raid normal I/O and isn't
in resync I/O path.

Secondly all the direct access to bvec table in raid happens on
resync I/O except for write behind of raid1, in which we still
use bio_clone() for allocating new bvec table.

So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().

Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-15 11:24:54 -08:00
Ming Lei
ed7ef732ca md: remove unnecessary check on mddev
mddev is never NULL and neither is ->bio_set, so
remove the check.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-15 11:24:13 -08:00
Ming Lei
8e58e327e2 md/raid1: use bio_clone_bioset_partial() in case of write behind
Write behind need to replace pages in bio's bvecs, and we have
to clone a fresh bio with new bvec table, so use the introduced
bio_clone_bioset_partial() for it.

For other bio_clone_mddev() cases, we will use fast clone since
they don't need to touch bvec table.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-15 11:23:49 -08:00
Ming Lei
10273170fd md: fail if mddev->bio_set can't be created
The current behaviour is to fall back to allocate
bio from 'fs_bio_set', that isn't a correct way
because it might cause deadlock.

So this patch simply return failure if mddev->bio_set
can't be created.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-15 11:23:24 -08:00
Shaohua Li
26483819f8 md: disable WRITE SAME if it fails in underlayer disks
This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
this.

We did a little bit different than dm. Instead of disabling writesame in
the first IO error, we disable it till next writesame IO coming after
the first IO error. This way we don't need to clone a bio.

Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581

Suggested-by: NeilBrown <neilb@suse.com>
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 19:24:16 -08:00
Shaohua Li
e33fbb9cc7 md/raid5-cache: exclude reclaiming stripes in reclaim check
stripes which are being reclaimed are still accounted into cached
stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
stripes and does unnecessary stripe reclaim. In practice, I saw one
stripe is reclaimed one time. This will cause bad IO pattern. Fixing
this by excluding the reclaing stripes in the check.

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:20:05 -08:00
Shaohua Li
e8fd52eec2 md/raid5-cache: stripe reclaim only counts valid stripes
When log space is tight, we try to reclaim stripes from log head. There
are stripes which can't be reclaimed right now if some conditions are
met. We skip such stripes but accidentally count them, which might cause
no stripes are claimed. Fixing this by only counting valid stripes.

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:20:02 -08:00
NeilBrown
9356863c94 md: ensure md devices are freed before module is unloaded.
Commit: cbd1998377 ("md: Fix unfortunate interaction with evms")
change mddev_put() so that it would not destroy an md device while
->ctime was non-zero.

Unfortunately, we didn't make sure to clear ->ctime when unloading
the module, so it is possible for an md device to remain after
module unload.  An attempt to open such a device will trigger
an invalid memory reference in:
  get_gendisk -> kobj_lookup -> exact_lock -> get_disk

when tring to access disk->fops, which was in the module that has
been removed.

So ensure we clear ->ctime in md_exit(), and explain how that is useful,
as it isn't immediately obvious when looking at the code.

Fixes: cbd1998377 ("md: Fix unfortunate interaction with evms")
Tested-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:17:53 -08:00
Song Liu
39b99586b3 md/r5cache: improve journal device efficiency
It is important to be able to flush all stripes in raid5-cache.
Therefore, we need reserve some space on the journal device for
these flushes. If flush operation includes pending writes to the
stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
for the flush out. This reduces the efficiency of journal space.
If we exclude these pending writes from flush operation, we only
need (conf->max_degraded + 1) pages per stripe.

With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
pending writes will be excluded from stripe flush out. Therefore,
we can reduce reserved space for flush out and thus improve journal
device efficiency.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:17:52 -08:00
Song Liu
03b047f45c md/r5cache: enable chunk_aligned_read with write back cache
Chunk aligned read significantly reduces CPU usage of raid456.
However, it is not safe to fully bypass the write back cache.
This patch enables chunk aligned read with write back cache.

For chunk aligned read, we track stripes in write back cache at
a bigger granularity, "big_stripe". Each chunk may contain more
than one stripe (for example, a 256kB chunk contains 64 4kB-page,
so this chunk contain 64 stripes). For chunk_aligned_read, these
stripes are grouped into one big_stripe, so we only need one lookup
for the whole chunk.

For each big_stripe, struct big_stripe_info tracks how many stripes
of this big_stripe are in the write back cache. We count how many
stripes of this big_stripe are in the write back cache. These
counters are tracked in a radix tree (big_stripe_tree).
r5c_tree_index() is used to calculate keys for the radix tree.

chunk_aligned_read() calls r5c_big_stripe_cached() to look up
big_stripe of each chunk in the tree. If this big_stripe is in the
tree, chunk_aligned_read() aborts. This look up is protected by
rcu_read_lock().

It is necessary to remember whether a stripe is counted in
big_stripe_tree. Instead of adding new flag, we reuses existing flags:
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
two flags are set, the stripe is counted in big_stripe_tree. This
requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
r5c_try_caching_write(); and moving clear_bit of
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
r5c_finish_stripe_write_out().

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:17:51 -08:00
Shaohua Li
765d704db1 raid5: only dispatch IO from raid5d for harddisk raid
We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.

To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.

Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.

My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:

without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s

We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.

Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:17:50 -08:00
colyli@suse.de
03a9e24ef2 md linear: fix a race between linear_add() and linear_congested()
Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
disk to a md linear device causes kernel crash at linear_congested(). From
the crash image analysis, I find in linear_congested(), mddev->raid_disks
contains value N, but conf->disks[] only has N-1 pointers available. Then
a NULL pointer deference crashes the kernel.

There is a race between linear_add() and linear_congested(), RCU stuffs
used in these two functions cannot avoid the race. Since Linuv v4.0
RCU code is replaced by introducing mddev_suspend().  After checking the
upstream code, it seems linear_congested() is not called in
generic_make_request() code patch, so mddev_suspend() cannot provent it
from being called. The possible race still exists.

Here I explain how the race still exists in current code.  For a machine
has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
md linear device; at the same time on other CPU, linear_congested() is
called to detect whether this md linear device is congested before issuing
an I/O request onto it.

Now I use a possible code execution time sequence to demo how the possible
race happens,

seq    linear_add()                linear_congested()
 0                                 conf=mddev->private
 1   oldconf=mddev->private
 2   mddev->raid_disks++
 3                              for (i=0; i<mddev->raid_disks;i++)
 4                                bdev_get_queue(conf->disks[i].rdev->bdev)
 5   mddev->private=newconf

In linear_add() mddev->raid_disks is increased in time seq 2, and on
another CPU in linear_congested() the for-loop iterates conf->disks[i] by
the increased mddev->raid_disks in time seq 3,4. But conf with one more
element (which is a pointer to struct dev_info type) to conf->disks[] is
not updated yet, accessing its structure member in time seq 4 will cause a
NULL pointer deference fault.

To fix this race, there are 2 parts of modification in the patch,
 1) Add 'int raid_disks' in struct linear_conf, as a copy of
    mddev->raid_disks. It is initialized in linear_conf(), always being
    consistent with pointers number of 'struct dev_info disks[]'. When
    iterating conf->disks[] in linear_congested(), use conf->raid_disks to
    replace mddev->raid_disks in the for-loop, then NULL pointer deference
    will not happen again.
 2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
    free oldconf memory. Because oldconf may be referenced as mddev->private
    in linear_congested(), kfree_rcu() makes sure that its memory will not
    be released until no one uses it any more.
Also some code comments are added in this patch, to make this modification
to be easier understandable.

This patch can be applied for kernels since v4.0 after commit:
3be260cc18 ("md/linear: remove rcu protections in favour of
suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
people who maintain kernels before Linux v4.0, they need to do some back
back port to this patch.

Changelog:
 - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
       replace rcu_call() in linear_add().
 - v2: add RCU stuffs by suggestion from Shaohua and Neil.
 - v1: initial effort.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fb.com>
2017-02-13 09:17:50 -08:00
Ondrej Kozina
f5b0cba8f2 dm crypt: replace RCU read-side section with rwsem
The lockdep splat below hints at a bug in RCU usage in dm-crypt that
was introduced with commit c538f6ec9f ("dm crypt: add ability to use
keys from the kernel key retention service").  The kernel keyring
function user_key_payload() is in fact a wrapper for
rcu_dereference_protected() which must not be called with only
rcu_read_lock() section mark.

Unfortunately the kernel keyring subsystem doesn't currently provide
an interface that allows the use of an RCU read-side section.  So for
now we must drop RCU in favour of rwsem until a proper function is
made available in the kernel keyring subsystem.

===============================
[ INFO: suspicious RCU usage. ]
4.10.0-rc5 #2 Not tainted
-------------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by cryptsetup/6464:
 #0:  (&md->type_lock){+.+.+.}, at: [<ffffffffa02472a2>] dm_lock_md_type+0x12/0x20 [dm_mod]
 #1:  (rcu_read_lock){......}, at: [<ffffffffa02822f8>] crypt_set_key+0x1d8/0x4b0 [dm_crypt]
stack backtrace:
CPU: 1 PID: 6464 Comm: cryptsetup Not tainted 4.10.0-rc5 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
Call Trace:
 dump_stack+0x67/0x92
 lockdep_rcu_suspicious+0xc5/0x100
 crypt_set_key+0x351/0x4b0 [dm_crypt]
 ? crypt_set_key+0x1d8/0x4b0 [dm_crypt]
 crypt_ctr+0x341/0xa53 [dm_crypt]
 dm_table_add_target+0x147/0x330 [dm_mod]
 table_load+0x111/0x350 [dm_mod]
 ? retrieve_status+0x1c0/0x1c0 [dm_mod]
 ctl_ioctl+0x1f5/0x510 [dm_mod]
 dm_ctl_ioctl+0xe/0x20 [dm_mod]
 do_vfs_ioctl+0x8e/0x690
 ? ____fput+0x9/0x10
 ? task_work_run+0x7e/0xa0
 ? trace_hardirqs_on_caller+0x122/0x1b0
 SyS_ioctl+0x3c/0x70
 entry_SYSCALL_64_fastpath+0x18/0xad
RIP: 0033:0x7f392c9a4ec7
RSP: 002b:00007ffef6383378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffef63830a0 RCX: 00007f392c9a4ec7
RDX: 000000000124fcc0 RSI: 00000000c138fd09 RDI: 0000000000000005
RBP: 00007ffef6383090 R08: 00000000ffffffff R09: 00000000012482b0
R10: 2a28205d34383336 R11: 0000000000000246 R12: 00007f392d803a08
R13: 00007ffef63831e0 R14: 0000000000000000 R15: 00007f392d803a0b

Fixes: c538f6ec9f ("dm crypt: add ability to use keys from the kernel key retention service")
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-03 10:26:14 -05:00
Mike Snitzer
4087a1fffe dm rq: cope with DM device destruction while in dm_old_request_fn()
Fixes a crash in dm_table_find_target() due to a NULL struct dm_table
being passed from dm_old_request_fn() that races with DM device
destruction.

Reported-by: artem@flashgrid.io
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2017-02-03 10:18:43 -05:00
Mike Snitzer
d19a55ccad dm mpath: cleanup -Wbool-operation warning in choose_pgpath()
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-03 10:18:37 -05:00
Song Liu
2e38a37f23 md/r5cache: disable write back for degraded array
write-back cache in degraded mode introduces corner cases to the array.
Although we try to cover all these corner cases, it is safer to just
disable write-back cache when the array is in degraded mode.

In this patch, we disable writeback cache for degraded mode:
1. On device failure, if the array enters degraded mode, raid5_error()
   will submit async job r5c_disable_writeback_async to disable
   writeback;
2. In r5c_journal_mode_store(), it is invalid to enable writeback in
   degraded mode;
3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
   in write-through mode.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:26:06 -08:00
Song Liu
07e8336484 md/r5cache: shift complex rmw from read path to write path
Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.

However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.

To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.

Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.

There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:15 -08:00
Song Liu
a85dd7b8df md/r5cache: flush data only stripes in r5l_recovery_log()
For safer operation, all arrays start in write-through mode, which has been
better tested and is more mature. And actually the write-through/write-mode
isn't persistent after array restarted, so we always start array in
write-through mode. However, if recovery found data-only stripes before the
shutdown (from previous write-back mode), it is not safe to start the array in
write-through mode, as write-through mode can not handle stripes with data in
write-back cache. To solve this problem, we flush all data-only stripes in
r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with
empty cache in write-through mode.

This logic is implemented in r5c_recovery_flush_data_only_stripes():

1. enable write back cache
2. flush all stripes
3. wake up conf->mddev->thread
4. wait for all stripes get flushed (reuse wait_for_quiescent)
5. disable write back cache

The wait in 4 will be waked up in release_inactive_stripe_list()
when conf->active_stripes reaches 0.

It is safe to wake up mddev->thread here because all the resource
required for the thread has been initialized.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:15 -08:00
Song Liu
ba02684daf md/raid5: move comment of fetch_block to right location
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:14 -08:00
Song Liu
86aa1397dd md/r5cache: read data into orig_page for prexor of cached data
With write back cache, we use orig_page to do prexor. This patch
makes sure we read data into orig_page for it.

Flag R5_OrigPageUPTDODATE is added to show whether orig_page
has the latest data from raid disk.

We introduce a helper function uptodate_for_rmw() to simplify
the a couple conditions in handle_stripe_dirtying().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:14 -08:00
Shaohua Li
d46d29f072 md/raid5-cache: delete meaningless code
sector_t is unsigned long, it's never < 0

Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-24 11:20:13 -08:00
Jes Sorensen
32cd7cbbac md/raid5: Use correct IS_ERR() variation on pointer check
This fixes a build error on certain architectures, such as ppc64.

Fixes: 6995f0b247e("md: takeover should clear unrelated bits")
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-09 13:58:10 -08:00
Shaohua Li
394ed8e474 md: cleanup mddev flag clear for takeover
Commit 6995f0b (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for unsupported mddev flags for each raid type and use it
to clear unsupported mddev flags. This should be less error-prone.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:45:18 -08:00
Colin Ian King
99f17890f0 md/r5cache: fix spelling mistake on "recoverying"
Trivial fix to spelling mistake "recoverying" to "recovering" in
pr_dbg message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:44:39 -08:00
Song Liu
d2250f105f md/r5cache: assign conf->log before r5l_load_log()
r5l_load_log() calls functions that requires a proper conf->log,
for example, r5c_is_writeback(). Therefore, we should set
conf->log before calling r5l_load_log(). If r5l_load_log() fails,
conf->log is set back to NULL.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:44:38 -08:00
Song Liu
3c66abbaaf md/r5cache: simplify handling of sh->log_start in recovery
We only need to update sh->log_start at the end of recovery,
which is r5c_recovery_rewrite_data_only_stripes(), so it is not
necessary to set it before that. In this patch, log_start is
removed from r5c_recovery_alloc_stripe().

After updating all sh->log_start, rewrite_data_only_stripes()
also updates log->next_checkpoints to the last sh->log_start.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:44:38 -08:00
JackieLiu
28ca833ecf md/raid5-cache: removes unnecessary write-through mode judgments
The write-through mode has been returned in front of the function,
do not need to do it again.

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05 11:44:37 -08:00
Robert LeBlanc
bb5f1ed70b md/raid10: Refactor raid10_make_request
Refactor raid10_make_request into seperate read and write functions to
clean up the code.

Shaohua: add the recovery check back to read path

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-03 08:56:52 -08:00
Robert LeBlanc
3b046a97cb md/raid1: Refactor raid1_make_request
Refactor raid1_make_request to make read and write code in their own
functions to clean up the code.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-03 08:56:52 -08:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Eric Wheeler
b8c0d911ac bcache: partition support: add 16 minors per bcacheN device
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Wido den Hollander <wido@widodh.nl>
2016-12-17 13:02:00 -07:00
Kent Overstreet
be628be095 bcache: Make gc wakeup sane, remove set_task_state()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2016-12-17 13:01:55 -07:00
Michael S. Tsirkin
9efeccacd3 linux: drop __bitwise__ everywhere
__bitwise__ used to mean "yes, please enable sparse checks
unconditionally", but now that we dropped __CHECK_ENDIAN__
__bitwise is exactly the same.
There aren't many users, replace it by __bitwise everywhere.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Akced-by: Lee Duncan <lduncan@suse.com>
2016-12-16 00:13:41 +02:00
Linus Torvalds
775a2e29c3 . various fixes and improvements to request-based DM and DM multipath
. some locking improvements in DM bufio
 
 . add Kconfig option to disable the DM block manager's extra locking
   which mainly serves as a developer tool
 
 . a few bug fixes to DM's persistent-data
 
 . a couple changes to prepare for multipage biovec support in the block
   layer
 
 . various improvements and cleanups in the DM core, DM cache, DM raid
   and DM crypt
 
 . add ability to have DM crypt use keys from the kernel key retention
   service
 
 . add a new "error_writes" feature to the DM flakey target, reads are
   left unchanged in this mode
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJYUW8zAAoJEMUj8QotnQNaAWEIAMRQ4aCXq5T7F9Hf4K/l6FwO
 FoBr2TPS3Lf0vm/A5Tr819I47hk7q0oroa61ARbpS90iuGt/Au/Sk35cn1BwT0YW
 llMvMGbh+w9ZBUJGkyexdXbyfm5ywPHuthMr4CK/UNASyjDl2QMAeBuUZ6FLSPn1
 RUL/RYv0mG/7EXOPz0PURPb5rpjO15cAU0NjfNS0862UVR8x8dNS6iImOmScsioe
 Flw90qPl3kMBxBHik8xSPJfhtW+lD7xSaOlWzHKtalnUZHRG2BNUtlAMKdiaynx2
 yl9MhSsi8wlgd4h9WmlmaOr0VqkU5UYY9D9TDuuJwXnHUXGenVSJ/aGOohr+bm4=
 =kOoK
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - various fixes and improvements to request-based DM and DM multipath

 - some locking improvements in DM bufio

 - add Kconfig option to disable the DM block manager's extra locking
   which mainly serves as a developer tool

 - a few bug fixes to DM's persistent-data

 - a couple changes to prepare for multipage biovec support in the block
   layer

 - various improvements and cleanups in the DM core, DM cache, DM raid
   and DM crypt

 - add ability to have DM crypt use keys from the kernel key retention
   service

 - add a new "error_writes" feature to the DM flakey target, reads are
   left unchanged in this mode

* tag 'dm-4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (40 commits)
  dm flakey: introduce "error_writes" feature
  dm cache policy smq: use hash_32() instead of hash_32_generic()
  dm crypt: reject key strings containing whitespace chars
  dm space map: always set ev if sm_ll_mutate() succeeds
  dm space map metadata: skip useless memcpy in metadata_ll_init_index()
  dm space map metadata: fix 'struct sm_metadata' leak on failed create
  Documentation: dm raid: define data_offset status field
  dm raid: fix discard support regression
  dm raid: don't allow "write behind" with raid4/5/6
  dm mpath: use hw_handler_params if attached hw_handler is same as requested
  dm crypt: add ability to use keys from the kernel key retention service
  dm array: remove a dead assignment in populate_ablock_with_values()
  dm ioctl: use offsetof() instead of open-coding it
  dm rq: simplify use_blk_mq initialization
  dm: use blk_set_queue_dying() in __dm_destroy()
  dm bufio: drop the lock when doing GFP_NOIO allocation
  dm bufio: don't take the lock in dm_bufio_shrink_count
  dm bufio: avoid sleeping while holding the dm_bufio lock
  dm table: simplify dm_table_determine_type()
  dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM device
  ...
2016-12-14 11:01:00 -08:00
Shaohua Li
20737738d3 Merge branch 'md-next' into md-linus 2016-12-13 12:40:15 -08:00
Mike Snitzer
ef548c551e dm flakey: introduce "error_writes" feature
Recent dm-flakey fixes, to have reads error out during the "down"
interval, made it so that the previous read behaviour is no longer
available.

It is useful to have reads complete like normal but have writes error
out, so make it possible again with a new "error_writes" feature.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-13 15:01:31 -05:00
Linus Torvalds
36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Shaohua Li
2953079c69 md: separate flags for superblock changes
The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 22:01:47 -08:00
Shaohua Li
82a301cb0e md: MD_RECOVERY_NEEDED is set for mddev->recovery
Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device
removal.")

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 22:00:43 -08:00
Shaohua Li
6995f0b247 md: takeover should clear unrelated bits
When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit
will be accidentally set, but raid5 doesn't support it. The same is true
for the MD_HAS_JOURNAL bit.

Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate)
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08 22:00:11 -08:00
Mike Snitzer
e99dda8fc4 dm cache policy smq: use hash_32() instead of hash_32_generic()
Switch to using hash_32() because hash_32_generic() should only be used
by the kernel's selftests.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 19:42:37 -05:00
Ondrej Kozina
027c431ccf dm crypt: reject key strings containing whitespace chars
Unfortunately key_string may theoretically contain whitespace even after
it's processed by dm_split_args().  The reason for this is DM core
supports escaping of almost all chars including any whitespace.

If userspace passes a key to the kernel in format ":32:logon:my_prefix:my\ key"
dm-crypt will look up key "my_prefix:my key" in kernel keyring service.
So far everything's fine.

Unfortunately if userspace later calls DM_TABLE_STATUS ioctl, it will not
receive back expected ":32:logon:my_prefix:my\ key" but the unescaped version
instead.  Also userpace (most notably cryptsetup) is not ready to parse
single target argument containing (even escaped) whitespace chars and any
whitespace is simply taken as delimiter of another argument.

This effect is mitigated by the fact libdevmapper curently performs
double escaping of '\' char.  Any user input in format "x\ x" is
transformed into "x\\ x" before being passed to the kernel.  Nonetheless
dm-crypt may be used without libdevmapper.  Therefore the near-term
solution to this is to reject any key string containing whitespace.

Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:16 -05:00
Benjamin Marzinski
b446396b74 dm space map: always set ev if sm_ll_mutate() succeeds
If no block was allocated or freed, sm_ll_mutate() wasn't setting
*ev, leaving the variable unitialized. sm_ll_insert(),
sm_disk_inc_block(), and sm_disk_new_block() all check ev to see
if there was an allocation event in sm_ll_mutate(), possibly
reading unitialized data.

If no allocation event occured, sm_ll_mutate() should set *ev
to SM_NONE.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:15 -05:00
Benjamin Marzinski
0c79ce0b75 dm space map metadata: skip useless memcpy in metadata_ll_init_index()
When metadata_ll_init_index() is called by sm_ll_new_metadata(),
ll->mi_le hasn't been initialized yet. So, when
metadata_ll_init_index() copies the contents of ll->mi_le into the
newly allocated bitmap_root, it is just copying garbage. ll->mi_le
will be allocated later in sm_ll_extend() and copied into the
bitmap_root, in sm_ll_commit().

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:15 -05:00
Benjamin Marzinski
314c25c56c dm space map metadata: fix 'struct sm_metadata' leak on failed create
In dm_sm_metadata_create() we temporarily change the dm_space_map
operations from 'ops' (whose .destroy function deallocates the
sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't).

If dm_sm_metadata_create() fails in sm_ll_new_metadata() or
sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls
dm_sm_destroy() with the intention of freeing the sm_metadata, but it
doesn't (because the dm_space_map operations is still set to
'bootstrap_ops').

Fix this by setting the dm_space_map operations back to 'ops' if
dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-12-08 14:13:14 -05:00
Heinz Mauelshagen
11e2968478 dm raid: fix discard support regression
Commit ecbfb9f118 ("dm raid: add raid level takeover support") moved the
configure_discard_support() call from raid_ctr() to raid_preresume().

Enabling/disabling discard _must_ happen during table load (through the
.ctr hook).  Fix this regression by moving the
configure_discard_support() call back to raid_ctr().

Fixes: ecbfb9f118 ("dm raid: add raid level takeover support")
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:12 -05:00
Heinz Mauelshagen
affa9d28f7 dm raid: don't allow "write behind" with raid4/5/6
Remove CTR_FLAG_MAX_WRITE_BEHIND from raid4/5/6's valid ctr flags.

Only the md raid1 personality supports setting a maximum number
of "write behind" write IOs on any legs set to "write mostly".
"write mostly" enhances throughput with slow links/disks.

Technically the "write behind" value is a write intent bitmap
property only being respected by the raid1 personality.  It allows a
maximum number of "write behind" writes to any "write mostly" raid1
mirror legs to be delayed and avoids reads from such legs.

No other MD personalities supported via dm-raid make use of "write
behind", thus setting this property is superfluous; it wouldn't cause
harm but it is correct to reject it.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:11 -05:00
tang.junhui
54cd640d20 dm mpath: use hw_handler_params if attached hw_handler is same as requested
Let the requested m->hw_handler_params be used if the attached hardware
handler is the same handler as requested with m->hw_handler_name.

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:10 -05:00
Ondrej Kozina
c538f6ec9f dm crypt: add ability to use keys from the kernel key retention service
The kernel key service is a generic way to store keys for the use of
other subsystems. Currently there is no way to use kernel keys in dm-crypt.
This patch aims to fix that. Instead of key userspace may pass a key
description with preceding ':'. So message that constructs encryption
mapping now looks like this:

  <cipher> [<key>|:<key_string>] <iv_offset> <dev_path> <start> [<#opt_params> <opt_params>]

where <key_string> is in format: <key_size>:<key_type>:<key_description>

Currently we only support two elementary key types: 'user' and 'logon'.
Keys may be loaded in dm-crypt either via <key_string> or using
classical method and pass the key in hex representation directly.

dm-crypt device initialised with a key passed in hex representation may be
replaced with key passed in key_string format and vice versa.

(Based on original work by Andrey Ryabinin)

Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08 14:13:09 -05:00