Add a device-mapper target called dm-switch to provide a multipath
 framework for storage arrays that dynamically reconfigure their
 preferred paths for different device regions.
 
 Fix a bug in the verity target that prevented its use with some
 specific sizes of devices.
 
 Improve some locking mechanisms in the device-mapper core and bufio.
 
 Add Mike Snitzer as a device-mapper maintainer.
 
 A few more clean-ups and fixes.

Merge tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper changes from Alasdair G Kergon:
 "Add a device-mapper target called dm-switch to provide a multipath
  framework for storage arrays that dynamically reconfigure their
  preferred paths for different device regions.

  Fix a bug in the verity target that prevented its use with some
  specific sizes of devices.

  Improve some locking mechanisms in the device-mapper core and bufio.

  Add Mike Snitzer as a device-mapper maintainer.

  A few more clean-ups and fixes"

* tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm: add switch target
  dm: update maintainers
  dm: optimize reorder structure
  dm: optimize use SRCU and RCU
  dm bufio: submit writes outside lock
  dm cache: fix arm link errors with inline
  dm verity: use __ffs and __fls
  dm flakey: correct ctr alloc failure mesg
  dm verity: remove pointless comparison
  dm: use __GFP_HIGHMEM in __vmalloc
  dm verity: fix inability to use a few specific devices sizes
  dm ioctl: set noio flag to avoid __vmalloc deadlock
  dm mpath: fix ioctl deadlock when no paths
Linus Torvalds 2013-07-11 13:05:40 -07:00
commit 9903883f1d
15 changed files with 951 additions and 185 deletions


@ -0,0 +1,126 @@
dm-switch
=========
The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths. The path used for any specific region can be switched
dynamically by sending the target a message.
It maps I/O to underlying block devices efficiently when there is a large
number of fixed-sized address regions but there is no simple pattern
that would allow for a compact representation of the mapping such as
dm-stripe.
Background
----------
Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture. In this architecture, the storage group
consists of a number of distinct storage arrays ("members") each having
independent controllers, disk storage and network adapters. When a LUN
is created it is spread across multiple members. The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used. When iSCSI sessions are created, each
session is connected to an eth port on a single member. Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the I/O will be forwarded as required. This
forwarding is invisible to the initiator. The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.
This architecture simplifies the management and configuration of both
the storage group and initiators. In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth. An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.
A device-mapper table already lets you map different regions of a
device onto different targets. However, in this architecture the LUN is
spread with an address region size on the order of tens of MB, which
means the resulting table could have more than a million entries and
consume far too much memory.
Using this device-mapper switch target we can now build a two-layer
device hierarchy:
    Upper Tier: Determine which array member the I/O should be sent to.
    Lower Tier: Load balance amongst paths to a particular member.
The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths. We also build a
non-preferred priority group containing paths to other array members for
failover reasons.
The upper tier consists of a single dm-switch device. This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O. By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us). This is a much denser representation than the dm table
b-tree can achieve.
Construction Parameters
=======================
    <num_paths> <region_size> <num_optional_args> [<optional_args>...]
    [<dev_path> <offset>]+

<num_paths>
    The number of paths across which to distribute the I/O.

<region_size>
    The number of 512-byte sectors in a region. Each region can be redirected
    to any of the available paths.

<num_optional_args>
    The number of optional arguments. Currently, no optional arguments
    are supported and so this must be zero.

<dev_path>
    The block device that represents a specific path to the device.

<offset>
    The offset of the start of data on the specific <dev_path> (in units
    of 512-byte sectors). This number is added to the sector number when
    forwarding the request to the specific path. Typically it is zero.
Messages
========
set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).
Status
======
No status line is reported.
Example
=======
Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2, all
of the same size.

Create a switch device with 64kB region size:
    dmsetup create switch --table "0 `blockdev --getsize /dev/vg1/switch0`
	switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1:
    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1


@ -2574,6 +2574,7 @@ S: Maintained
DEVICE-MAPPER (LVM)
M: Alasdair Kergon <agk@redhat.com>
M: Mike Snitzer <snitzer@redhat.com>
M: dm-devel@redhat.com
L: dm-devel@redhat.com
W: http://sources.redhat.com/dm
@ -2585,6 +2586,7 @@ F: drivers/md/dm*
F: drivers/md/persistent-data/
F: include/linux/device-mapper.h
F: include/linux/dm-*.h
F: include/uapi/linux/dm-*.h
DIOLAN U2C-12 I2C DRIVER
M: Guenter Roeck <linux@roeck-us.net>


@ -412,4 +412,18 @@ config DM_VERITY
If unsure, say N.
config DM_SWITCH
tristate "Switch target support (EXPERIMENTAL)"
depends on BLK_DEV_DM
---help---
This device-mapper target creates a device that supports an arbitrary
mapping of fixed-size regions of I/O across a fixed set of paths.
The path used for any specific region can be switched dynamically
by sending the target a message.
To compile this code as a module, choose M here: the module will
be called dm-switch.
If unsure, say N.
endif # MD


@ -40,6 +40,7 @@ obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
obj-$(CONFIG_DM_SWITCH) += dm-switch.o
obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
obj-$(CONFIG_DM_PERSISTENT_DATA) += persistent-data/
obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o dm-region-hash.o


@ -145,6 +145,7 @@ struct dm_buffer {
unsigned long state;
unsigned long last_accessed;
struct dm_bufio_client *c;
struct list_head write_list;
struct bio bio;
struct bio_vec bio_vec[DM_BUFIO_INLINE_VECS];
};
@ -349,7 +350,7 @@ static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
if (gfp_mask & __GFP_NORETRY)
noio_flag = memalloc_noio_save();
ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
ptr = __vmalloc(c->block_size, gfp_mask | __GFP_HIGHMEM, PAGE_KERNEL);
if (gfp_mask & __GFP_NORETRY)
memalloc_noio_restore(noio_flag);
@ -630,7 +631,8 @@ static int do_io_schedule(void *word)
* - Submit our write and don't wait on it. We set B_WRITING indicating
* that there is a write in progress.
*/
static void __write_dirty_buffer(struct dm_buffer *b)
static void __write_dirty_buffer(struct dm_buffer *b,
struct list_head *write_list)
{
if (!test_bit(B_DIRTY, &b->state))
return;
@ -639,7 +641,24 @@ static void __write_dirty_buffer(struct dm_buffer *b)
wait_on_bit_lock(&b->state, B_WRITING,
do_io_schedule, TASK_UNINTERRUPTIBLE);
submit_io(b, WRITE, b->block, write_endio);
if (!write_list)
submit_io(b, WRITE, b->block, write_endio);
else
list_add_tail(&b->write_list, write_list);
}
static void __flush_write_list(struct list_head *write_list)
{
struct blk_plug plug;
blk_start_plug(&plug);
while (!list_empty(write_list)) {
struct dm_buffer *b =
list_entry(write_list->next, struct dm_buffer, write_list);
list_del(&b->write_list);
submit_io(b, WRITE, b->block, write_endio);
dm_bufio_cond_resched();
}
blk_finish_plug(&plug);
}
/*
@ -655,7 +674,7 @@ static void __make_buffer_clean(struct dm_buffer *b)
return;
wait_on_bit(&b->state, B_READING, do_io_schedule, TASK_UNINTERRUPTIBLE);
__write_dirty_buffer(b);
__write_dirty_buffer(b, NULL);
wait_on_bit(&b->state, B_WRITING, do_io_schedule, TASK_UNINTERRUPTIBLE);
}
@ -802,7 +821,8 @@ static void __free_buffer_wake(struct dm_buffer *b)
wake_up(&c->free_buffer_wait);
}
static void __write_dirty_buffers_async(struct dm_bufio_client *c, int no_wait)
static void __write_dirty_buffers_async(struct dm_bufio_client *c, int no_wait,
struct list_head *write_list)
{
struct dm_buffer *b, *tmp;
@ -818,7 +838,7 @@ static void __write_dirty_buffers_async(struct dm_bufio_client *c, int no_wait)
if (no_wait && test_bit(B_WRITING, &b->state))
return;
__write_dirty_buffer(b);
__write_dirty_buffer(b, write_list);
dm_bufio_cond_resched();
}
}
@ -853,7 +873,8 @@ static void __get_memory_limit(struct dm_bufio_client *c,
* If we are over threshold_buffers, start freeing buffers.
* If we're over "limit_buffers", block until we get under the limit.
*/
static void __check_watermark(struct dm_bufio_client *c)
static void __check_watermark(struct dm_bufio_client *c,
struct list_head *write_list)
{
unsigned long threshold_buffers, limit_buffers;
@ -872,7 +893,7 @@ static void __check_watermark(struct dm_bufio_client *c)
}
if (c->n_buffers[LIST_DIRTY] > threshold_buffers)
__write_dirty_buffers_async(c, 1);
__write_dirty_buffers_async(c, 1, write_list);
}
/*
@ -897,7 +918,8 @@ static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
*--------------------------------------------------------------*/
static struct dm_buffer *__bufio_new(struct dm_bufio_client *c, sector_t block,
enum new_flag nf, int *need_submit)
enum new_flag nf, int *need_submit,
struct list_head *write_list)
{
struct dm_buffer *b, *new_b = NULL;
@ -924,7 +946,7 @@ static struct dm_buffer *__bufio_new(struct dm_bufio_client *c, sector_t block,
goto found_buffer;
}
__check_watermark(c);
__check_watermark(c, write_list);
b = new_b;
b->hold_count = 1;
@ -992,10 +1014,14 @@ static void *new_read(struct dm_bufio_client *c, sector_t block,
int need_submit;
struct dm_buffer *b;
LIST_HEAD(write_list);
dm_bufio_lock(c);
b = __bufio_new(c, block, nf, &need_submit);
b = __bufio_new(c, block, nf, &need_submit, &write_list);
dm_bufio_unlock(c);
__flush_write_list(&write_list);
if (!b)
return b;
@ -1047,6 +1073,8 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
{
struct blk_plug plug;
LIST_HEAD(write_list);
BUG_ON(dm_bufio_in_request());
blk_start_plug(&plug);
@ -1055,7 +1083,15 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
for (; n_blocks--; block++) {
int need_submit;
struct dm_buffer *b;
b = __bufio_new(c, block, NF_PREFETCH, &need_submit);
b = __bufio_new(c, block, NF_PREFETCH, &need_submit,
&write_list);
if (unlikely(!list_empty(&write_list))) {
dm_bufio_unlock(c);
blk_finish_plug(&plug);
__flush_write_list(&write_list);
blk_start_plug(&plug);
dm_bufio_lock(c);
}
if (unlikely(b != NULL)) {
dm_bufio_unlock(c);
@ -1069,7 +1105,6 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
goto flush_plug;
dm_bufio_lock(c);
}
}
dm_bufio_unlock(c);
@ -1126,11 +1161,14 @@ EXPORT_SYMBOL_GPL(dm_bufio_mark_buffer_dirty);
void dm_bufio_write_dirty_buffers_async(struct dm_bufio_client *c)
{
LIST_HEAD(write_list);
BUG_ON(dm_bufio_in_request());
dm_bufio_lock(c);
__write_dirty_buffers_async(c, 0);
__write_dirty_buffers_async(c, 0, &write_list);
dm_bufio_unlock(c);
__flush_write_list(&write_list);
}
EXPORT_SYMBOL_GPL(dm_bufio_write_dirty_buffers_async);
@ -1147,8 +1185,13 @@ int dm_bufio_write_dirty_buffers(struct dm_bufio_client *c)
unsigned long buffers_processed = 0;
struct dm_buffer *b, *tmp;
LIST_HEAD(write_list);
dm_bufio_lock(c);
__write_dirty_buffers_async(c, 0, &write_list);
dm_bufio_unlock(c);
__flush_write_list(&write_list);
dm_bufio_lock(c);
__write_dirty_buffers_async(c, 0);
again:
list_for_each_entry_safe_reverse(b, tmp, &c->lru[LIST_DIRTY], lru_list) {
@ -1274,7 +1317,7 @@ retry:
BUG_ON(!b->hold_count);
BUG_ON(test_bit(B_READING, &b->state));
__write_dirty_buffer(b);
__write_dirty_buffer(b, NULL);
if (b->hold_count == 1) {
wait_on_bit(&b->state, B_WRITING,
do_io_schedule, TASK_UNINTERRUPTIBLE);


@ -425,6 +425,10 @@ static bool block_size_is_power_of_two(struct cache *cache)
return cache->sectors_per_block_shift >= 0;
}
/* gcc on ARM generates spurious references to __udivdi3 and __umoddi3 */
#if defined(CONFIG_ARM) && __GNUC__ == 4 && __GNUC_MINOR__ <= 6
__always_inline
#endif
static dm_block_t block_div(dm_block_t b, uint32_t n)
{
do_div(b, n);


@ -176,7 +176,7 @@ static int flakey_ctr(struct dm_target *ti, unsigned int argc, char **argv)
fc = kzalloc(sizeof(*fc), GFP_KERNEL);
if (!fc) {
ti->error = "Cannot allocate linear context";
ti->error = "Cannot allocate context";
return -ENOMEM;
}
fc->start_time = jiffies;


@ -36,6 +36,14 @@ struct hash_cell {
struct dm_table *new_map;
};
/*
* A dummy definition to make RCU happy.
* struct dm_table should never be dereferenced in this file.
*/
struct dm_table {
int undefined__;
};
struct vers_iter {
size_t param_size;
struct dm_target_versions *vers, *old_vers;
@ -242,9 +250,10 @@ static int dm_hash_insert(const char *name, const char *uuid, struct mapped_devi
return -EBUSY;
}
static void __hash_remove(struct hash_cell *hc)
static struct dm_table *__hash_remove(struct hash_cell *hc)
{
struct dm_table *table;
int srcu_idx;
/* remove from the dev hash */
list_del(&hc->uuid_list);
@ -253,16 +262,18 @@ static void __hash_remove(struct hash_cell *hc)
dm_set_mdptr(hc->md, NULL);
mutex_unlock(&dm_hash_cells_mutex);
table = dm_get_live_table(hc->md);
if (table) {
table = dm_get_live_table(hc->md, &srcu_idx);
if (table)
dm_table_event(table);
dm_table_put(table);
}
dm_put_live_table(hc->md, srcu_idx);
table = NULL;
if (hc->new_map)
dm_table_destroy(hc->new_map);
table = hc->new_map;
dm_put(hc->md);
free_cell(hc);
return table;
}
static void dm_hash_remove_all(int keep_open_devices)
@ -270,6 +281,7 @@ static void dm_hash_remove_all(int keep_open_devices)
int i, dev_skipped;
struct hash_cell *hc;
struct mapped_device *md;
struct dm_table *t;
retry:
dev_skipped = 0;
@ -287,10 +299,14 @@ retry:
continue;
}
__hash_remove(hc);
t = __hash_remove(hc);
up_write(&_hash_lock);
if (t) {
dm_sync_table(md);
dm_table_destroy(t);
}
dm_put(md);
if (likely(keep_open_devices))
dm_destroy(md);
@ -356,6 +372,7 @@ static struct mapped_device *dm_hash_rename(struct dm_ioctl *param,
struct dm_table *table;
struct mapped_device *md;
unsigned change_uuid = (param->flags & DM_UUID_FLAG) ? 1 : 0;
int srcu_idx;
/*
* duplicate new.
@ -418,11 +435,10 @@ static struct mapped_device *dm_hash_rename(struct dm_ioctl *param,
/*
* Wake up any dm event waiters.
*/
table = dm_get_live_table(hc->md);
if (table) {
table = dm_get_live_table(hc->md, &srcu_idx);
if (table)
dm_table_event(table);
dm_table_put(table);
}
dm_put_live_table(hc->md, srcu_idx);
if (!dm_kobject_uevent(hc->md, KOBJ_CHANGE, param->event_nr))
param->flags |= DM_UEVENT_GENERATED_FLAG;
@ -620,11 +636,14 @@ static int check_name(const char *name)
* _hash_lock without first calling dm_table_put, because dm_table_destroy
* waits for this dm_table_put and could be called under this lock.
*/
static struct dm_table *dm_get_inactive_table(struct mapped_device *md)
static struct dm_table *dm_get_inactive_table(struct mapped_device *md, int *srcu_idx)
{
struct hash_cell *hc;
struct dm_table *table = NULL;
/* increment rcu count, we don't care about the table pointer */
dm_get_live_table(md, srcu_idx);
down_read(&_hash_lock);
hc = dm_get_mdptr(md);
if (!hc || hc->md != md) {
@ -633,8 +652,6 @@ static struct dm_table *dm_get_inactive_table(struct mapped_device *md)
}
table = hc->new_map;
if (table)
dm_table_get(table);
out:
up_read(&_hash_lock);
@ -643,10 +660,11 @@ out:
}
static struct dm_table *dm_get_live_or_inactive_table(struct mapped_device *md,
struct dm_ioctl *param)
struct dm_ioctl *param,
int *srcu_idx)
{
return (param->flags & DM_QUERY_INACTIVE_TABLE_FLAG) ?
dm_get_inactive_table(md) : dm_get_live_table(md);
dm_get_inactive_table(md, srcu_idx) : dm_get_live_table(md, srcu_idx);
}
/*
@ -657,6 +675,7 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)
{
struct gendisk *disk = dm_disk(md);
struct dm_table *table;
int srcu_idx;
param->flags &= ~(DM_SUSPEND_FLAG | DM_READONLY_FLAG |
DM_ACTIVE_PRESENT_FLAG);
@ -676,26 +695,27 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)
param->event_nr = dm_get_event_nr(md);
param->target_count = 0;
table = dm_get_live_table(md);
table = dm_get_live_table(md, &srcu_idx);
if (table) {
if (!(param->flags & DM_QUERY_INACTIVE_TABLE_FLAG)) {
if (get_disk_ro(disk))
param->flags |= DM_READONLY_FLAG;
param->target_count = dm_table_get_num_targets(table);
}
dm_table_put(table);
param->flags |= DM_ACTIVE_PRESENT_FLAG;
}
dm_put_live_table(md, srcu_idx);
if (param->flags & DM_QUERY_INACTIVE_TABLE_FLAG) {
table = dm_get_inactive_table(md);
int srcu_idx;
table = dm_get_inactive_table(md, &srcu_idx);
if (table) {
if (!(dm_table_get_mode(table) & FMODE_WRITE))
param->flags |= DM_READONLY_FLAG;
param->target_count = dm_table_get_num_targets(table);
dm_table_put(table);
}
dm_put_live_table(md, srcu_idx);
}
}
@ -796,6 +816,7 @@ static int dev_remove(struct dm_ioctl *param, size_t param_size)
struct hash_cell *hc;
struct mapped_device *md;
int r;
struct dm_table *t;
down_write(&_hash_lock);
hc = __find_device_hash_cell(param);
@ -819,9 +840,14 @@ static int dev_remove(struct dm_ioctl *param, size_t param_size)
return r;
}
__hash_remove(hc);
t = __hash_remove(hc);
up_write(&_hash_lock);
if (t) {
dm_sync_table(md);
dm_table_destroy(t);
}
if (!dm_kobject_uevent(md, KOBJ_REMOVE, param->event_nr))
param->flags |= DM_UEVENT_GENERATED_FLAG;
@ -986,6 +1012,7 @@ static int do_resume(struct dm_ioctl *param)
old_map = dm_swap_table(md, new_map);
if (IS_ERR(old_map)) {
dm_sync_table(md);
dm_table_destroy(new_map);
dm_put(md);
return PTR_ERR(old_map);
@ -1003,6 +1030,10 @@ static int do_resume(struct dm_ioctl *param)
param->flags |= DM_UEVENT_GENERATED_FLAG;
}
/*
* Since dm_swap_table synchronizes RCU, nobody should be in
* read-side critical section already.
*/
if (old_map)
dm_table_destroy(old_map);
@ -1125,6 +1156,7 @@ static int dev_wait(struct dm_ioctl *param, size_t param_size)
int r = 0;
struct mapped_device *md;
struct dm_table *table;
int srcu_idx;
md = find_device(param);
if (!md)
@ -1145,11 +1177,10 @@ static int dev_wait(struct dm_ioctl *param, size_t param_size)
*/
__dev_status(md, param);
table = dm_get_live_or_inactive_table(md, param);
if (table) {
table = dm_get_live_or_inactive_table(md, param, &srcu_idx);
if (table)
retrieve_status(table, param, param_size);
dm_table_put(table);
}
dm_put_live_table(md, srcu_idx);
out:
dm_put(md);
@ -1221,7 +1252,7 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
{
int r;
struct hash_cell *hc;
struct dm_table *t;
struct dm_table *t, *old_map = NULL;
struct mapped_device *md;
struct target_type *immutable_target_type;
@ -1277,14 +1308,14 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
hc = dm_get_mdptr(md);
if (!hc || hc->md != md) {
DMWARN("device has been removed from the dev hash table.");
dm_table_destroy(t);
up_write(&_hash_lock);
dm_table_destroy(t);
r = -ENXIO;
goto out;
}
if (hc->new_map)
dm_table_destroy(hc->new_map);
old_map = hc->new_map;
hc->new_map = t;
up_write(&_hash_lock);
@ -1292,6 +1323,11 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
__dev_status(md, param);
out:
if (old_map) {
dm_sync_table(md);
dm_table_destroy(old_map);
}
dm_put(md);
return r;
@ -1301,6 +1337,7 @@ static int table_clear(struct dm_ioctl *param, size_t param_size)
{
struct hash_cell *hc;
struct mapped_device *md;
struct dm_table *old_map = NULL;
down_write(&_hash_lock);
@ -1312,7 +1349,7 @@ static int table_clear(struct dm_ioctl *param, size_t param_size)
}
if (hc->new_map) {
dm_table_destroy(hc->new_map);
old_map = hc->new_map;
hc->new_map = NULL;
}
@ -1321,6 +1358,10 @@ static int table_clear(struct dm_ioctl *param, size_t param_size)
__dev_status(hc->md, param);
md = hc->md;
up_write(&_hash_lock);
if (old_map) {
dm_sync_table(md);
dm_table_destroy(old_map);
}
dm_put(md);
return 0;
@ -1370,6 +1411,7 @@ static int table_deps(struct dm_ioctl *param, size_t param_size)
{
struct mapped_device *md;
struct dm_table *table;
int srcu_idx;
md = find_device(param);
if (!md)
@ -1377,11 +1419,10 @@ static int table_deps(struct dm_ioctl *param, size_t param_size)
__dev_status(md, param);
table = dm_get_live_or_inactive_table(md, param);
if (table) {
table = dm_get_live_or_inactive_table(md, param, &srcu_idx);
if (table)
retrieve_deps(table, param, param_size);
dm_table_put(table);
}
dm_put_live_table(md, srcu_idx);
dm_put(md);
@ -1396,6 +1437,7 @@ static int table_status(struct dm_ioctl *param, size_t param_size)
{
struct mapped_device *md;
struct dm_table *table;
int srcu_idx;
md = find_device(param);
if (!md)
@ -1403,11 +1445,10 @@ static int table_status(struct dm_ioctl *param, size_t param_size)
__dev_status(md, param);
table = dm_get_live_or_inactive_table(md, param);
if (table) {
table = dm_get_live_or_inactive_table(md, param, &srcu_idx);
if (table)
retrieve_status(table, param, param_size);
dm_table_put(table);
}
dm_put_live_table(md, srcu_idx);
dm_put(md);
@ -1443,6 +1484,7 @@ static int target_message(struct dm_ioctl *param, size_t param_size)
struct dm_target_msg *tmsg = (void *) param + param->data_start;
size_t maxlen;
char *result = get_result_buffer(param, param_size, &maxlen);
int srcu_idx;
md = find_device(param);
if (!md)
@ -1470,9 +1512,9 @@ static int target_message(struct dm_ioctl *param, size_t param_size)
if (r <= 1)
goto out_argv;
table = dm_get_live_table(md);
table = dm_get_live_table(md, &srcu_idx);
if (!table)
goto out_argv;
goto out_table;
if (dm_deleting_md(md)) {
r = -ENXIO;
@ -1491,7 +1533,7 @@ static int target_message(struct dm_ioctl *param, size_t param_size)
}
out_table:
dm_table_put(table);
dm_put_live_table(md, srcu_idx);
out_argv:
kfree(argv);
out:
@ -1644,7 +1686,10 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
}
if (!dmi) {
dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH, PAGE_KERNEL);
unsigned noio_flag;
noio_flag = memalloc_noio_save();
dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
memalloc_noio_restore(noio_flag);
if (dmi)
*param_flags |= DM_PARAMS_VMALLOC;
}


@ -1561,7 +1561,6 @@ static int multipath_ioctl(struct dm_target *ti, unsigned int cmd,
unsigned long flags;
int r;
again:
bdev = NULL;
mode = 0;
r = 0;
@ -1579,7 +1578,7 @@ again:
}
if ((pgpath && m->queue_io) || (!pgpath && m->queue_if_no_path))
r = -EAGAIN;
r = -ENOTCONN;
else if (!bdev)
r = -EIO;
@ -1591,11 +1590,8 @@ again:
if (!r && ti->len != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT)
r = scsi_verify_blk_ioctl(NULL, cmd);
if (r == -EAGAIN && !fatal_signal_pending(current)) {
if (r == -ENOTCONN && !fatal_signal_pending(current))
queue_work(kmultipathd, &m->process_queued_ios);
msleep(10);
goto again;
}
return r ? : __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}

drivers/md/dm-switch.c (new file, 538 lines)

@ -0,0 +1,538 @@
/*
* Copyright (C) 2010-2012 by Dell Inc. All rights reserved.
* Copyright (C) 2011-2013 Red Hat, Inc.
*
* This file is released under the GPL.
*
* dm-switch is a device-mapper target that maps IO to underlying block
* devices efficiently when there are a large number of fixed-sized
* address regions but there is no simple pattern to allow for a compact
* mapping representation such as dm-stripe.
*/
#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/vmalloc.h>
#define DM_MSG_PREFIX "switch"
/*
* One region_table_slot_t holds <region_entries_per_slot> region table
* entries each of which is <region_table_entry_bits> in size.
*/
typedef unsigned long region_table_slot_t;
/*
* A device with the offset to its start sector.
*/
struct switch_path {
struct dm_dev *dmdev;
sector_t start;
};
/*
* Context block for a dm switch device.
*/
struct switch_ctx {
struct dm_target *ti;
unsigned nr_paths; /* Number of paths in path_list. */
unsigned region_size; /* Region size in 512-byte sectors */
unsigned long nr_regions; /* Number of regions making up the device */
signed char region_size_bits; /* log2 of region_size or -1 */
unsigned char region_table_entry_bits; /* Number of bits in one region table entry */
unsigned char region_entries_per_slot; /* Number of entries in one region table slot */
signed char region_entries_per_slot_bits; /* log2 of region_entries_per_slot or -1 */
region_table_slot_t *region_table; /* Region table */
/*
* Array of dm devices to switch between.
*/
struct switch_path path_list[0];
};
static struct switch_ctx *alloc_switch_ctx(struct dm_target *ti, unsigned nr_paths,
unsigned region_size)
{
struct switch_ctx *sctx;
sctx = kzalloc(sizeof(struct switch_ctx) + nr_paths * sizeof(struct switch_path),
GFP_KERNEL);
if (!sctx)
return NULL;
sctx->ti = ti;
sctx->region_size = region_size;
ti->private = sctx;
return sctx;
}
static int alloc_region_table(struct dm_target *ti, unsigned nr_paths)
{
struct switch_ctx *sctx = ti->private;
sector_t nr_regions = ti->len;
sector_t nr_slots;
if (!(sctx->region_size & (sctx->region_size - 1)))
sctx->region_size_bits = __ffs(sctx->region_size);
else
sctx->region_size_bits = -1;
sctx->region_table_entry_bits = 1;
while (sctx->region_table_entry_bits < sizeof(region_table_slot_t) * 8 &&
(region_table_slot_t)1 << sctx->region_table_entry_bits < nr_paths)
sctx->region_table_entry_bits++;
sctx->region_entries_per_slot = (sizeof(region_table_slot_t) * 8) / sctx->region_table_entry_bits;
if (!(sctx->region_entries_per_slot & (sctx->region_entries_per_slot - 1)))
sctx->region_entries_per_slot_bits = __ffs(sctx->region_entries_per_slot);
else
sctx->region_entries_per_slot_bits = -1;
if (sector_div(nr_regions, sctx->region_size))
nr_regions++;
sctx->nr_regions = nr_regions;
if (sctx->nr_regions != nr_regions || sctx->nr_regions >= ULONG_MAX) {
ti->error = "Region table too large";
return -EINVAL;
}
nr_slots = nr_regions;
if (sector_div(nr_slots, sctx->region_entries_per_slot))
nr_slots++;
if (nr_slots > ULONG_MAX / sizeof(region_table_slot_t)) {
ti->error = "Region table too large";
return -EINVAL;
}
sctx->region_table = vmalloc(nr_slots * sizeof(region_table_slot_t));
if (!sctx->region_table) {
ti->error = "Cannot allocate region table";
return -ENOMEM;
}
return 0;
}
static void switch_get_position(struct switch_ctx *sctx, unsigned long region_nr,
unsigned long *region_index, unsigned *bit)
{
if (sctx->region_entries_per_slot_bits >= 0) {
*region_index = region_nr >> sctx->region_entries_per_slot_bits;
*bit = region_nr & (sctx->region_entries_per_slot - 1);
} else {
*region_index = region_nr / sctx->region_entries_per_slot;
*bit = region_nr % sctx->region_entries_per_slot;
}
*bit *= sctx->region_table_entry_bits;
}
/*
* Find which path to use at given offset.
*/
static unsigned switch_get_path_nr(struct switch_ctx *sctx, sector_t offset)
{
unsigned long region_index;
unsigned bit, path_nr;
sector_t p;
p = offset;
if (sctx->region_size_bits >= 0)
p >>= sctx->region_size_bits;
else
sector_div(p, sctx->region_size);
switch_get_position(sctx, p, &region_index, &bit);
path_nr = (ACCESS_ONCE(sctx->region_table[region_index]) >> bit) &
((1 << sctx->region_table_entry_bits) - 1);
/* This can only happen if the processor uses non-atomic stores. */
if (unlikely(path_nr >= sctx->nr_paths))
path_nr = 0;
return path_nr;
}
static void switch_region_table_write(struct switch_ctx *sctx, unsigned long region_nr,
unsigned value)
{
unsigned long region_index;
unsigned bit;
region_table_slot_t pte;
switch_get_position(sctx, region_nr, &region_index, &bit);
pte = sctx->region_table[region_index];
pte &= ~((((region_table_slot_t)1 << sctx->region_table_entry_bits) - 1) << bit);
pte |= (region_table_slot_t)value << bit;
sctx->region_table[region_index] = pte;
}
/*
* Fill the region table with an initial round robin pattern.
*/
static void initialise_region_table(struct switch_ctx *sctx)
{
unsigned path_nr = 0;
unsigned long region_nr;
for (region_nr = 0; region_nr < sctx->nr_regions; region_nr++) {
switch_region_table_write(sctx, region_nr, path_nr);
if (++path_nr >= sctx->nr_paths)
path_nr = 0;
}
}
static int parse_path(struct dm_arg_set *as, struct dm_target *ti)
{
struct switch_ctx *sctx = ti->private;
unsigned long long start;
int r;
r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
&sctx->path_list[sctx->nr_paths].dmdev);
if (r) {
ti->error = "Device lookup failed";
return r;
}
if (kstrtoull(dm_shift_arg(as), 10, &start) || start != (sector_t)start) {
ti->error = "Invalid device starting offset";
dm_put_device(ti, sctx->path_list[sctx->nr_paths].dmdev);
return -EINVAL;
}
sctx->path_list[sctx->nr_paths].start = start;
sctx->nr_paths++;
return 0;
}
/*
* Destructor: Don't free the dm_target, just the ti->private data (if any).
*/
static void switch_dtr(struct dm_target *ti)
{
struct switch_ctx *sctx = ti->private;
while (sctx->nr_paths--)
dm_put_device(ti, sctx->path_list[sctx->nr_paths].dmdev);
vfree(sctx->region_table);
kfree(sctx);
}
/*
* Constructor arguments:
* <num_paths> <region_size> <num_optional_args> [<optional_args>...]
* [<dev_path> <offset>]+
*
* Optional args are to allow for future extension: currently this
* parameter must be 0.
*/
static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
static struct dm_arg _args[] = {
{1, (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_path), "Invalid number of paths"},
{1, UINT_MAX, "Invalid region size"},
{0, 0, "Invalid number of optional args"},
};
struct switch_ctx *sctx;
struct dm_arg_set as;
unsigned nr_paths, region_size, nr_optional_args;
int r;
as.argc = argc;
as.argv = argv;
r = dm_read_arg(_args, &as, &nr_paths, &ti->error);
if (r)
return -EINVAL;
r = dm_read_arg(_args + 1, &as, &region_size, &ti->error);
if (r)
return r;
r = dm_read_arg_group(_args + 2, &as, &nr_optional_args, &ti->error);
if (r)
return r;
/* parse optional arguments here, if we add any */
if (as.argc != nr_paths * 2) {
ti->error = "Incorrect number of path arguments";
return -EINVAL;
}
sctx = alloc_switch_ctx(ti, nr_paths, region_size);
if (!sctx) {
ti->error = "Cannot allocate redirection context";
return -ENOMEM;
}
r = dm_set_target_max_io_len(ti, region_size);
if (r)
goto error;
while (as.argc) {
r = parse_path(&as, ti);
if (r)
goto error;
}
r = alloc_region_table(ti, nr_paths);
if (r)
goto error;
initialise_region_table(sctx);
/* For UNMAP, sending the request down any path is sufficient */
ti->num_discard_bios = 1;
return 0;
error:
switch_dtr(ti);
return r;
}
static int switch_map(struct dm_target *ti, struct bio *bio)
{
struct switch_ctx *sctx = ti->private;
sector_t offset = dm_target_offset(ti, bio->bi_sector);
unsigned path_nr = switch_get_path_nr(sctx, offset);
bio->bi_bdev = sctx->path_list[path_nr].dmdev->bdev;
bio->bi_sector = sctx->path_list[path_nr].start + offset;
return DM_MAPIO_REMAPPED;
}
/*
* We need to parse hex numbers in the message as quickly as possible.
*
* This table-based hex parser is faster than a condition-based parser.
* Time to load 1000000 entries:
*
*           table-based parser   condition-based parser
* PA-RISC   0.29s                0.31s
* Opteron   0.0495s              0.0498s
*/
static const unsigned char hex_table[256] = {
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 255, 255, 255, 255, 255, 255,
255, 10, 11, 12, 13, 14, 15, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 10, 11, 12, 13, 14, 15, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255
};
static __always_inline unsigned long parse_hex(const char **string)
{
unsigned char d;
unsigned long r = 0;
while ((d = hex_table[(unsigned char)**string]) < 16) {
r = (r << 4) | d;
(*string)++;
}
return r;
}
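The table-driven parser above can be tried outside the kernel. The sketch below builds the lookup table at run time instead of spelling out all 256 entries; `hex_lut`, `hex_lut_init` and `parse_hex_lut` are hypothetical names chosen for illustration, not part of the patch.

```c
#include <assert.h>
#include <string.h>

/* 255 marks "not a hex digit"; valid digits map to 0..15. */
static unsigned char hex_lut[256];

/* Fill the table once before parsing. */
static void hex_lut_init(void)
{
	int i;

	memset(hex_lut, 255, sizeof(hex_lut));
	for (i = 0; i < 10; i++)
		hex_lut['0' + i] = (unsigned char)i;
	for (i = 0; i < 6; i++) {
		hex_lut['a' + i] = (unsigned char)(10 + i);
		hex_lut['A' + i] = (unsigned char)(10 + i);
	}
}

/*
 * Consume hex digits from *s and advance *s to the first non-digit.
 * As in the kernel helper, there is no overflow check and an empty
 * digit run yields 0.
 */
static unsigned long parse_hex_lut(const char **s)
{
	unsigned char d;
	unsigned long r = 0;

	assert(s && *s);
	while ((d = hex_lut[(unsigned char)**s]) < 16) {
		r = (r << 4) | d;
		(*s)++;
	}
	return r;
}
```

The terminating NUL byte maps to 255 like any other non-digit, so the loop needs no separate end-of-string test.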
static int process_set_region_mappings(struct switch_ctx *sctx,
unsigned argc, char **argv)
{
unsigned i;
unsigned long region_index = 0;
for (i = 1; i < argc; i++) {
unsigned long path_nr;
const char *string = argv[i];
if (*string == ':')
region_index++;
else {
region_index = parse_hex(&string);
if (unlikely(*string != ':')) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
}
string++;
if (unlikely(!*string)) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
path_nr = parse_hex(&string);
if (unlikely(*string)) {
DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
return -EINVAL;
}
if (unlikely(region_index >= sctx->nr_regions)) {
DMWARN("invalid set_region_mappings region number: %lu >= %lu", region_index, sctx->nr_regions);
return -EINVAL;
}
if (unlikely(path_nr >= sctx->nr_paths)) {
DMWARN("invalid set_region_mappings device: %lu >= %u", path_nr, sctx->nr_paths);
return -EINVAL;
}
switch_region_table_write(sctx, region_index, path_nr);
}
return 0;
}
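Each set_region_mappings argument is either `<region>:<path>` in hex, or `:<path>`, which advances to the region after the previous one. A rough userspace sketch of parsing a single argument follows. `parse_mapping` is a hypothetical helper, and it uses `strtoul`, which is more permissive than the kernel's `parse_hex` (it accepts leading whitespace, a sign and a `0x` prefix).

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse one "<region>:<path>" or ":<path>" argument (hex digits).
 * *region carries the running region index between calls.
 * Returns 0 on success, -1 on a malformed argument.
 */
static int parse_mapping(const char *arg, unsigned long *region,
			 unsigned long *path)
{
	char *end;

	assert(arg && region && path);

	if (*arg == ':') {
		/* Shorthand: move on to the next region. */
		(*region)++;
	} else {
		*region = strtoul(arg, &end, 16);
		if (end == arg || *end != ':')
			return -1;
		arg = end;
	}

	arg++; /* skip the ':' separator */
	if (!*arg)
		return -1;

	*path = strtoul(arg, &end, 16);
	if (end == arg || *end)
		return -1;

	return 0;
}
```

A caller would still need to range-check the region and path against the table dimensions before writing, as the function above does.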
/*
* Messages are processed one-at-a-time.
*
* Only set_region_mappings is supported.
*/
static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
{
static DEFINE_MUTEX(message_mutex);
struct switch_ctx *sctx = ti->private;
int r = -EINVAL;
mutex_lock(&message_mutex);
if (!strcasecmp(argv[0], "set_region_mappings"))
r = process_set_region_mappings(sctx, argc, argv);
else
DMWARN("Unrecognised message received.");
mutex_unlock(&message_mutex);
return r;
}
static void switch_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
struct switch_ctx *sctx = ti->private;
unsigned sz = 0;
int path_nr;
switch (type) {
case STATUSTYPE_INFO:
result[0] = '\0';
break;
case STATUSTYPE_TABLE:
DMEMIT("%u %u 0", sctx->nr_paths, sctx->region_size);
for (path_nr = 0; path_nr < sctx->nr_paths; path_nr++)
DMEMIT(" %s %llu", sctx->path_list[path_nr].dmdev->name,
(unsigned long long)sctx->path_list[path_nr].start);
break;
}
}
/*
* Switch ioctl:
*
* Pass all ioctls through to the path for sector 0.
*/
static int switch_ioctl(struct dm_target *ti, unsigned cmd,
unsigned long arg)
{
struct switch_ctx *sctx = ti->private;
struct block_device *bdev;
fmode_t mode;
unsigned path_nr;
int r = 0;
path_nr = switch_get_path_nr(sctx, 0);
bdev = sctx->path_list[path_nr].dmdev->bdev;
mode = sctx->path_list[path_nr].dmdev->mode;
/*
* Only pass ioctls through if the device sizes match exactly.
*/
if (ti->len + sctx->path_list[path_nr].start != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT)
r = scsi_verify_blk_ioctl(NULL, cmd);
return r ? : __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}
static int switch_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct switch_ctx *sctx = ti->private;
int path_nr;
int r;
for (path_nr = 0; path_nr < sctx->nr_paths; path_nr++) {
r = fn(ti, sctx->path_list[path_nr].dmdev,
sctx->path_list[path_nr].start, ti->len, data);
if (r)
return r;
}
return 0;
}
static struct target_type switch_target = {
.name = "switch",
.version = {1, 0, 0},
.module = THIS_MODULE,
.ctr = switch_ctr,
.dtr = switch_dtr,
.map = switch_map,
.message = switch_message,
.status = switch_status,
.ioctl = switch_ioctl,
.iterate_devices = switch_iterate_devices,
};
static int __init dm_switch_init(void)
{
int r;
r = dm_register_target(&switch_target);
if (r < 0)
DMERR("dm_register_target() failed %d", r);
return r;
}
static void __exit dm_switch_exit(void)
{
dm_unregister_target(&switch_target);
}
module_init(dm_switch_init);
module_exit(dm_switch_exit);
MODULE_DESCRIPTION(DM_NAME " dynamic path switching target");
MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
MODULE_AUTHOR("Narendran Ganapathy <Narendran_Ganapathy@dell.com>");
MODULE_AUTHOR("Jim Ramsay <Jim_Ramsay@dell.com>");
MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
MODULE_LICENSE("GPL");


@ -26,22 +26,8 @@
#define KEYS_PER_NODE (NODE_SIZE / sizeof(sector_t))
#define CHILDREN_PER_NODE (KEYS_PER_NODE + 1)
/*
* The table has always exactly one reference from either mapped_device->map
* or hash_cell->new_map. This reference is not counted in table->holders.
* A pair of dm_create_table/dm_destroy_table functions is used for table
* creation/destruction.
*
* Temporary references from the other code increase table->holders. A pair
* of dm_table_get/dm_table_put functions is used to manipulate it.
*
* When the table is about to be destroyed, we wait for table->holders to
* drop to zero.
*/
struct dm_table {
struct mapped_device *md;
atomic_t holders;
unsigned type;
/* btree table */
@ -208,7 +194,6 @@ int dm_table_create(struct dm_table **result, fmode_t mode,
INIT_LIST_HEAD(&t->devices);
INIT_LIST_HEAD(&t->target_callbacks);
atomic_set(&t->holders, 0);
if (!num_targets)
num_targets = KEYS_PER_NODE;
@ -246,10 +231,6 @@ void dm_table_destroy(struct dm_table *t)
if (!t)
return;
while (atomic_read(&t->holders))
msleep(1);
smp_mb();
/* free the indexes */
if (t->depth >= 2)
vfree(t->index[t->depth - 2]);
@ -274,22 +255,6 @@ void dm_table_destroy(struct dm_table *t)
kfree(t);
}
void dm_table_get(struct dm_table *t)
{
atomic_inc(&t->holders);
}
EXPORT_SYMBOL(dm_table_get);
void dm_table_put(struct dm_table *t)
{
if (!t)
return;
smp_mb__before_atomic_dec();
atomic_dec(&t->holders);
}
EXPORT_SYMBOL(dm_table_put);
/*
* Checks to see if we need to extend highs or targets.
*/


@ -451,7 +451,7 @@ static void verity_prefetch_io(struct work_struct *work)
goto no_prefetch_cluster;
if (unlikely(cluster & (cluster - 1)))
cluster = 1 << (fls(cluster) - 1);
cluster = 1 << __fls(cluster);
hash_block_start &= ~(sector_t)(cluster - 1);
hash_block_end |= cluster - 1;
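The hunk above rounds the prefetch cluster down to a power of two and then widens the hash block range to cluster boundaries. The sketch below illustrates that arithmetic in userspace. It swaps the kernel's `__fls` for a portable bit-clearing loop, and `round_down_pow2` and `align_to_cluster` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the highest set bit of x (x must be non-zero). */
static unsigned round_down_pow2(unsigned x)
{
	assert(x != 0);
	while (x & (x - 1))
		x &= x - 1; /* clear the lowest set bit */
	return x;
}

/* Expand [*start, *end] so both ends sit on cluster boundaries. */
static void align_to_cluster(uint64_t *start, uint64_t *end, unsigned cluster)
{
	cluster = round_down_pow2(cluster);
	*start &= ~(uint64_t)(cluster - 1);
	*end |= cluster - 1;
}
```

The cast to a 64-bit type before inverting the mask matters: inverting a 32-bit mask first would clear the upper half of a 64-bit block number.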
@ -695,8 +695,8 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
goto bad;
}
if (sscanf(argv[0], "%d%c", &num, &dummy) != 1 ||
num < 0 || num > 1) {
if (sscanf(argv[0], "%u%c", &num, &dummy) != 1 ||
num > 1) {
ti->error = "Invalid version";
r = -EINVAL;
goto bad;
@ -723,7 +723,7 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
r = -EINVAL;
goto bad;
}
v->data_dev_block_bits = ffs(num) - 1;
v->data_dev_block_bits = __ffs(num);
if (sscanf(argv[4], "%u%c", &num, &dummy) != 1 ||
!num || (num & (num - 1)) ||
@ -733,7 +733,7 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
r = -EINVAL;
goto bad;
}
v->hash_dev_block_bits = ffs(num) - 1;
v->hash_dev_block_bits = __ffs(num);
if (sscanf(argv[5], "%llu%c", &num_ll, &dummy) != 1 ||
(sector_t)(num_ll << (v->data_dev_block_bits - SECTOR_SHIFT))
@ -812,7 +812,7 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
}
v->hash_per_block_bits =
fls((1 << v->hash_dev_block_bits) / v->digest_size) - 1;
__fls((1 << v->hash_dev_block_bits) / v->digest_size);
v->levels = 0;
if (v->data_blocks)
@ -831,9 +831,8 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
for (i = v->levels - 1; i >= 0; i--) {
sector_t s;
v->hash_level_block[i] = hash_position;
s = verity_position_at_level(v, v->data_blocks, i);
s = (s >> v->hash_per_block_bits) +
!!(s & ((1 << v->hash_per_block_bits) - 1));
s = (v->data_blocks + ((sector_t)1 << ((i + 1) * v->hash_per_block_bits)) - 1)
>> ((i + 1) * v->hash_per_block_bits);
if (hash_position + s < hash_position) {
ti->error = "Hash device offset overflow";
r = -E2BIG;
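The replacement expression in this hunk computes the number of hash blocks at tree level `i` as a ceiling division of the data block count by 2^((i + 1) * hash_per_block_bits). A small userspace sketch of that formula follows; `hash_blocks_at_level` is a hypothetical name, and the sketch assumes the sum does not overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of hash blocks needed at a given level of the tree:
 * ceil(data_blocks / 2^((level + 1) * hash_per_block_bits)).
 * Assumes data_blocks is small enough that the addition cannot wrap.
 */
static uint64_t hash_blocks_at_level(uint64_t data_blocks, unsigned level,
				     unsigned hash_per_block_bits)
{
	unsigned shift = (level + 1) * hash_per_block_bits;

	assert(shift < 64);
	return (data_blocks + ((uint64_t)1 << shift) - 1) >> shift;
}
```

For example, with 128 hashes per block (7 bits), 129 data blocks need two hash blocks at level 0 and one at level 1.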


@ -116,16 +116,30 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
#define DMF_NOFLUSH_SUSPENDING 5
#define DMF_MERGE_IS_OPTIONAL 6
/*
* A dummy definition to make RCU happy.
* struct dm_table should never be dereferenced in this file.
*/
struct dm_table {
int undefined__;
};
/*
* Work processed by per-device workqueue.
*/
struct mapped_device {
struct rw_semaphore io_lock;
struct srcu_struct io_barrier;
struct mutex suspend_lock;
rwlock_t map_lock;
atomic_t holders;
atomic_t open_count;
/*
* The current mapping.
* Use dm_get_live_table{_fast} or take suspend_lock for
* dereference.
*/
struct dm_table *map;
unsigned long flags;
struct request_queue *queue;
@ -154,11 +168,6 @@ struct mapped_device {
*/
struct workqueue_struct *wq;
/*
* The current mapping.
*/
struct dm_table *map;
/*
* io objects are allocated from here.
*/
@ -386,10 +395,14 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
{
struct mapped_device *md = bdev->bd_disk->private_data;
struct dm_table *map = dm_get_live_table(md);
int srcu_idx;
struct dm_table *map;
struct dm_target *tgt;
int r = -ENOTTY;
retry:
map = dm_get_live_table(md, &srcu_idx);
if (!map || !dm_table_get_size(map))
goto out;
@ -408,7 +421,12 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
r = tgt->type->ioctl(tgt, cmd, arg);
out:
dm_table_put(map);
dm_put_live_table(md, srcu_idx);
if (r == -ENOTCONN) {
msleep(10);
goto retry;
}
return r;
}
@ -502,20 +520,39 @@ static void queue_io(struct mapped_device *md, struct bio *bio)
/*
* Everyone (including functions in this file), should use this
* function to access the md->map field, and make sure they call
* dm_table_put() when finished.
* dm_put_live_table() when finished.
*/
struct dm_table *dm_get_live_table(struct mapped_device *md)
struct dm_table *dm_get_live_table(struct mapped_device *md, int *srcu_idx) __acquires(md->io_barrier)
{
struct dm_table *t;
unsigned long flags;
*srcu_idx = srcu_read_lock(&md->io_barrier);
read_lock_irqsave(&md->map_lock, flags);
t = md->map;
if (t)
dm_table_get(t);
read_unlock_irqrestore(&md->map_lock, flags);
return srcu_dereference(md->map, &md->io_barrier);
}
return t;
void dm_put_live_table(struct mapped_device *md, int srcu_idx) __releases(md->io_barrier)
{
srcu_read_unlock(&md->io_barrier, srcu_idx);
}
void dm_sync_table(struct mapped_device *md)
{
synchronize_srcu(&md->io_barrier);
synchronize_rcu_expedited();
}
/*
* A fast alternative to dm_get_live_table/dm_put_live_table.
* The caller must not block between these two functions.
*/
static struct dm_table *dm_get_live_table_fast(struct mapped_device *md) __acquires(RCU)
{
rcu_read_lock();
return rcu_dereference(md->map);
}
static void dm_put_live_table_fast(struct mapped_device *md) __releases(RCU)
{
rcu_read_unlock();
}
/*
@ -1349,17 +1386,18 @@ static int __split_and_process_non_flush(struct clone_info *ci)
/*
* Entry point to split a bio into clones and submit them to the targets.
*/
static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
static void __split_and_process_bio(struct mapped_device *md,
struct dm_table *map, struct bio *bio)
{
struct clone_info ci;
int error = 0;
ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
if (unlikely(!map)) {
bio_io_error(bio);
return;
}
ci.map = map;
ci.md = md;
ci.io = alloc_io(md);
ci.io->error = 0;
@ -1386,7 +1424,6 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
/* drop the extra reference count */
dec_pending(ci.io, error);
dm_table_put(ci.map);
}
/*-----------------------------------------------------------------
* CRUD END
@ -1397,7 +1434,7 @@ static int dm_merge_bvec(struct request_queue *q,
struct bio_vec *biovec)
{
struct mapped_device *md = q->queuedata;
struct dm_table *map = dm_get_live_table(md);
struct dm_table *map = dm_get_live_table_fast(md);
struct dm_target *ti;
sector_t max_sectors;
int max_size = 0;
@ -1407,7 +1444,7 @@ static int dm_merge_bvec(struct request_queue *q,
ti = dm_table_find_target(map, bvm->bi_sector);
if (!dm_target_is_valid(ti))
goto out_table;
goto out;
/*
* Find maximum amount of I/O that won't need splitting
@ -1436,10 +1473,8 @@ static int dm_merge_bvec(struct request_queue *q,
max_size = 0;
out_table:
dm_table_put(map);
out:
dm_put_live_table_fast(md);
/*
* Always allow an entire first page
*/
@ -1458,8 +1493,10 @@ static void _dm_request(struct request_queue *q, struct bio *bio)
int rw = bio_data_dir(bio);
struct mapped_device *md = q->queuedata;
int cpu;
int srcu_idx;
struct dm_table *map;
down_read(&md->io_lock);
map = dm_get_live_table(md, &srcu_idx);
cpu = part_stat_lock();
part_stat_inc(cpu, &dm_disk(md)->part0, ios[rw]);
@ -1468,7 +1505,7 @@ static void _dm_request(struct request_queue *q, struct bio *bio)
/* if we're suspended, we have to queue this io for later */
if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
up_read(&md->io_lock);
dm_put_live_table(md, srcu_idx);
if (bio_rw(bio) != READA)
queue_io(md, bio);
@ -1477,8 +1514,8 @@ static void _dm_request(struct request_queue *q, struct bio *bio)
return;
}
__split_and_process_bio(md, bio);
up_read(&md->io_lock);
__split_and_process_bio(md, map, bio);
dm_put_live_table(md, srcu_idx);
return;
}
@ -1664,7 +1701,8 @@ static struct request *dm_start_request(struct mapped_device *md, struct request
static void dm_request_fn(struct request_queue *q)
{
struct mapped_device *md = q->queuedata;
struct dm_table *map = dm_get_live_table(md);
int srcu_idx;
struct dm_table *map = dm_get_live_table(md, &srcu_idx);
struct dm_target *ti;
struct request *rq, *clone;
sector_t pos;
@ -1719,7 +1757,7 @@ requeued:
delay_and_out:
blk_delay_queue(q, HZ / 10);
out:
dm_table_put(map);
dm_put_live_table(md, srcu_idx);
}
int dm_underlying_device_busy(struct request_queue *q)
@ -1732,14 +1770,14 @@ static int dm_lld_busy(struct request_queue *q)
{
int r;
struct mapped_device *md = q->queuedata;
struct dm_table *map = dm_get_live_table(md);
struct dm_table *map = dm_get_live_table_fast(md);
if (!map || test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))
r = 1;
else
r = dm_table_any_busy_target(map);
dm_table_put(map);
dm_put_live_table_fast(md);
return r;
}
@ -1751,7 +1789,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
struct dm_table *map;
if (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
map = dm_get_live_table(md);
map = dm_get_live_table_fast(md);
if (map) {
/*
* Request-based dm cares about only own queue for
@ -1762,9 +1800,8 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
bdi_bits;
else
r = dm_table_any_congested(map, bdi_bits);
dm_table_put(map);
}
dm_put_live_table_fast(md);
}
return r;
@ -1869,12 +1906,14 @@ static struct mapped_device *alloc_dev(int minor)
if (r < 0)
goto bad_minor;
r = init_srcu_struct(&md->io_barrier);
if (r < 0)
goto bad_io_barrier;
md->type = DM_TYPE_NONE;
init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock);
mutex_init(&md->type_lock);
spin_lock_init(&md->deferred_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
atomic_set(&md->event_nr, 0);
@ -1937,6 +1976,8 @@ bad_thread:
bad_disk:
blk_cleanup_queue(md->queue);
bad_queue:
cleanup_srcu_struct(&md->io_barrier);
bad_io_barrier:
free_minor(minor);
bad_minor:
module_put(THIS_MODULE);
@ -1960,6 +2001,7 @@ static void free_dev(struct mapped_device *md)
bioset_free(md->bs);
blk_integrity_unregister(md->disk);
del_gendisk(md->disk);
cleanup_srcu_struct(&md->io_barrier);
free_minor(minor);
spin_lock(&_minor_lock);
@ -2102,7 +2144,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
struct dm_table *old_map;
struct request_queue *q = md->queue;
sector_t size;
unsigned long flags;
int merge_is_optional;
size = dm_table_get_size(t);
@ -2131,9 +2172,8 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
merge_is_optional = dm_table_merge_is_optional(t);
write_lock_irqsave(&md->map_lock, flags);
old_map = md->map;
md->map = t;
rcu_assign_pointer(md->map, t);
md->immutable_target_type = dm_table_get_immutable_target_type(t);
dm_table_set_restrictions(t, q, limits);
@ -2141,7 +2181,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
set_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
else
clear_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
write_unlock_irqrestore(&md->map_lock, flags);
dm_sync_table(md);
return old_map;
}
@ -2152,15 +2192,13 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
static struct dm_table *__unbind(struct mapped_device *md)
{
struct dm_table *map = md->map;
unsigned long flags;
if (!map)
return NULL;
dm_table_event_callback(map, NULL, NULL);
write_lock_irqsave(&md->map_lock, flags);
md->map = NULL;
write_unlock_irqrestore(&md->map_lock, flags);
rcu_assign_pointer(md->map, NULL);
dm_sync_table(md);
return map;
}
@ -2312,11 +2350,12 @@ EXPORT_SYMBOL_GPL(dm_device_name);
static void __dm_destroy(struct mapped_device *md, bool wait)
{
struct dm_table *map;
int srcu_idx;
might_sleep();
spin_lock(&_minor_lock);
map = dm_get_live_table(md);
map = dm_get_live_table(md, &srcu_idx);
idr_replace(&_minor_idr, MINOR_ALLOCED, MINOR(disk_devt(dm_disk(md))));
set_bit(DMF_FREEING, &md->flags);
spin_unlock(&_minor_lock);
@ -2326,6 +2365,9 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
dm_table_postsuspend_targets(map);
}
/* dm_put_live_table must be before msleep, otherwise deadlock is possible */
dm_put_live_table(md, srcu_idx);
/*
* Rare, but there may be I/O requests still going to complete,
* for example. Wait for all references to disappear.
@ -2340,7 +2382,6 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
dm_device_name(md), atomic_read(&md->holders));
dm_sysfs_exit(md);
dm_table_put(map);
dm_table_destroy(__unbind(md));
free_dev(md);
}
@ -2397,8 +2438,10 @@ static void dm_wq_work(struct work_struct *work)
struct mapped_device *md = container_of(work, struct mapped_device,
work);
struct bio *c;
int srcu_idx;
struct dm_table *map;
down_read(&md->io_lock);
map = dm_get_live_table(md, &srcu_idx);
while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
spin_lock_irq(&md->deferred_lock);
@ -2408,17 +2451,13 @@ static void dm_wq_work(struct work_struct *work)
if (!c)
break;
up_read(&md->io_lock);
if (dm_request_based(md))
generic_make_request(c);
else
__split_and_process_bio(md, c);
down_read(&md->io_lock);
__split_and_process_bio(md, map, c);
}
up_read(&md->io_lock);
dm_put_live_table(md, srcu_idx);
}
static void dm_queue_flush(struct mapped_device *md)
@ -2450,10 +2489,10 @@ struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table)
* reappear.
*/
if (dm_table_has_no_data_devices(table)) {
live_map = dm_get_live_table(md);
live_map = dm_get_live_table_fast(md);
if (live_map)
limits = md->queue->limits;
dm_table_put(live_map);
dm_put_live_table_fast(md);
}
if (!live_map) {
@ -2533,7 +2572,7 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
goto out_unlock;
}
map = dm_get_live_table(md);
map = md->map;
/*
* DMF_NOFLUSH_SUSPENDING must be set before presuspend.
@ -2554,7 +2593,7 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
if (!noflush && do_lockfs) {
r = lock_fs(md);
if (r)
goto out;
goto out_unlock;
}
/*
@ -2569,9 +2608,8 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
* (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND and call
* flush_workqueue(md->wq).
*/
down_write(&md->io_lock);
set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
up_write(&md->io_lock);
synchronize_srcu(&md->io_barrier);
/*
* Stop md->queue before flushing md->wq in case request-based
@ -2589,10 +2627,9 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
*/
r = dm_wait_for_completion(md, TASK_INTERRUPTIBLE);
down_write(&md->io_lock);
if (noflush)
clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
up_write(&md->io_lock);
synchronize_srcu(&md->io_barrier);
/* were we interrupted ? */
if (r < 0) {
@ -2602,7 +2639,7 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
start_queue(md->queue);
unlock_fs(md);
goto out; /* pushback list is already flushed, so skip flush */
goto out_unlock; /* pushback list is already flushed, so skip flush */
}
/*
@ -2615,9 +2652,6 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
dm_table_postsuspend_targets(map);
out:
dm_table_put(map);
out_unlock:
mutex_unlock(&md->suspend_lock);
return r;
@ -2632,7 +2666,7 @@ int dm_resume(struct mapped_device *md)
if (!dm_suspended_md(md))
goto out;
map = dm_get_live_table(md);
map = md->map;
if (!map || !dm_table_get_size(map))
goto out;
@ -2656,7 +2690,6 @@ int dm_resume(struct mapped_device *md)
r = 0;
out:
dm_table_put(map);
mutex_unlock(&md->suspend_lock);
return r;


@ -446,9 +446,9 @@ int __must_check dm_set_target_max_io_len(struct dm_target *ti, sector_t len);
/*
* Table reference counting.
*/
struct dm_table *dm_get_live_table(struct mapped_device *md);
void dm_table_get(struct dm_table *t);
void dm_table_put(struct dm_table *t);
struct dm_table *dm_get_live_table(struct mapped_device *md, int *srcu_idx);
void dm_put_live_table(struct mapped_device *md, int srcu_idx);
void dm_sync_table(struct mapped_device *md);
/*
* Queries


@ -267,9 +267,9 @@ enum {
#define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl)
#define DM_VERSION_MAJOR 4
#define DM_VERSION_MINOR 24
#define DM_VERSION_MINOR 25
#define DM_VERSION_PATCHLEVEL 0
#define DM_VERSION_EXTRA "-ioctl (2013-01-15)"
#define DM_VERSION_EXTRA "-ioctl (2013-06-26)"
/* Status bits */
#define DM_READONLY_FLAG (1 << 0) /* In/Out */