A set of device-mapper changes for 3.14.

A lot of attention was paid to improving the thin-provisioning target's
 handling of metadata operation failures and running out of space.  A new
 'error_if_no_space' feature was added to allow users to error IOs rather
 than queue them when either the data or metadata space is exhausted.
 
 Additional fixes/features include:
 - a few fixes to properly support thin metadata device resizing
 - a solution for reliably waiting for a DM device's embedded kobject to
   be released before destroying the device
 - old dm-snapshot is updated to use the dm-bufio interface to take
   advantage of readahead capabilities that improve snapshot activation
 - new dm-cache target tunables to control how quickly data is promoted
   to the cache (fast) device
 - improved write efficiency of cluster mirror target by combining
   userspace flush and mark requests
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJS4GClAAoJEMUj8QotnQNacdEH/2ES5k5itUQRY9jeI+u2zYNP
 vdsRTYf+97+B3jpRvpWbMt4kxT2tjaQbkxJ+iKRHy2MBLFUgq8ruH1RS/Q5VbDeg
 6i6ol8mpNxhlvo/KTMxXqRcWDSxShiMfhz2lXC2bJ7M4sP/iiH85s4Pm4YQ59jpd
 OIX7qj36m/cV/le9YQbexJEEsaj+3genbzL26wyyvtG/rT9fWnXa7clj2gqTdToG
 YCEBCRf5FH9X6W/Oc50nMw5n2dt/MRmPre/MAlOjemeaosB0WJiKaswM25rnvHp0
 JnhxQ2K2C5KIKAWIfwPOImdb9zWW7p1dIRLsS8nHBUQr0BF5VRkmvpnYH4qBtcc=
 =e7e0
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper changes from Mike Snitzer:
 "A lot of attention was paid to improving the thin-provisioning
  target's handling of metadata operation failures and running out of
  space.  A new 'error_if_no_space' feature was added to allow users to
  error IOs rather than queue them when either the data or metadata
  space is exhausted.

  Additional fixes/features include:
   - a few fixes to properly support thin metadata device resizing
   - a solution for reliably waiting for a DM device's embedded kobject
     to be released before destroying the device
   - old dm-snapshot is updated to use the dm-bufio interface to take
     advantage of readahead capabilities that improve snapshot
     activation
   - new dm-cache target tunables to control how quickly data is
     promoted to the cache (fast) device
   - improved write efficiency of cluster mirror target by combining
     userspace flush and mark requests"

* tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
  dm log userspace: allow mark requests to piggyback on flush requests
  dm space map metadata: fix bug in resizing of thin metadata
  dm cache: add policy name to status output
  dm thin: fix pool feature parsing
  dm sysfs: fix a module unload race
  dm snapshot: use dm-bufio prefetch
  dm snapshot: use dm-bufio
  dm snapshot: prepare for switch to using dm-bufio
  dm snapshot: use GFP_KERNEL when initializing exceptions
  dm cache: add block sizes and total cache blocks to status output
  dm btree: add dm_btree_find_lowest_key
  dm space map metadata: fix extending the space map
  dm space map common: make sure new space is used during extend
  dm: wait until embedded kobject is released before destroying a device
  dm: remove pointless kobject comparison in dm_get_from_kobject
  dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
  dm cache policy mq: introduce three promotion threshold tunables
  dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD
  dm thin: fix set_pool_mode exposed pool operation races
  dm thin: eliminate the no_free_space flag
  ...
Linus Torvalds 2014-01-22 20:17:48 -08:00
commit fe41c2c018
29 changed files with 768 additions and 322 deletions


@ -40,8 +40,11 @@ on hit count on entry. The policy aims to take different cache miss
costs into account and to adjust to varying load patterns automatically.
Message and constructor argument pairs are:
'sequential_threshold <#nr_sequential_ios>' and
'random_threshold <#nr_random_ios>'.
'sequential_threshold <#nr_sequential_ios>'
'random_threshold <#nr_random_ios>'
'read_promote_adjustment <value>'
'write_promote_adjustment <value>'
'discard_promote_adjustment <value>'
The sequential threshold indicates the number of contiguous I/Os
required before a stream is treated as sequential. The random threshold
@ -55,6 +58,15 @@ since spindles tend to have good bandwidth. The io_tracker counts
contiguous I/Os to try to spot when the io is in one of these sequential
modes.
Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache. The read, write and discard promote adjustment
tunables allow you to tweak the promotion threshold by adding a small
value based on the io type. They default to 4, 8 and 1 respectively.
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
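As a quick sketch (the device name 'my_cache' below is hypothetical), these
adjustments can be changed at runtime through the target's message interface,
for example:
   dmsetup message my_cache 0 read_promote_adjustment 1
and restored once the cache has warmed:
   dmsetup message my_cache 0 read_promote_adjustment 4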
cleaner
-------


@ -217,36 +217,43 @@ the characteristics of a specific policy, always request it by name.
Status
------
<#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses>
<#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache>
<#dirty> <#features> <features>* <#core args> <core args>* <#policy args>
<policy args>*
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
<#read hits> <#read misses> <#write hits> <#write misses>
<#demotions> <#promotions> <#dirty> <#features> <features>*
<#core args> <core args>* <policy name> <#policy args> <policy args>*
#used metadata blocks : Number of metadata blocks used
#total metadata blocks : Total number of metadata blocks
#read hits : Number of times a READ bio has been mapped
metadata block size : Fixed block size for each metadata block in
sectors
#used metadata blocks : Number of metadata blocks used
#total metadata blocks : Total number of metadata blocks
cache block size : Configurable block size for the cache device
in sectors
#used cache blocks : Number of blocks resident in the cache
#total cache blocks : Total number of cache blocks
#read hits : Number of times a READ bio has been mapped
to the cache
#read misses : Number of times a READ bio has been mapped
#read misses : Number of times a READ bio has been mapped
to the origin
#write hits : Number of times a WRITE bio has been mapped
#write hits : Number of times a WRITE bio has been mapped
to the cache
#write misses : Number of times a WRITE bio has been
#write misses : Number of times a WRITE bio has been
mapped to the origin
#demotions : Number of times a block has been removed
#demotions : Number of times a block has been removed
from the cache
#promotions : Number of times a block has been moved to
#promotions : Number of times a block has been moved to
the cache
#blocks in cache : Number of blocks resident in the cache
#dirty : Number of blocks in the cache that differ
#dirty : Number of blocks in the cache that differ
from the origin
#feature args : Number of feature args to follow
feature args : 'writethrough' (optional)
#core args : Number of core arguments (must be even)
core args : Key/value pairs for tuning the core
#feature args : Number of feature args to follow
feature args : 'writethrough' (optional)
#core args : Number of core arguments (must be even)
core args : Key/value pairs for tuning the core
e.g. migration_threshold
#policy args : Number of policy arguments to follow (must be even)
policy args : Key/value pairs
e.g. 'sequential_threshold 1024
policy name : Name of the policy
#policy args : Number of policy arguments to follow (must be even)
policy args : Key/value pairs
e.g. sequential_threshold
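For illustration only, a status line in this format might look like the
following (hypothetical counters for a writethrough cache using the mq
policy):
   8 72/65536 128 3200/262144 437 65 948 212 7 53 4 1 writethrough 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8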
Messages
--------


@ -235,6 +235,8 @@ i) Constructor
read_only: Don't allow any changes to be made to the pool
metadata.
error_if_no_space: Error IOs, instead of queueing, if no space.
Data block size must be between 64KB (128 sectors) and 1GB
(2097152 sectors) inclusive.
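As a sketch (the device names and sizes here are made up), a pool that errors
rather than queues IO once space is exhausted could be created with:
   dmsetup create pool \
      --table "0 20971520 thin-pool /dev/mapper/pool-meta /dev/mapper/pool-data \
               128 32768 1 error_if_no_space"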
@ -276,6 +278,11 @@ ii) Status
contain the string 'Fail'. The userspace recovery tools
should then be used.
error_if_no_space|queue_if_no_space
If the pool runs out of data or metadata space, the pool will
either queue or error the IO destined to the data device. The
default is to queue the IO until more space is added.
iii) Messages
create_thin <dev id>


@ -176,8 +176,12 @@ config MD_FAULTY
source "drivers/md/bcache/Kconfig"
config BLK_DEV_DM_BUILTIN
boolean
config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
---help---
Device-mapper is a low level volume manager. It works by allowing
people to specify mappings for ranges of logical sectors. Various
@ -238,6 +242,7 @@ config DM_CRYPT
config DM_SNAPSHOT
tristate "Snapshot target"
depends on BLK_DEV_DM
select DM_BUFIO
---help---
Allow volume managers to take writable snapshots of a device.
@ -250,12 +255,12 @@ config DM_THIN_PROVISIONING
Provides thin provisioning and snapshots that share a data store.
config DM_DEBUG_BLOCK_STACK_TRACING
boolean "Keep stack trace of thin provisioning block lock holders"
depends on STACKTRACE_SUPPORT && DM_THIN_PROVISIONING
boolean "Keep stack trace of persistent data block lock holders"
depends on STACKTRACE_SUPPORT && DM_PERSISTENT_DATA
select STACKTRACE
---help---
Enable this for messages that may help debug problems with the
block manager locking used by thin provisioning.
block manager locking used by thin provisioning and caching.
If unsure, say N.


@ -32,6 +32,7 @@ obj-$(CONFIG_MD_FAULTY) += faulty.o
obj-$(CONFIG_BCACHE) += bcache/
obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o


@ -104,6 +104,8 @@ struct dm_bufio_client {
struct list_head reserved_buffers;
unsigned need_reserved_buffers;
unsigned minimum_buffers;
struct hlist_head *cache_hash;
wait_queue_head_t free_buffer_wait;
@ -861,8 +863,8 @@ static void __get_memory_limit(struct dm_bufio_client *c,
buffers = dm_bufio_cache_size_per_client >>
(c->sectors_per_block_bits + SECTOR_SHIFT);
if (buffers < DM_BUFIO_MIN_BUFFERS)
buffers = DM_BUFIO_MIN_BUFFERS;
if (buffers < c->minimum_buffers)
buffers = c->minimum_buffers;
*limit_buffers = buffers;
*threshold_buffers = buffers * DM_BUFIO_WRITEBACK_PERCENT / 100;
@ -1350,6 +1352,34 @@ retry:
}
EXPORT_SYMBOL_GPL(dm_bufio_release_move);
/*
* Free the given buffer.
*
* This is just a hint, if the buffer is in use or dirty, this function
* does nothing.
*/
void dm_bufio_forget(struct dm_bufio_client *c, sector_t block)
{
struct dm_buffer *b;
dm_bufio_lock(c);
b = __find(c, block);
if (b && likely(!b->hold_count) && likely(!b->state)) {
__unlink_buffer(b);
__free_buffer_wake(b);
}
dm_bufio_unlock(c);
}
EXPORT_SYMBOL(dm_bufio_forget);
void dm_bufio_set_minimum_buffers(struct dm_bufio_client *c, unsigned n)
{
c->minimum_buffers = n;
}
EXPORT_SYMBOL(dm_bufio_set_minimum_buffers);
unsigned dm_bufio_get_block_size(struct dm_bufio_client *c)
{
return c->block_size;
@ -1546,6 +1576,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
INIT_LIST_HEAD(&c->reserved_buffers);
c->need_reserved_buffers = reserved_buffers;
c->minimum_buffers = DM_BUFIO_MIN_BUFFERS;
init_waitqueue_head(&c->free_buffer_wait);
c->async_write_error = 0;


@ -108,6 +108,18 @@ int dm_bufio_issue_flush(struct dm_bufio_client *c);
*/
void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block);
/*
* Free the given buffer.
* This is just a hint, if the buffer is in use or dirty, this function
* does nothing.
*/
void dm_bufio_forget(struct dm_bufio_client *c, sector_t block);
/*
* Set the minimum number of buffers before cleanup happens.
*/
void dm_bufio_set_minimum_buffers(struct dm_bufio_client *c, unsigned n);
unsigned dm_bufio_get_block_size(struct dm_bufio_client *c);
sector_t dm_bufio_get_device_size(struct dm_bufio_client *c);
sector_t dm_bufio_get_block_number(struct dm_buffer *b);
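A rough usage sketch of the extended client API follows (the caller, bdev and
block size are hypothetical; the real consumer added by this series is
read_exceptions() in dm-snap-persistent, shown further below):
#include <linux/blkdev.h>
#include <linux/err.h>
#include "dm-bufio.h"
/* Hypothetical single-pass scan of one on-disk block via dm-bufio. */
static int example_scan_block(struct block_device *bdev, unsigned block_size)
{
	struct dm_bufio_client *c;
	struct dm_buffer *bp;
	void *data;
	sector_t block = 0;
	/* block_size is in bytes; 1 reserved buffer, no aux data or callbacks. */
	c = dm_bufio_client_create(bdev, block_size, 1, 0, NULL, NULL);
	if (IS_ERR(c))
		return PTR_ERR(c);
	/* Keep one current buffer plus a readahead window resident. */
	dm_bufio_set_minimum_buffers(c, 1 + 12);
	/* Hint that the following block will be wanted soon. */
	dm_bufio_prefetch(c, block + 1, 1);
	data = dm_bufio_read(c, block, &bp);
	if (IS_ERR(data)) {
		dm_bufio_client_destroy(c);
		return PTR_ERR(data);
	}
	/* ... consume the block contents pointed to by data ... */
	dm_bufio_release(bp);
	/* Single-use data: drop the buffer instead of keeping it cached. */
	dm_bufio_forget(c, block);
	dm_bufio_client_destroy(c);
	return 0;
}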

drivers/md/dm-builtin.c (new file, 48 lines)

@ -0,0 +1,48 @@
#include "dm.h"
/*
* The kobject release method must not be placed in the module itself,
* otherwise we are subject to module unload races.
*
* The release method is called when the last reference to the kobject is
* dropped. It may be called by any other kernel code that drops the last
* reference.
*
* The release method suffers from a module unload race. We may prevent the
* module from being unloaded at the start of the release method (using
* increased module reference count or synchronizing against the release
* method), however there is no way to prevent the module from being
* unloaded at the end of the release method.
*
* If this code were placed in the dm module, the following race may
* happen:
* 1. Some other process takes a reference to dm kobject
* 2. The user issues ioctl function to unload the dm device
* 3. dm_sysfs_exit calls kobject_put, however the object is not released
* because of the other reference taken at step 1
* 4. dm_sysfs_exit waits on the completion
* 5. The other process that took the reference in step 1 drops it,
* dm_kobject_release is called from this process
* 6. dm_kobject_release calls complete()
* 7. a reschedule happens before dm_kobject_release returns
* 8. dm_sysfs_exit continues, the dm device is unloaded, module reference
* count is decremented
* 9. The user unloads the dm module
* 10. The other process that was rescheduled in step 7 continues to run,
* it is now executing code in unloaded module, so it crashes
*
* Note that if the process that takes the foreign reference to dm kobject
* has a low priority and the system is sufficiently loaded with
* higher-priority processes that prevent the low-priority process from
* being scheduled long enough, this bug may really happen.
*
* In order to fix this module unload race, we place the release method
* into a helper code that is compiled directly into the kernel.
*/
void dm_kobject_release(struct kobject *kobj)
{
complete(dm_get_completion_from_kobject(kobj));
}
EXPORT_SYMBOL(dm_kobject_release);


@ -287,9 +287,8 @@ static struct entry *alloc_entry(struct entry_pool *ep)
static struct entry *alloc_particular_entry(struct entry_pool *ep, dm_cblock_t cblock)
{
struct entry *e = ep->entries + from_cblock(cblock);
list_del(&e->list);
INIT_LIST_HEAD(&e->list);
list_del_init(&e->list);
INIT_HLIST_NODE(&e->hlist);
ep->nr_allocated++;
@ -391,6 +390,10 @@ struct mq_policy {
*/
unsigned promote_threshold;
unsigned discard_promote_adjustment;
unsigned read_promote_adjustment;
unsigned write_promote_adjustment;
/*
* The hash table allows us to quickly find an entry by origin
* block. Both pre_cache and cache entries are in here.
@ -400,6 +403,10 @@ struct mq_policy {
struct hlist_head *table;
};
#define DEFAULT_DISCARD_PROMOTE_ADJUSTMENT 1
#define DEFAULT_READ_PROMOTE_ADJUSTMENT 4
#define DEFAULT_WRITE_PROMOTE_ADJUSTMENT 8
/*----------------------------------------------------------------*/
/*
@ -642,25 +649,21 @@ static int demote_cblock(struct mq_policy *mq, dm_oblock_t *oblock)
* We bias towards reads, since they can be demoted at no cost if they
* haven't been dirtied.
*/
#define DISCARDED_PROMOTE_THRESHOLD 1
#define READ_PROMOTE_THRESHOLD 4
#define WRITE_PROMOTE_THRESHOLD 8
static unsigned adjusted_promote_threshold(struct mq_policy *mq,
bool discarded_oblock, int data_dir)
{
if (data_dir == READ)
return mq->promote_threshold + READ_PROMOTE_THRESHOLD;
return mq->promote_threshold + mq->read_promote_adjustment;
if (discarded_oblock && (any_free_cblocks(mq) || any_clean_cblocks(mq))) {
/*
* We don't need to do any copying at all, so give this a
* very low threshold.
*/
return DISCARDED_PROMOTE_THRESHOLD;
return mq->discard_promote_adjustment;
}
return mq->promote_threshold + WRITE_PROMOTE_THRESHOLD;
return mq->promote_threshold + mq->write_promote_adjustment;
}
static bool should_promote(struct mq_policy *mq, struct entry *e,
@ -809,7 +812,7 @@ static int no_entry_found(struct mq_policy *mq, dm_oblock_t oblock,
bool can_migrate, bool discarded_oblock,
int data_dir, struct policy_result *result)
{
if (adjusted_promote_threshold(mq, discarded_oblock, data_dir) == 1) {
if (adjusted_promote_threshold(mq, discarded_oblock, data_dir) <= 1) {
if (can_migrate)
insert_in_cache(mq, oblock, result);
else
@ -1135,20 +1138,28 @@ static int mq_set_config_value(struct dm_cache_policy *p,
const char *key, const char *value)
{
struct mq_policy *mq = to_mq_policy(p);
enum io_pattern pattern;
unsigned long tmp;
if (!strcasecmp(key, "random_threshold"))
pattern = PATTERN_RANDOM;
else if (!strcasecmp(key, "sequential_threshold"))
pattern = PATTERN_SEQUENTIAL;
else
return -EINVAL;
if (kstrtoul(value, 10, &tmp))
return -EINVAL;
mq->tracker.thresholds[pattern] = tmp;
if (!strcasecmp(key, "random_threshold")) {
mq->tracker.thresholds[PATTERN_RANDOM] = tmp;
} else if (!strcasecmp(key, "sequential_threshold")) {
mq->tracker.thresholds[PATTERN_SEQUENTIAL] = tmp;
} else if (!strcasecmp(key, "discard_promote_adjustment"))
mq->discard_promote_adjustment = tmp;
else if (!strcasecmp(key, "read_promote_adjustment"))
mq->read_promote_adjustment = tmp;
else if (!strcasecmp(key, "write_promote_adjustment"))
mq->write_promote_adjustment = tmp;
else
return -EINVAL;
return 0;
}
@ -1158,9 +1169,16 @@ static int mq_emit_config_values(struct dm_cache_policy *p, char *result, unsign
ssize_t sz = 0;
struct mq_policy *mq = to_mq_policy(p);
DMEMIT("4 random_threshold %u sequential_threshold %u",
DMEMIT("10 random_threshold %u "
"sequential_threshold %u "
"discard_promote_adjustment %u "
"read_promote_adjustment %u "
"write_promote_adjustment %u",
mq->tracker.thresholds[PATTERN_RANDOM],
mq->tracker.thresholds[PATTERN_SEQUENTIAL]);
mq->tracker.thresholds[PATTERN_SEQUENTIAL],
mq->discard_promote_adjustment,
mq->read_promote_adjustment,
mq->write_promote_adjustment);
return 0;
}
@ -1213,6 +1231,9 @@ static struct dm_cache_policy *mq_create(dm_cblock_t cache_size,
mq->hit_count = 0;
mq->generation = 0;
mq->promote_threshold = 0;
mq->discard_promote_adjustment = DEFAULT_DISCARD_PROMOTE_ADJUSTMENT;
mq->read_promote_adjustment = DEFAULT_READ_PROMOTE_ADJUSTMENT;
mq->write_promote_adjustment = DEFAULT_WRITE_PROMOTE_ADJUSTMENT;
mutex_init(&mq->lock);
spin_lock_init(&mq->tick_lock);
@ -1244,7 +1265,7 @@ bad_pre_cache_init:
static struct dm_cache_policy_type mq_policy_type = {
.name = "mq",
.version = {1, 1, 0},
.version = {1, 2, 0},
.hint_size = 4,
.owner = THIS_MODULE,
.create = mq_create
@ -1252,10 +1273,11 @@ static struct dm_cache_policy_type mq_policy_type = {
static struct dm_cache_policy_type default_policy_type = {
.name = "default",
.version = {1, 1, 0},
.version = {1, 2, 0},
.hint_size = 4,
.owner = THIS_MODULE,
.create = mq_create
.create = mq_create,
.real = &mq_policy_type
};
static int __init mq_init(void)


@ -146,6 +146,10 @@ const char *dm_cache_policy_get_name(struct dm_cache_policy *p)
{
struct dm_cache_policy_type *t = p->private;
/* if t->real is set then an alias was used (e.g. "default") */
if (t->real)
return t->real->name;
return t->name;
}
EXPORT_SYMBOL_GPL(dm_cache_policy_get_name);


@ -222,6 +222,12 @@ struct dm_cache_policy_type {
char name[CACHE_POLICY_NAME_SIZE];
unsigned version[CACHE_POLICY_VERSION_SIZE];
/*
* For use by an alias dm_cache_policy_type to point to the
* real dm_cache_policy_type.
*/
struct dm_cache_policy_type *real;
/*
* Policies may store a hint for each cache block.
* Currently the size of this hint must be 0 or 4 bytes but we


@ -2826,12 +2826,13 @@ static void cache_resume(struct dm_target *ti)
/*
* Status format:
*
* <#used metadata blocks>/<#total metadata blocks>
* <metadata block size> <#used metadata blocks>/<#total metadata blocks>
* <cache block size> <#used cache blocks>/<#total cache blocks>
* <#read hits> <#read misses> <#write hits> <#write misses>
* <#demotions> <#promotions> <#blocks in cache> <#dirty>
* <#demotions> <#promotions> <#dirty>
* <#features> <features>*
* <#core args> <core args>
* <#policy args> <policy args>*
* <policy name> <#policy args> <policy args>*
*/
static void cache_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
@ -2869,17 +2870,20 @@ static void cache_status(struct dm_target *ti, status_type_t type,
residency = policy_residency(cache->policy);
DMEMIT("%llu/%llu %u %u %u %u %u %u %llu %u ",
DMEMIT("%u %llu/%llu %u %llu/%llu %u %u %u %u %u %u %llu ",
(unsigned)(DM_CACHE_METADATA_BLOCK_SIZE >> SECTOR_SHIFT),
(unsigned long long)(nr_blocks_metadata - nr_free_blocks_metadata),
(unsigned long long)nr_blocks_metadata,
cache->sectors_per_block,
(unsigned long long) from_cblock(residency),
(unsigned long long) from_cblock(cache->cache_size),
(unsigned) atomic_read(&cache->stats.read_hit),
(unsigned) atomic_read(&cache->stats.read_miss),
(unsigned) atomic_read(&cache->stats.write_hit),
(unsigned) atomic_read(&cache->stats.write_miss),
(unsigned) atomic_read(&cache->stats.demotion),
(unsigned) atomic_read(&cache->stats.promotion),
(unsigned long long) from_cblock(residency),
cache->nr_dirty);
(unsigned long long) from_cblock(cache->nr_dirty));
if (writethrough_mode(&cache->features))
DMEMIT("1 writethrough ");
@ -2896,6 +2900,8 @@ static void cache_status(struct dm_target *ti, status_type_t type,
}
DMEMIT("2 migration_threshold %llu ", (unsigned long long) cache->migration_threshold);
DMEMIT("%s ", dm_cache_policy_get_name(cache->policy));
if (sz < maxlen) {
r = policy_emit_config_values(cache->policy, result + sz, maxlen - sz);
if (r)
@ -3129,7 +3135,7 @@ static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
static struct target_type cache_target = {
.name = "cache",
.version = {1, 2, 0},
.version = {1, 3, 0},
.module = THIS_MODULE,
.ctr = cache_ctr,
.dtr = cache_dtr,


@ -24,7 +24,6 @@ struct delay_c {
struct work_struct flush_expired_bios;
struct list_head delayed_bios;
atomic_t may_delay;
mempool_t *delayed_pool;
struct dm_dev *dev_read;
sector_t start_read;
@ -40,14 +39,11 @@ struct delay_c {
struct dm_delay_info {
struct delay_c *context;
struct list_head list;
struct bio *bio;
unsigned long expires;
};
static DEFINE_MUTEX(delayed_bios_lock);
static struct kmem_cache *delayed_cache;
static void handle_delayed_timer(unsigned long data)
{
struct delay_c *dc = (struct delay_c *)data;
@ -87,13 +83,14 @@ static struct bio *flush_delayed_bios(struct delay_c *dc, int flush_all)
mutex_lock(&delayed_bios_lock);
list_for_each_entry_safe(delayed, next, &dc->delayed_bios, list) {
if (flush_all || time_after_eq(jiffies, delayed->expires)) {
struct bio *bio = dm_bio_from_per_bio_data(delayed,
sizeof(struct dm_delay_info));
list_del(&delayed->list);
bio_list_add(&flush_bios, delayed->bio);
if ((bio_data_dir(delayed->bio) == WRITE))
bio_list_add(&flush_bios, bio);
if ((bio_data_dir(bio) == WRITE))
delayed->context->writes--;
else
delayed->context->reads--;
mempool_free(delayed, dc->delayed_pool);
continue;
}
@ -185,12 +182,6 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv)
}
out:
dc->delayed_pool = mempool_create_slab_pool(128, delayed_cache);
if (!dc->delayed_pool) {
DMERR("Couldn't create delayed bio pool.");
goto bad_dev_write;
}
dc->kdelayd_wq = alloc_workqueue("kdelayd", WQ_MEM_RECLAIM, 0);
if (!dc->kdelayd_wq) {
DMERR("Couldn't start kdelayd");
@ -206,12 +197,11 @@ out:
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->per_bio_data_size = sizeof(struct dm_delay_info);
ti->private = dc;
return 0;
bad_queue:
mempool_destroy(dc->delayed_pool);
bad_dev_write:
if (dc->dev_write)
dm_put_device(ti, dc->dev_write);
bad_dev_read:
@ -232,7 +222,6 @@ static void delay_dtr(struct dm_target *ti)
if (dc->dev_write)
dm_put_device(ti, dc->dev_write);
mempool_destroy(dc->delayed_pool);
kfree(dc);
}
@ -244,10 +233,9 @@ static int delay_bio(struct delay_c *dc, int delay, struct bio *bio)
if (!delay || !atomic_read(&dc->may_delay))
return 1;
delayed = mempool_alloc(dc->delayed_pool, GFP_NOIO);
delayed = dm_per_bio_data(bio, sizeof(struct dm_delay_info));
delayed->context = dc;
delayed->bio = bio;
delayed->expires = expires = jiffies + (delay * HZ / 1000);
mutex_lock(&delayed_bios_lock);
@ -356,13 +344,7 @@ static struct target_type delay_target = {
static int __init dm_delay_init(void)
{
int r = -ENOMEM;
delayed_cache = KMEM_CACHE(dm_delay_info, 0);
if (!delayed_cache) {
DMERR("Couldn't create delayed bio cache.");
goto bad_memcache;
}
int r;
r = dm_register_target(&delay_target);
if (r < 0) {
@ -373,15 +355,12 @@ static int __init dm_delay_init(void)
return 0;
bad_register:
kmem_cache_destroy(delayed_cache);
bad_memcache:
return r;
}
static void __exit dm_delay_exit(void)
{
dm_unregister_target(&delay_target);
kmem_cache_destroy(delayed_cache);
}
/* Module hooks */


@ -10,10 +10,11 @@
#include <linux/device-mapper.h>
#include <linux/dm-log-userspace.h>
#include <linux/module.h>
#include <linux/workqueue.h>
#include "dm-log-userspace-transfer.h"
#define DM_LOG_USERSPACE_VSN "1.1.0"
#define DM_LOG_USERSPACE_VSN "1.3.0"
struct flush_entry {
int type;
@ -58,6 +59,18 @@ struct log_c {
spinlock_t flush_lock;
struct list_head mark_list;
struct list_head clear_list;
/*
* Workqueue for flush of clear region requests.
*/
struct workqueue_struct *dmlog_wq;
struct delayed_work flush_log_work;
atomic_t sched_flush;
/*
* Combine userspace flush and mark requests for efficiency.
*/
uint32_t integrated_flush;
};
static mempool_t *flush_entry_pool;
@ -122,6 +135,9 @@ static int build_constructor_string(struct dm_target *ti,
*ctr_str = NULL;
/*
* Determine overall size of the string.
*/
for (i = 0, str_size = 0; i < argc; i++)
str_size += strlen(argv[i]) + 1; /* +1 for space between args */
@ -141,18 +157,39 @@ static int build_constructor_string(struct dm_target *ti,
return str_size;
}
static void do_flush(struct work_struct *work)
{
int r;
struct log_c *lc = container_of(work, struct log_c, flush_log_work.work);
atomic_set(&lc->sched_flush, 0);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_FLUSH, NULL, 0, NULL, NULL);
if (r)
dm_table_event(lc->ti->table);
}
/*
* userspace_ctr
*
* argv contains:
* <UUID> <other args>
* Where 'other args' is the userspace implementation specific log
* arguments. An example might be:
* <UUID> clustered-disk <arg count> <log dev> <region_size> [[no]sync]
* <UUID> [integrated_flush] <other args>
* Where 'other args' are the userspace implementation-specific log
* arguments.
*
* So, this module will strip off the <UUID> for identification purposes
* when communicating with userspace about a log; but will pass on everything
* else.
* Example:
* <UUID> [integrated_flush] clustered-disk <arg count> <log dev>
* <region_size> [[no]sync]
*
* This module strips off the <UUID> and uses it for identification
* purposes when communicating with userspace about a log.
*
* If integrated_flush is defined, the kernel combines flush
* and mark requests.
*
* The rest of the line, beginning with 'clustered-disk', is passed
* to the userspace ctr function.
*/
static int userspace_ctr(struct dm_dirty_log *log, struct dm_target *ti,
unsigned argc, char **argv)
@ -188,12 +225,22 @@ static int userspace_ctr(struct dm_dirty_log *log, struct dm_target *ti,
return -EINVAL;
}
lc->usr_argc = argc;
strncpy(lc->uuid, argv[0], DM_UUID_LEN);
argc--;
argv++;
spin_lock_init(&lc->flush_lock);
INIT_LIST_HEAD(&lc->mark_list);
INIT_LIST_HEAD(&lc->clear_list);
str_size = build_constructor_string(ti, argc - 1, argv + 1, &ctr_str);
if (!strcasecmp(argv[0], "integrated_flush")) {
lc->integrated_flush = 1;
argc--;
argv++;
}
str_size = build_constructor_string(ti, argc, argv, &ctr_str);
if (str_size < 0) {
kfree(lc);
return str_size;
@ -246,6 +293,19 @@ static int userspace_ctr(struct dm_dirty_log *log, struct dm_target *ti,
DMERR("Failed to register %s with device-mapper",
devices_rdata);
}
if (lc->integrated_flush) {
lc->dmlog_wq = alloc_workqueue("dmlogd", WQ_MEM_RECLAIM, 0);
if (!lc->dmlog_wq) {
DMERR("couldn't start dmlogd");
r = -ENOMEM;
goto out;
}
INIT_DELAYED_WORK(&lc->flush_log_work, do_flush);
atomic_set(&lc->sched_flush, 0);
}
out:
kfree(devices_rdata);
if (r) {
@ -253,7 +313,6 @@ out:
kfree(ctr_str);
} else {
lc->usr_argv_str = ctr_str;
lc->usr_argc = argc;
log->context = lc;
}
@ -264,9 +323,16 @@ static void userspace_dtr(struct dm_dirty_log *log)
{
struct log_c *lc = log->context;
if (lc->integrated_flush) {
/* flush workqueue */
if (atomic_read(&lc->sched_flush))
flush_delayed_work(&lc->flush_log_work);
destroy_workqueue(lc->dmlog_wq);
}
(void) dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_DTR,
NULL, 0,
NULL, NULL);
NULL, 0, NULL, NULL);
if (lc->log_dev)
dm_put_device(lc->ti, lc->log_dev);
@ -283,8 +349,7 @@ static int userspace_presuspend(struct dm_dirty_log *log)
struct log_c *lc = log->context;
r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_PRESUSPEND,
NULL, 0,
NULL, NULL);
NULL, 0, NULL, NULL);
return r;
}
@ -294,9 +359,14 @@ static int userspace_postsuspend(struct dm_dirty_log *log)
int r;
struct log_c *lc = log->context;
/*
* Run planned flush earlier.
*/
if (lc->integrated_flush && atomic_read(&lc->sched_flush))
flush_delayed_work(&lc->flush_log_work);
r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_POSTSUSPEND,
NULL, 0,
NULL, NULL);
NULL, 0, NULL, NULL);
return r;
}
@ -308,8 +378,7 @@ static int userspace_resume(struct dm_dirty_log *log)
lc->in_sync_hint = 0;
r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_RESUME,
NULL, 0,
NULL, NULL);
NULL, 0, NULL, NULL);
return r;
}
@ -405,7 +474,8 @@ static int flush_one_by_one(struct log_c *lc, struct list_head *flush_list)
return r;
}
static int flush_by_group(struct log_c *lc, struct list_head *flush_list)
static int flush_by_group(struct log_c *lc, struct list_head *flush_list,
int flush_with_payload)
{
int r = 0;
int count;
@ -431,15 +501,29 @@ static int flush_by_group(struct log_c *lc, struct list_head *flush_list)
break;
}
r = userspace_do_request(lc, lc->uuid, type,
(char *)(group),
count * sizeof(uint64_t),
NULL, NULL);
if (r) {
/* Group send failed. Attempt one-by-one. */
list_splice_init(&tmp_list, flush_list);
r = flush_one_by_one(lc, flush_list);
break;
if (flush_with_payload) {
r = userspace_do_request(lc, lc->uuid, DM_ULOG_FLUSH,
(char *)(group),
count * sizeof(uint64_t),
NULL, NULL);
/*
* Integrated flush failed.
*/
if (r)
break;
} else {
r = userspace_do_request(lc, lc->uuid, type,
(char *)(group),
count * sizeof(uint64_t),
NULL, NULL);
if (r) {
/*
* Group send failed. Attempt one-by-one.
*/
list_splice_init(&tmp_list, flush_list);
r = flush_one_by_one(lc, flush_list);
break;
}
}
}
@ -476,6 +560,8 @@ static int userspace_flush(struct dm_dirty_log *log)
struct log_c *lc = log->context;
LIST_HEAD(mark_list);
LIST_HEAD(clear_list);
int mark_list_is_empty;
int clear_list_is_empty;
struct flush_entry *fe, *tmp_fe;
spin_lock_irqsave(&lc->flush_lock, flags);
@ -483,23 +569,51 @@ static int userspace_flush(struct dm_dirty_log *log)
list_splice_init(&lc->clear_list, &clear_list);
spin_unlock_irqrestore(&lc->flush_lock, flags);
if (list_empty(&mark_list) && list_empty(&clear_list))
mark_list_is_empty = list_empty(&mark_list);
clear_list_is_empty = list_empty(&clear_list);
if (mark_list_is_empty && clear_list_is_empty)
return 0;
r = flush_by_group(lc, &mark_list);
r = flush_by_group(lc, &clear_list, 0);
if (r)
goto fail;
goto out;
r = flush_by_group(lc, &clear_list);
if (r)
goto fail;
if (!lc->integrated_flush) {
r = flush_by_group(lc, &mark_list, 0);
if (r)
goto out;
r = userspace_do_request(lc, lc->uuid, DM_ULOG_FLUSH,
NULL, 0, NULL, NULL);
goto out;
}
r = userspace_do_request(lc, lc->uuid, DM_ULOG_FLUSH,
NULL, 0, NULL, NULL);
fail:
/*
* We can safely remove these entries, even if failure.
* Send integrated flush request with mark_list as payload.
*/
r = flush_by_group(lc, &mark_list, 1);
if (r)
goto out;
if (mark_list_is_empty && !atomic_read(&lc->sched_flush)) {
/*
* When there are only clear region requests,
* we schedule a flush in the future.
*/
queue_delayed_work(lc->dmlog_wq, &lc->flush_log_work, 3 * HZ);
atomic_set(&lc->sched_flush, 1);
} else {
/*
* Cancel pending flush because we
* have already flushed in mark_region.
*/
cancel_delayed_work(&lc->flush_log_work);
atomic_set(&lc->sched_flush, 0);
}
out:
/*
* We can safely remove these entries, even after failure.
* Calling code will receive an error and will know that
* the log facility has failed.
*/
@ -603,8 +717,7 @@ static int userspace_get_resync_work(struct dm_dirty_log *log, region_t *region)
rdata_size = sizeof(pkg);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_GET_RESYNC_WORK,
NULL, 0,
(char *)&pkg, &rdata_size);
NULL, 0, (char *)&pkg, &rdata_size);
*region = pkg.r;
return (r) ? r : (int)pkg.i;
@ -630,8 +743,7 @@ static void userspace_set_region_sync(struct dm_dirty_log *log,
pkg.i = (int64_t)in_sync;
r = userspace_do_request(lc, lc->uuid, DM_ULOG_SET_REGION_SYNC,
(char *)&pkg, sizeof(pkg),
NULL, NULL);
(char *)&pkg, sizeof(pkg), NULL, NULL);
/*
* It would be nice to be able to report failures.
@ -657,8 +769,7 @@ static region_t userspace_get_sync_count(struct dm_dirty_log *log)
rdata_size = sizeof(sync_count);
r = userspace_do_request(lc, lc->uuid, DM_ULOG_GET_SYNC_COUNT,
NULL, 0,
(char *)&sync_count, &rdata_size);
NULL, 0, (char *)&sync_count, &rdata_size);
if (r)
return 0;
@ -685,8 +796,7 @@ static int userspace_status(struct dm_dirty_log *log, status_type_t status_type,
switch (status_type) {
case STATUSTYPE_INFO:
r = userspace_do_request(lc, lc->uuid, DM_ULOG_STATUS_INFO,
NULL, 0,
result, &sz);
NULL, 0, result, &sz);
if (r) {
sz = 0;
@ -699,8 +809,10 @@ static int userspace_status(struct dm_dirty_log *log, status_type_t status_type,
BUG_ON(!table_args); /* There will always be a ' ' */
table_args++;
DMEMIT("%s %u %s %s ", log->type->name, lc->usr_argc,
lc->uuid, table_args);
DMEMIT("%s %u %s ", log->type->name, lc->usr_argc, lc->uuid);
if (lc->integrated_flush)
DMEMIT("integrated_flush ");
DMEMIT("%s ", table_args);
break;
}
return (r) ? 0 : (int)sz;


@ -13,10 +13,13 @@
#include <linux/export.h>
#include <linux/slab.h>
#include <linux/dm-io.h>
#include "dm-bufio.h"
#define DM_MSG_PREFIX "persistent snapshot"
#define DM_CHUNK_SIZE_DEFAULT_SECTORS 32 /* 16KB */
#define DM_PREFETCH_CHUNKS 12
/*-----------------------------------------------------------------
* Persistent snapshots, by persistent we mean that the snapshot
* will survive a reboot.
@ -257,6 +260,7 @@ static int chunk_io(struct pstore *ps, void *area, chunk_t chunk, int rw,
INIT_WORK_ONSTACK(&req.work, do_metadata);
queue_work(ps->metadata_wq, &req.work);
flush_workqueue(ps->metadata_wq);
destroy_work_on_stack(&req.work);
return req.result;
}
@ -401,17 +405,18 @@ static int write_header(struct pstore *ps)
/*
* Access functions for the disk exceptions, these do the endian conversions.
*/
static struct disk_exception *get_exception(struct pstore *ps, uint32_t index)
static struct disk_exception *get_exception(struct pstore *ps, void *ps_area,
uint32_t index)
{
BUG_ON(index >= ps->exceptions_per_area);
return ((struct disk_exception *) ps->area) + index;
return ((struct disk_exception *) ps_area) + index;
}
static void read_exception(struct pstore *ps,
static void read_exception(struct pstore *ps, void *ps_area,
uint32_t index, struct core_exception *result)
{
struct disk_exception *de = get_exception(ps, index);
struct disk_exception *de = get_exception(ps, ps_area, index);
/* copy it */
result->old_chunk = le64_to_cpu(de->old_chunk);
@ -421,7 +426,7 @@ static void read_exception(struct pstore *ps,
static void write_exception(struct pstore *ps,
uint32_t index, struct core_exception *e)
{
struct disk_exception *de = get_exception(ps, index);
struct disk_exception *de = get_exception(ps, ps->area, index);
/* copy it */
de->old_chunk = cpu_to_le64(e->old_chunk);
@ -430,7 +435,7 @@ static void write_exception(struct pstore *ps,
static void clear_exception(struct pstore *ps, uint32_t index)
{
struct disk_exception *de = get_exception(ps, index);
struct disk_exception *de = get_exception(ps, ps->area, index);
/* clear it */
de->old_chunk = 0;
@ -442,7 +447,7 @@ static void clear_exception(struct pstore *ps, uint32_t index)
* 'full' is filled in to indicate if the area has been
* filled.
*/
static int insert_exceptions(struct pstore *ps,
static int insert_exceptions(struct pstore *ps, void *ps_area,
int (*callback)(void *callback_context,
chunk_t old, chunk_t new),
void *callback_context,
@ -456,7 +461,7 @@ static int insert_exceptions(struct pstore *ps,
*full = 1;
for (i = 0; i < ps->exceptions_per_area; i++) {
read_exception(ps, i, &e);
read_exception(ps, ps_area, i, &e);
/*
* If the new_chunk is pointing at the start of
@ -493,26 +498,72 @@ static int read_exceptions(struct pstore *ps,
void *callback_context)
{
int r, full = 1;
struct dm_bufio_client *client;
chunk_t prefetch_area = 0;
client = dm_bufio_client_create(dm_snap_cow(ps->store->snap)->bdev,
ps->store->chunk_size << SECTOR_SHIFT,
1, 0, NULL, NULL);
if (IS_ERR(client))
return PTR_ERR(client);
/*
* Setup for one current buffer + desired readahead buffers.
*/
dm_bufio_set_minimum_buffers(client, 1 + DM_PREFETCH_CHUNKS);
/*
* Keep reading chunks and inserting exceptions until
* we find a partially full area.
*/
for (ps->current_area = 0; full; ps->current_area++) {
r = area_io(ps, READ);
if (r)
return r;
struct dm_buffer *bp;
void *area;
chunk_t chunk;
r = insert_exceptions(ps, callback, callback_context, &full);
if (r)
return r;
if (unlikely(prefetch_area < ps->current_area))
prefetch_area = ps->current_area;
if (DM_PREFETCH_CHUNKS) do {
chunk_t pf_chunk = area_location(ps, prefetch_area);
if (unlikely(pf_chunk >= dm_bufio_get_device_size(client)))
break;
dm_bufio_prefetch(client, pf_chunk, 1);
prefetch_area++;
if (unlikely(!prefetch_area))
break;
} while (prefetch_area <= ps->current_area + DM_PREFETCH_CHUNKS);
chunk = area_location(ps, ps->current_area);
area = dm_bufio_read(client, chunk, &bp);
if (unlikely(IS_ERR(area))) {
r = PTR_ERR(area);
goto ret_destroy_bufio;
}
r = insert_exceptions(ps, area, callback, callback_context,
&full);
dm_bufio_release(bp);
dm_bufio_forget(client, chunk);
if (unlikely(r))
goto ret_destroy_bufio;
}
ps->current_area--;
skip_metadata(ps);
return 0;
r = 0;
ret_destroy_bufio:
dm_bufio_client_destroy(client);
return r;
}
static struct pstore *get_info(struct dm_exception_store *store)
@ -733,7 +784,7 @@ static int persistent_prepare_merge(struct dm_exception_store *store,
ps->current_committed = ps->exceptions_per_area;
}
read_exception(ps, ps->current_committed - 1, &ce);
read_exception(ps, ps->area, ps->current_committed - 1, &ce);
*last_old_chunk = ce.old_chunk;
*last_new_chunk = ce.new_chunk;
@ -743,8 +794,8 @@ static int persistent_prepare_merge(struct dm_exception_store *store,
*/
for (nr_consecutive = 1; nr_consecutive < ps->current_committed;
nr_consecutive++) {
read_exception(ps, ps->current_committed - 1 - nr_consecutive,
&ce);
read_exception(ps, ps->area,
ps->current_committed - 1 - nr_consecutive, &ce);
if (ce.old_chunk != *last_old_chunk - nr_consecutive ||
ce.new_chunk != *last_new_chunk - nr_consecutive)
break;


@ -610,12 +610,12 @@ static struct dm_exception *dm_lookup_exception(struct dm_exception_table *et,
return NULL;
}
static struct dm_exception *alloc_completed_exception(void)
static struct dm_exception *alloc_completed_exception(gfp_t gfp)
{
struct dm_exception *e;
e = kmem_cache_alloc(exception_cache, GFP_NOIO);
if (!e)
e = kmem_cache_alloc(exception_cache, gfp);
if (!e && gfp == GFP_NOIO)
e = kmem_cache_alloc(exception_cache, GFP_ATOMIC);
return e;
@ -697,7 +697,7 @@ static int dm_add_exception(void *context, chunk_t old, chunk_t new)
struct dm_snapshot *s = context;
struct dm_exception *e;
e = alloc_completed_exception();
e = alloc_completed_exception(GFP_KERNEL);
if (!e)
return -ENOMEM;
@ -1405,7 +1405,7 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
goto out;
}
e = alloc_completed_exception();
e = alloc_completed_exception(GFP_NOIO);
if (!e) {
down_write(&s->lock);
__invalidate_snapshot(s, -ENOMEM);


@ -86,6 +86,7 @@ static const struct sysfs_ops dm_sysfs_ops = {
static struct kobj_type dm_ktype = {
.sysfs_ops = &dm_sysfs_ops,
.default_attrs = dm_attrs,
.release = dm_kobject_release,
};
/*
@ -104,5 +105,7 @@ int dm_sysfs_init(struct mapped_device *md)
*/
void dm_sysfs_exit(struct mapped_device *md)
{
kobject_put(dm_kobject(md));
struct kobject *kobj = dm_kobject(md);
kobject_put(kobj);
wait_for_completion(dm_get_completion_from_kobject(kobj));
}


@ -155,7 +155,6 @@ static int alloc_targets(struct dm_table *t, unsigned int num)
{
sector_t *n_highs;
struct dm_target *n_targets;
int n = t->num_targets;
/*
* Allocate both the target array and offset array at once.
@ -169,12 +168,7 @@ static int alloc_targets(struct dm_table *t, unsigned int num)
n_targets = (struct dm_target *) (n_highs + num);
if (n) {
memcpy(n_highs, t->highs, sizeof(*n_highs) * n);
memcpy(n_targets, t->targets, sizeof(*n_targets) * n);
}
memset(n_highs + n, -1, sizeof(*n_highs) * (num - n));
memset(n_highs, -1, sizeof(*n_highs) * num);
vfree(t->highs);
t->num_allocated = num;
@ -260,17 +254,6 @@ void dm_table_destroy(struct dm_table *t)
kfree(t);
}
/*
* Checks to see if we need to extend highs or targets.
*/
static inline int check_space(struct dm_table *t)
{
if (t->num_targets >= t->num_allocated)
return alloc_targets(t, t->num_allocated * 2);
return 0;
}
/*
* See if we've already got a device in the list.
*/
@ -731,8 +714,7 @@ int dm_table_add_target(struct dm_table *t, const char *type,
return -EINVAL;
}
if ((r = check_space(t)))
return r;
BUG_ON(t->num_targets >= t->num_allocated);
tgt = t->targets + t->num_targets;
memset(tgt, 0, sizeof(*tgt));


@ -1349,6 +1349,12 @@ dm_thin_id dm_thin_dev_id(struct dm_thin_device *td)
return td->id;
}
/*
* Check whether @time (of block creation) is older than @td's last snapshot.
* If so then the associated block is shared with the last snapshot device.
* Any block on a device created *after* the device last got snapshotted is
* necessarily not shared.
*/
static bool __snapshotted_since(struct dm_thin_device *td, uint32_t time)
{
return td->snapshotted_time > time;
@ -1458,6 +1464,20 @@ int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block)
return r;
}
int dm_pool_block_is_used(struct dm_pool_metadata *pmd, dm_block_t b, bool *result)
{
int r;
uint32_t ref_count;
down_read(&pmd->root_lock);
r = dm_sm_get_count(pmd->data_sm, b, &ref_count);
if (!r)
*result = (ref_count != 0);
up_read(&pmd->root_lock);
return r;
}
bool dm_thin_changed_this_transaction(struct dm_thin_device *td)
{
int r;


@ -131,7 +131,7 @@ dm_thin_id dm_thin_dev_id(struct dm_thin_device *td);
struct dm_thin_lookup_result {
dm_block_t block;
unsigned shared:1;
bool shared:1;
};
/*
@ -181,6 +181,8 @@ int dm_pool_get_data_block_size(struct dm_pool_metadata *pmd, sector_t *result);
int dm_pool_get_data_dev_size(struct dm_pool_metadata *pmd, dm_block_t *result);
int dm_pool_block_is_used(struct dm_pool_metadata *pmd, dm_block_t b, bool *result);
/*
* Returns -ENOSPC if the new size is too small and already allocated
* blocks would be lost.


@ -144,6 +144,7 @@ struct pool_features {
bool zero_new_blocks:1;
bool discard_enabled:1;
bool discard_passdown:1;
bool error_if_no_space:1;
};
struct thin_c;
@ -163,8 +164,7 @@ struct pool {
int sectors_per_block_shift;
struct pool_features pf;
unsigned low_water_triggered:1; /* A dm event has been sent */
unsigned no_free_space:1; /* A -ENOSPC warning has been issued */
bool low_water_triggered:1; /* A dm event has been sent */
struct dm_bio_prison *prison;
struct dm_kcopyd_client *copier;
@ -198,7 +198,8 @@ struct pool {
};
static enum pool_mode get_pool_mode(struct pool *pool);
static void set_pool_mode(struct pool *pool, enum pool_mode mode);
static void out_of_data_space(struct pool *pool);
static void metadata_operation_failed(struct pool *pool, const char *op, int r);
/*
* Target context for a pool.
@ -509,15 +510,16 @@ static void remap_and_issue(struct thin_c *tc, struct bio *bio,
struct dm_thin_new_mapping {
struct list_head list;
unsigned quiesced:1;
unsigned prepared:1;
unsigned pass_discard:1;
bool quiesced:1;
bool prepared:1;
bool pass_discard:1;
bool definitely_not_shared:1;
int err;
struct thin_c *tc;
dm_block_t virt_block;
dm_block_t data_block;
struct dm_bio_prison_cell *cell, *cell2;
int err;
/*
* If the bio covers the whole area of a block then we can avoid
@ -534,7 +536,7 @@ static void __maybe_add_mapping(struct dm_thin_new_mapping *m)
struct pool *pool = m->tc->pool;
if (m->quiesced && m->prepared) {
list_add(&m->list, &pool->prepared_mappings);
list_add_tail(&m->list, &pool->prepared_mappings);
wake_worker(pool);
}
}
@ -548,7 +550,7 @@ static void copy_complete(int read_err, unsigned long write_err, void *context)
m->err = read_err || write_err ? -EIO : 0;
spin_lock_irqsave(&pool->lock, flags);
m->prepared = 1;
m->prepared = true;
__maybe_add_mapping(m);
spin_unlock_irqrestore(&pool->lock, flags);
}
@ -563,7 +565,7 @@ static void overwrite_endio(struct bio *bio, int err)
m->err = err;
spin_lock_irqsave(&pool->lock, flags);
m->prepared = 1;
m->prepared = true;
__maybe_add_mapping(m);
spin_unlock_irqrestore(&pool->lock, flags);
}
@ -640,9 +642,7 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
*/
r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block);
if (r) {
DMERR_LIMIT("%s: dm_thin_insert_block() failed: error = %d",
dm_device_name(pool->pool_md), r);
set_pool_mode(pool, PM_READ_ONLY);
metadata_operation_failed(pool, "dm_thin_insert_block", r);
cell_error(pool, m->cell);
goto out;
}
@ -683,7 +683,15 @@ static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
cell_defer_no_holder(tc, m->cell2);
if (m->pass_discard)
remap_and_issue(tc, m->bio, m->data_block);
if (m->definitely_not_shared)
remap_and_issue(tc, m->bio, m->data_block);
else {
bool used = false;
if (dm_pool_block_is_used(tc->pool->pmd, m->data_block, &used) || used)
bio_endio(m->bio, 0);
else
remap_and_issue(tc, m->bio, m->data_block);
}
else
bio_endio(m->bio, 0);
@ -751,13 +759,17 @@ static int ensure_next_mapping(struct pool *pool)
static struct dm_thin_new_mapping *get_next_mapping(struct pool *pool)
{
struct dm_thin_new_mapping *r = pool->next_mapping;
struct dm_thin_new_mapping *m = pool->next_mapping;
BUG_ON(!pool->next_mapping);
memset(m, 0, sizeof(struct dm_thin_new_mapping));
INIT_LIST_HEAD(&m->list);
m->bio = NULL;
pool->next_mapping = NULL;
return r;
return m;
}
static void schedule_copy(struct thin_c *tc, dm_block_t virt_block,
@ -769,18 +781,13 @@ static void schedule_copy(struct thin_c *tc, dm_block_t virt_block,
struct pool *pool = tc->pool;
struct dm_thin_new_mapping *m = get_next_mapping(pool);
INIT_LIST_HEAD(&m->list);
m->quiesced = 0;
m->prepared = 0;
m->tc = tc;
m->virt_block = virt_block;
m->data_block = data_dest;
m->cell = cell;
m->err = 0;
m->bio = NULL;
if (!dm_deferred_set_add_work(pool->shared_read_ds, &m->list))
m->quiesced = 1;
m->quiesced = true;
/*
* IO to pool_dev remaps to the pool target's data_dev.
@ -840,15 +847,12 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
struct pool *pool = tc->pool;
struct dm_thin_new_mapping *m = get_next_mapping(pool);
INIT_LIST_HEAD(&m->list);
m->quiesced = 1;
m->prepared = 0;
m->quiesced = true;
m->prepared = false;
m->tc = tc;
m->virt_block = virt_block;
m->data_block = data_block;
m->cell = cell;
m->err = 0;
m->bio = NULL;
/*
* If the whole block of data is being overwritten or we are not
@ -895,42 +899,43 @@ static int commit(struct pool *pool)
return -EINVAL;
r = dm_pool_commit_metadata(pool->pmd);
if (r) {
DMERR_LIMIT("%s: dm_pool_commit_metadata failed: error = %d",
dm_device_name(pool->pool_md), r);
set_pool_mode(pool, PM_READ_ONLY);
}
if (r)
metadata_operation_failed(pool, "dm_pool_commit_metadata", r);
return r;
}
static void check_low_water_mark(struct pool *pool, dm_block_t free_blocks)
{
unsigned long flags;
if (free_blocks <= pool->low_water_blocks && !pool->low_water_triggered) {
DMWARN("%s: reached low water mark for data device: sending event.",
dm_device_name(pool->pool_md));
spin_lock_irqsave(&pool->lock, flags);
pool->low_water_triggered = true;
spin_unlock_irqrestore(&pool->lock, flags);
dm_table_event(pool->ti->table);
}
}
static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
{
int r;
dm_block_t free_blocks;
unsigned long flags;
struct pool *pool = tc->pool;
/*
* Once no_free_space is set we must not allow allocation to succeed.
* Otherwise it is difficult to explain, debug, test and support.
*/
if (pool->no_free_space)
return -ENOSPC;
if (get_pool_mode(pool) != PM_WRITE)
return -EINVAL;
r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
if (r)
if (r) {
metadata_operation_failed(pool, "dm_pool_get_free_block_count", r);
return r;
if (free_blocks <= pool->low_water_blocks && !pool->low_water_triggered) {
DMWARN("%s: reached low water mark for data device: sending event.",
dm_device_name(pool->pool_md));
spin_lock_irqsave(&pool->lock, flags);
pool->low_water_triggered = 1;
spin_unlock_irqrestore(&pool->lock, flags);
dm_table_event(pool->ti->table);
}
check_low_water_mark(pool, free_blocks);
if (!free_blocks) {
/*
* Try to commit to see if that will free up some
@ -941,35 +946,20 @@ static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
return r;
r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
if (r)
if (r) {
metadata_operation_failed(pool, "dm_pool_get_free_block_count", r);
return r;
}
/*
* If we still have no space we set a flag to avoid
* doing all this checking and return -ENOSPC. This
* flag serves as a latch that disallows allocations from
* this pool until the admin takes action (e.g. resize or
* table reload).
*/
if (!free_blocks) {
DMWARN("%s: no free data space available.",
dm_device_name(pool->pool_md));
spin_lock_irqsave(&pool->lock, flags);
pool->no_free_space = 1;
spin_unlock_irqrestore(&pool->lock, flags);
out_of_data_space(pool);
return -ENOSPC;
}
}
r = dm_pool_alloc_data_block(pool->pmd, result);
if (r) {
if (r == -ENOSPC &&
!dm_pool_get_free_metadata_block_count(pool->pmd, &free_blocks) &&
!free_blocks) {
DMWARN("%s: no free metadata space available.",
dm_device_name(pool->pool_md));
set_pool_mode(pool, PM_READ_ONLY);
}
metadata_operation_failed(pool, "dm_pool_alloc_data_block", r);
return r;
}
@ -992,7 +982,21 @@ static void retry_on_resume(struct bio *bio)
spin_unlock_irqrestore(&pool->lock, flags);
}
static void no_space(struct pool *pool, struct dm_bio_prison_cell *cell)
static void handle_unserviceable_bio(struct pool *pool, struct bio *bio)
{
/*
* When pool is read-only, no cell locking is needed because
* nothing is changing.
*/
WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY);
if (pool->pf.error_if_no_space)
bio_io_error(bio);
else
retry_on_resume(bio);
}
static void retry_bios_on_resume(struct pool *pool, struct dm_bio_prison_cell *cell)
{
struct bio *bio;
struct bio_list bios;
@ -1001,7 +1005,7 @@ static void no_space(struct pool *pool, struct dm_bio_prison_cell *cell)
cell_release(pool, cell, &bios);
while ((bio = bio_list_pop(&bios)))
retry_on_resume(bio);
handle_unserviceable_bio(pool, bio);
}
static void process_discard(struct thin_c *tc, struct bio *bio)
@ -1040,17 +1044,17 @@ static void process_discard(struct thin_c *tc, struct bio *bio)
*/
m = get_next_mapping(pool);
m->tc = tc;
m->pass_discard = (!lookup_result.shared) && pool->pf.discard_passdown;
m->pass_discard = pool->pf.discard_passdown;
m->definitely_not_shared = !lookup_result.shared;
m->virt_block = block;
m->data_block = lookup_result.block;
m->cell = cell;
m->cell2 = cell2;
m->err = 0;
m->bio = bio;
if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list)) {
spin_lock_irqsave(&pool->lock, flags);
list_add(&m->list, &pool->prepared_discards);
list_add_tail(&m->list, &pool->prepared_discards);
spin_unlock_irqrestore(&pool->lock, flags);
wake_worker(pool);
}
@ -1105,13 +1109,12 @@ static void break_sharing(struct thin_c *tc, struct bio *bio, dm_block_t block,
break;
case -ENOSPC:
no_space(pool, cell);
retry_bios_on_resume(pool, cell);
break;
default:
DMERR_LIMIT("%s: alloc_data_block() failed: error = %d",
__func__, r);
set_pool_mode(pool, PM_READ_ONLY);
cell_error(pool, cell);
break;
}
@ -1184,13 +1187,12 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
break;
case -ENOSPC:
no_space(pool, cell);
retry_bios_on_resume(pool, cell);
break;
default:
DMERR_LIMIT("%s: alloc_data_block() failed: error = %d",
__func__, r);
set_pool_mode(pool, PM_READ_ONLY);
cell_error(pool, cell);
break;
}
@ -1257,7 +1259,7 @@ static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
switch (r) {
case 0:
if (lookup_result.shared && (rw == WRITE) && bio->bi_size)
bio_io_error(bio);
handle_unserviceable_bio(tc->pool, bio);
else {
inc_all_io_entry(tc->pool, bio);
remap_and_issue(tc, bio, lookup_result.block);
@ -1266,7 +1268,7 @@ static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
case -ENODATA:
if (rw != READ) {
bio_io_error(bio);
handle_unserviceable_bio(tc->pool, bio);
break;
}
@ -1390,16 +1392,16 @@ static enum pool_mode get_pool_mode(struct pool *pool)
return pool->pf.mode;
}
static void set_pool_mode(struct pool *pool, enum pool_mode mode)
static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
{
int r;
enum pool_mode old_mode = pool->pf.mode;
pool->pf.mode = mode;
switch (mode) {
switch (new_mode) {
case PM_FAIL:
DMERR("%s: switching pool to failure mode",
dm_device_name(pool->pool_md));
if (old_mode != new_mode)
DMERR("%s: switching pool to failure mode",
dm_device_name(pool->pool_md));
dm_pool_metadata_read_only(pool->pmd);
pool->process_bio = process_bio_fail;
pool->process_discard = process_bio_fail;
@ -1408,13 +1410,15 @@ static void set_pool_mode(struct pool *pool, enum pool_mode mode)
break;
case PM_READ_ONLY:
DMERR("%s: switching pool to read-only mode",
dm_device_name(pool->pool_md));
if (old_mode != new_mode)
DMERR("%s: switching pool to read-only mode",
dm_device_name(pool->pool_md));
r = dm_pool_abort_metadata(pool->pmd);
if (r) {
DMERR("%s: aborting transaction failed",
dm_device_name(pool->pool_md));
set_pool_mode(pool, PM_FAIL);
new_mode = PM_FAIL;
set_pool_mode(pool, new_mode);
} else {
dm_pool_metadata_read_only(pool->pmd);
pool->process_bio = process_bio_read_only;
@ -1425,6 +1429,9 @@ static void set_pool_mode(struct pool *pool, enum pool_mode mode)
break;
case PM_WRITE:
if (old_mode != new_mode)
DMINFO("%s: switching pool to write mode",
dm_device_name(pool->pool_md));
dm_pool_metadata_read_write(pool->pmd);
pool->process_bio = process_bio;
pool->process_discard = process_discard;
@ -1432,6 +1439,35 @@ static void set_pool_mode(struct pool *pool, enum pool_mode mode)
pool->process_prepared_discard = process_prepared_discard;
break;
}
pool->pf.mode = new_mode;
}
/*
* Rather than calling set_pool_mode directly, use these which describe the
* reason for mode degradation.
*/
static void out_of_data_space(struct pool *pool)
{
DMERR_LIMIT("%s: no free data space available.",
dm_device_name(pool->pool_md));
set_pool_mode(pool, PM_READ_ONLY);
}
static void metadata_operation_failed(struct pool *pool, const char *op, int r)
{
dm_block_t free_blocks;
DMERR_LIMIT("%s: metadata operation '%s' failed: error = %d",
dm_device_name(pool->pool_md), op, r);
if (r == -ENOSPC &&
!dm_pool_get_free_metadata_block_count(pool->pmd, &free_blocks) &&
!free_blocks)
DMERR_LIMIT("%s: no free metadata space available.",
dm_device_name(pool->pool_md));
set_pool_mode(pool, PM_READ_ONLY);
}
/*----------------------------------------------------------------*/
@ -1538,9 +1574,9 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
if (get_pool_mode(tc->pool) == PM_READ_ONLY) {
/*
* This block isn't provisioned, and we have no way
* of doing so. Just error it.
* of doing so.
*/
bio_io_error(bio);
handle_unserviceable_bio(tc->pool, bio);
return DM_MAPIO_SUBMITTED;
}
/* fall through */
@ -1647,6 +1683,17 @@ static int bind_control_target(struct pool *pool, struct dm_target *ti)
enum pool_mode old_mode = pool->pf.mode;
enum pool_mode new_mode = pt->adjusted_pf.mode;
/*
* Don't change the pool's mode until set_pool_mode() below.
* Otherwise the pool's process_* function pointers may
* not match the desired pool mode.
*/
pt->adjusted_pf.mode = old_mode;
pool->ti = ti;
pool->pf = pt->adjusted_pf;
pool->low_water_blocks = pt->low_water_blocks;
/*
* If we were in PM_FAIL mode, rollback of metadata failed. We're
* not going to recover without a thin_repair. So we never let the
@@ -1657,10 +1704,6 @@ static int bind_control_target(struct pool *pool, struct dm_target *ti)
if (old_mode == PM_FAIL)
new_mode = old_mode;
pool->ti = ti;
pool->low_water_blocks = pt->low_water_blocks;
pool->pf = pt->adjusted_pf;
set_pool_mode(pool, new_mode);
return 0;
@@ -1682,6 +1725,7 @@ static void pool_features_init(struct pool_features *pf)
pf->zero_new_blocks = true;
pf->discard_enabled = true;
pf->discard_passdown = true;
pf->error_if_no_space = false;
}
static void __pool_destroy(struct pool *pool)
@@ -1772,8 +1816,7 @@ static struct pool *pool_create(struct mapped_device *pool_md,
bio_list_init(&pool->deferred_flush_bios);
INIT_LIST_HEAD(&pool->prepared_mappings);
INIT_LIST_HEAD(&pool->prepared_discards);
pool->low_water_triggered = 0;
pool->no_free_space = 0;
pool->low_water_triggered = false;
bio_list_init(&pool->retry_on_resume_list);
pool->shared_read_ds = dm_deferred_set_create();
@@ -1898,7 +1941,7 @@ static int parse_pool_features(struct dm_arg_set *as, struct pool_features *pf,
const char *arg_name;
static struct dm_arg _args[] = {
{0, 3, "Invalid number of pool feature arguments"},
{0, 4, "Invalid number of pool feature arguments"},
};
/*
@@ -1927,6 +1970,9 @@ static int parse_pool_features(struct dm_arg_set *as, struct pool_features *pf,
else if (!strcasecmp(arg_name, "read_only"))
pf->mode = PM_READ_ONLY;
else if (!strcasecmp(arg_name, "error_if_no_space"))
pf->error_if_no_space = true;
else {
ti->error = "Unrecognised pool feature requested";
r = -EINVAL;
@@ -1997,6 +2043,8 @@ static dm_block_t calc_metadata_threshold(struct pool_c *pt)
* skip_block_zeroing: skips the zeroing of newly-provisioned blocks.
* ignore_discard: disable discard
* no_discard_passdown: don't pass discards down to the data device
* read_only: Don't allow any changes to be made to the pool metadata.
* error_if_no_space: error IOs, instead of queueing, if no space.
*/
static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
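As an aside for readers of the constructor documentation above: an illustrative pool table line enabling the new feature might look like the following (editor's example; the device names, length, data block size and low-water mark are placeholders, and the trailing "1" is the feature-argument count):

0 20971520 thin-pool /dev/mapper/meta /dev/mapper/data 128 32768 1 error_if_no_space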
@@ -2192,11 +2240,13 @@ static int maybe_resize_data_dev(struct dm_target *ti, bool *need_commit)
return -EINVAL;
} else if (data_size > sb_data_size) {
if (sb_data_size)
DMINFO("%s: growing the data device from %llu to %llu blocks",
dm_device_name(pool->pool_md),
sb_data_size, (unsigned long long)data_size);
r = dm_pool_resize_data_dev(pool->pmd, data_size);
if (r) {
DMERR("%s: failed to resize data device",
dm_device_name(pool->pool_md));
set_pool_mode(pool, PM_READ_ONLY);
metadata_operation_failed(pool, "dm_pool_resize_data_dev", r);
return r;
}
@@ -2231,10 +2281,12 @@ static int maybe_resize_metadata_dev(struct dm_target *ti, bool *need_commit)
return -EINVAL;
} else if (metadata_dev_size > sb_metadata_dev_size) {
DMINFO("%s: growing the metadata device from %llu to %llu blocks",
dm_device_name(pool->pool_md),
sb_metadata_dev_size, metadata_dev_size);
r = dm_pool_resize_metadata_dev(pool->pmd, metadata_dev_size);
if (r) {
DMERR("%s: failed to resize metadata device",
dm_device_name(pool->pool_md));
metadata_operation_failed(pool, "dm_pool_resize_metadata_dev", r);
return r;
}
@@ -2290,8 +2342,7 @@ static void pool_resume(struct dm_target *ti)
unsigned long flags;
spin_lock_irqsave(&pool->lock, flags);
pool->low_water_triggered = 0;
pool->no_free_space = 0;
pool->low_water_triggered = false;
__requeue_bios(pool);
spin_unlock_irqrestore(&pool->lock, flags);
@@ -2510,7 +2561,8 @@ static void emit_flags(struct pool_features *pf, char *result,
unsigned sz, unsigned maxlen)
{
unsigned count = !pf->zero_new_blocks + !pf->discard_enabled +
!pf->discard_passdown + (pf->mode == PM_READ_ONLY);
!pf->discard_passdown + (pf->mode == PM_READ_ONLY) +
pf->error_if_no_space;
DMEMIT("%u ", count);
if (!pf->zero_new_blocks)
@@ -2524,6 +2576,9 @@ static void emit_flags(struct pool_features *pf, char *result,
if (pf->mode == PM_READ_ONLY)
DMEMIT("read_only ");
if (pf->error_if_no_space)
DMEMIT("error_if_no_space ");
}
/*
@@ -2618,11 +2673,16 @@ static void pool_status(struct dm_target *ti, status_type_t type,
DMEMIT("rw ");
if (!pool->pf.discard_enabled)
DMEMIT("ignore_discard");
DMEMIT("ignore_discard ");
else if (pool->pf.discard_passdown)
DMEMIT("discard_passdown");
DMEMIT("discard_passdown ");
else
DMEMIT("no_discard_passdown");
DMEMIT("no_discard_passdown ");
if (pool->pf.error_if_no_space)
DMEMIT("error_if_no_space ");
else
DMEMIT("queue_if_no_space ");
break;
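With the additional DMEMITs above, the tail of a pool's `dmsetup status` line now reports both the discard policy and the no-space policy, for example (illustrative; only the fields emitted in this hunk are shown):

... rw no_discard_passdown queue_if_no_space

where queue_if_no_space is the default and error_if_no_space appears when the feature was requested at pool creation or reload.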
@@ -2721,7 +2781,7 @@ static struct target_type pool_target = {
.name = "thin-pool",
.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
DM_TARGET_IMMUTABLE,
.version = {1, 9, 0},
.version = {1, 10, 0},
.module = THIS_MODULE,
.ctr = pool_ctr,
.dtr = pool_dtr,
@@ -2899,7 +2959,7 @@ static int thin_endio(struct dm_target *ti, struct bio *bio, int err)
spin_lock_irqsave(&pool->lock, flags);
list_for_each_entry_safe(m, tmp, &work, list) {
list_del(&m->list);
m->quiesced = 1;
m->quiesced = true;
__maybe_add_mapping(m);
}
spin_unlock_irqrestore(&pool->lock, flags);
@@ -2911,7 +2971,7 @@ static int thin_endio(struct dm_target *ti, struct bio *bio, int err)
if (!list_empty(&work)) {
spin_lock_irqsave(&pool->lock, flags);
list_for_each_entry_safe(m, tmp, &work, list)
list_add(&m->list, &pool->prepared_discards);
list_add_tail(&m->list, &pool->prepared_discards);
spin_unlock_irqrestore(&pool->lock, flags);
wake_worker(pool);
}
@@ -3008,7 +3068,7 @@ static int thin_iterate_devices(struct dm_target *ti,
static struct target_type thin_target = {
.name = "thin",
.version = {1, 9, 0},
.version = {1, 10, 0},
.module = THIS_MODULE,
.ctr = thin_ctr,
.dtr = thin_dtr,


@@ -200,8 +200,8 @@ struct mapped_device {
/* forced geometry settings */
struct hd_geometry geometry;
/* sysfs handle */
struct kobject kobj;
/* kobject and completion */
struct dm_kobject_holder kobj_holder;
/* zero-length flush that will be cloned and submitted to targets */
struct bio flush_bio;
@@ -2041,6 +2041,7 @@ static struct mapped_device *alloc_dev(int minor)
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
init_waitqueue_head(&md->eventq);
init_completion(&md->kobj_holder.completion);
md->disk->major = _major;
md->disk->first_minor = minor;
@@ -2902,20 +2903,14 @@ struct gendisk *dm_disk(struct mapped_device *md)
struct kobject *dm_kobject(struct mapped_device *md)
{
return &md->kobj;
return &md->kobj_holder.kobj;
}
/*
* struct mapped_device should not be exported outside of dm.c
* so use this check to verify that kobj is part of md structure
*/
struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
{
struct mapped_device *md;
md = container_of(kobj, struct mapped_device, kobj);
if (&md->kobj != kobj)
return NULL;
md = container_of(kobj, struct mapped_device, kobj_holder.kobj);
if (test_bit(DMF_FREEING, &md->flags) ||
dm_deleting_md(md))


@@ -15,6 +15,8 @@
#include <linux/list.h>
#include <linux/blkdev.h>
#include <linux/hdreg.h>
#include <linux/completion.h>
#include <linux/kobject.h>
#include "dm-stats.h"
@@ -148,11 +150,26 @@ void dm_interface_exit(void);
/*
* sysfs interface
*/
struct dm_kobject_holder {
struct kobject kobj;
struct completion completion;
};
static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
{
return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
}
int dm_sysfs_init(struct mapped_device *md);
void dm_sysfs_exit(struct mapped_device *md);
struct kobject *dm_kobject(struct mapped_device *md);
struct mapped_device *dm_get_from_kobject(struct kobject *kobj);
/*
* The kobject helper
*/
void dm_kobject_release(struct kobject *kobj);
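The holder above pairs the mapped device's embedded kobject with a completion so that teardown can wait until the sysfs release callback has actually run. A minimal sketch of the intended pairing (editor's illustration only; example_release() and example_teardown() are placeholder names, the real hook being the dm_kobject_release() declared above):

/* The ktype's release callback signals the embedded completion. */
static void example_release(struct kobject *kobj)
{
	complete(dm_get_completion_from_kobject(kobj));
}

/* Teardown drops its reference, then waits until release has run. */
static void example_teardown(struct mapped_device *md)
{
	struct kobject *kobj = dm_kobject(md);

	kobject_put(kobj);
	wait_for_completion(dm_get_completion_from_kobject(kobj));
}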
/*
* Targets for linear and striped mappings
*/


@@ -104,7 +104,7 @@ static int __check_holder(struct block_lock *lock)
for (i = 0; i < MAX_HOLDERS; i++) {
if (lock->holders[i] == current) {
DMERR("recursive lock detected in pool metadata");
DMERR("recursive lock detected in metadata");
#ifdef CONFIG_DM_DEBUG_BLOCK_STACK_TRACING
DMERR("previously held here:");
print_stack_trace(lock->traces + i, 4);


@@ -770,8 +770,8 @@ EXPORT_SYMBOL_GPL(dm_btree_insert_notify);
/*----------------------------------------------------------------*/
static int find_highest_key(struct ro_spine *s, dm_block_t block,
uint64_t *result_key, dm_block_t *next_block)
static int find_key(struct ro_spine *s, dm_block_t block, bool find_highest,
uint64_t *result_key, dm_block_t *next_block)
{
int i, r;
uint32_t flags;
@@ -788,7 +788,11 @@ static int find_highest_key(struct ro_spine *s, dm_block_t block,
else
i--;
*result_key = le64_to_cpu(ro_node(s)->keys[i]);
if (find_highest)
*result_key = le64_to_cpu(ro_node(s)->keys[i]);
else
*result_key = le64_to_cpu(ro_node(s)->keys[0]);
if (next_block || flags & INTERNAL_NODE)
block = value64(ro_node(s), i);
@@ -799,16 +803,16 @@ static int find_highest_key(struct ro_spine *s, dm_block_t block,
return 0;
}
int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
uint64_t *result_keys)
static int dm_btree_find_key(struct dm_btree_info *info, dm_block_t root,
bool find_highest, uint64_t *result_keys)
{
int r = 0, count = 0, level;
struct ro_spine spine;
init_ro_spine(&spine, info);
for (level = 0; level < info->levels; level++) {
r = find_highest_key(&spine, root, result_keys + level,
level == info->levels - 1 ? NULL : &root);
r = find_key(&spine, root, find_highest, result_keys + level,
level == info->levels - 1 ? NULL : &root);
if (r == -ENODATA) {
r = 0;
break;
@@ -822,8 +826,23 @@ int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
return r ? r : count;
}
int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
uint64_t *result_keys)
{
return dm_btree_find_key(info, root, true, result_keys);
}
EXPORT_SYMBOL_GPL(dm_btree_find_highest_key);
int dm_btree_find_lowest_key(struct dm_btree_info *info, dm_block_t root,
uint64_t *result_keys)
{
return dm_btree_find_key(info, root, false, result_keys);
}
EXPORT_SYMBOL_GPL(dm_btree_find_lowest_key);
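A short usage sketch (editor's illustration; example_key_range() is a hypothetical caller): for a single-level btree the pair of lookups yields the key range, and a return of 0 means the tree is empty so no key was filled in:

static int example_key_range(struct dm_btree_info *info, dm_block_t root,
			     uint64_t *lo, uint64_t *hi)
{
	int r;

	r = dm_btree_find_lowest_key(info, root, lo);
	if (r <= 0)
		return r;	/* error, or empty tree */

	r = dm_btree_find_highest_key(info, root, hi);
	return r <= 0 ? r : 0;
}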
/*----------------------------------------------------------------*/
/*
* FIXME: We shouldn't use a recursive algorithm when we have limited stack
* space. Also this only works for single level trees.


@@ -134,6 +134,14 @@ int dm_btree_insert_notify(struct dm_btree_info *info, dm_block_t root,
int dm_btree_remove(struct dm_btree_info *info, dm_block_t root,
uint64_t *keys, dm_block_t *new_root);
/*
* Returns < 0 on failure. Otherwise the number of key entries that have
* been filled out. Remember trees can have zero entries, and as such have
* no lowest key.
*/
int dm_btree_find_lowest_key(struct dm_btree_info *info, dm_block_t root,
uint64_t *result_keys);
/*
* Returns < 0 on failure. Otherwise the number of key entries that have
* been filled out. Remember trees can have zero entries, and as such have


@@ -245,6 +245,10 @@ int sm_ll_extend(struct ll_disk *ll, dm_block_t extra_blocks)
return -EINVAL;
}
/*
* We need to set this before the dm_tm_new_block() call below.
*/
ll->nr_blocks = nr_blocks;
for (i = old_blocks; i < blocks; i++) {
struct dm_block *b;
struct disk_index_entry idx;
@@ -252,6 +256,7 @@ int sm_ll_extend(struct ll_disk *ll, dm_block_t extra_blocks)
r = dm_tm_new_block(ll->tm, &dm_sm_bitmap_validator, &b);
if (r < 0)
return r;
idx.blocknr = cpu_to_le64(dm_block_location(b));
r = dm_tm_unlock(ll->tm, b);
@@ -266,7 +271,6 @@ int sm_ll_extend(struct ll_disk *ll, dm_block_t extra_blocks)
return r;
}
ll->nr_blocks = nr_blocks;
return 0;
}


@@ -385,13 +385,13 @@ static int sm_metadata_new_block(struct dm_space_map *sm, dm_block_t *b)
int r = sm_metadata_new_block_(sm, b);
if (r) {
DMERR("unable to allocate new metadata block");
DMERR_LIMIT("unable to allocate new metadata block");
return r;
}
r = sm_metadata_get_nr_free(sm, &count);
if (r) {
DMERR("couldn't get free block count");
DMERR_LIMIT("couldn't get free block count");
return r;
}
@@ -608,20 +608,38 @@ static int sm_metadata_extend(struct dm_space_map *sm, dm_block_t extra_blocks)
* Flick into a mode where all blocks get allocated in the new area.
*/
smm->begin = old_len;
memcpy(&smm->sm, &bootstrap_ops, sizeof(smm->sm));
memcpy(sm, &bootstrap_ops, sizeof(*sm));
/*
* Extend.
*/
r = sm_ll_extend(&smm->ll, extra_blocks);
if (r)
goto out;
/*
* We repeatedly increment then commit until the commit doesn't
* allocate any new blocks.
*/
do {
for (i = old_len; !r && i < smm->begin; i++) {
r = sm_ll_inc(&smm->ll, i, &ev);
if (r)
goto out;
}
old_len = smm->begin;
r = sm_ll_commit(&smm->ll);
if (r)
goto out;
} while (old_len != smm->begin);
out:
/*
* Switch back to normal behaviour.
*/
memcpy(&smm->sm, &ops, sizeof(smm->sm));
for (i = old_len; !r && i < smm->begin; i++)
r = sm_ll_inc(&smm->ll, i, &ev);
memcpy(sm, &ops, sizeof(*sm));
return r;
}


@@ -201,11 +201,18 @@
* int (*flush)(struct dm_dirty_log *log);
*
* Payload-to-userspace:
* None.
* If the 'integrated_flush' directive is present in the constructor
* table, the payload is the same as for DM_ULOG_MARK_REGION:
* uint64_t [] - region(s) to mark
* else
* None
* Payload-to-kernel:
* None.
*
* No incoming or outgoing payload. Simply flush log state to disk.
* If the 'integrated_flush' option was used during the creation of the
* log, mark region requests are carried as payload in the flush request.
* Piggybacking the mark requests in this way allows for fewer communications
* between kernel and userspace.
*
* When the request has been processed, user-space must return the
* dm_ulog_request to the kernel - setting the 'error' field and clearing
@@ -385,8 +392,15 @@
* version 2: DM_ULOG_CTR allowed to return a string containing a
* device name that is to be registered with DM via
* 'dm_get_device'.
* version 3: DM_ULOG_FLUSH is capable of carrying payload for marking
* regions. This "integrated flush" reduces the number of
* requests between the kernel and userspace by effectively
* merging 'mark' and 'flush' requests. A constructor table
* argument ('integrated_flush') is required to turn this
* feature on, so it is backwards compatible with older
* userspace versions.
*/
#define DM_ULOG_REQUEST_VERSION 2
#define DM_ULOG_REQUEST_VERSION 3
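To make the piggybacking concrete, here is an editor's sketch of how a userspace log daemon might service DM_ULOG_FLUSH when 'integrated_flush' is enabled. It assumes the usual data_size/data fields of struct dm_ulog_request (only partially shown in this excerpt); mark_region() and flush_log() stand in for the daemon's own bookkeeping and sync-to-disk routines:

#include <stdint.h>
#include <string.h>
#include <linux/dm-log-userspace.h>

/* Provided elsewhere by the daemon (placeholders for this sketch). */
extern int mark_region(uint64_t luid, uint64_t region);
extern int flush_log(uint64_t luid);

static int example_handle_flush(struct dm_ulog_request *rq)
{
	uint64_t region;
	uint32_t i, count = rq->data_size / sizeof(uint64_t);

	/* Any payload consists of piggybacked mark requests. */
	for (i = 0; i < count; i++) {
		memcpy(&region, rq->data + i * sizeof(uint64_t),
		       sizeof(region));
		mark_region(rq->luid, region);
	}

	/* Then flush log state to disk as for a plain DM_ULOG_FLUSH. */
	return flush_log(rq->luid);
}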
struct dm_ulog_request {
/*