// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2007 Oracle. All rights reserved.
 */

#include <linux/sched.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/buffer_head.h>
#include <linux/blkdev.h>
#include <linux/ratelimit.h>
#include <linux/kthread.h>
#include <linux/raid/pq.h>
#include <linux/semaphore.h>
#include <linux/uuid.h>
#include <linux/list_sort.h>
#include "ctree.h"
#include "extent_map.h"
#include "disk-io.h"
#include "transaction.h"
#include "print-tree.h"
#include "volumes.h"
#include "raid56.h"
#include "async-thread.h"
#include "check-integrity.h"
#include "rcu-string.h"
#include "math.h"
#include "dev-replace.h"
#include "sysfs.h"

const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
	[BTRFS_RAID_RAID10] = {
		.sub_stripes	= 2,
		.dev_stripes	= 1,
		.devs_max	= 0,	/* 0 == as many as possible */
		.devs_min	= 4,
		.tolerated_failures = 1,
		.devs_increment	= 2,
		.ncopies	= 2,
		.raid_name	= "raid10",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID10,
		.mindev_error	= BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
	},
	[BTRFS_RAID_RAID1] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 2,
		.devs_min	= 2,
		.tolerated_failures = 1,
		.devs_increment	= 2,
		.ncopies	= 2,
		.raid_name	= "raid1",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1,
		.mindev_error	= BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
	},
	[BTRFS_RAID_DUP] = {
		.sub_stripes	= 1,
		.dev_stripes	= 2,
		.devs_max	= 1,
		.devs_min	= 1,
		.tolerated_failures = 0,
		.devs_increment	= 1,
		.ncopies	= 2,
		.raid_name	= "dup",
		.bg_flag	= BTRFS_BLOCK_GROUP_DUP,
		.mindev_error	= 0,
	},
	[BTRFS_RAID_RAID0] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 0,
		.devs_min	= 2,
		.tolerated_failures = 0,
		.devs_increment	= 1,
		.ncopies	= 1,
		.raid_name	= "raid0",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID0,
		.mindev_error	= 0,
	},
	[BTRFS_RAID_SINGLE] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 1,
		.devs_min	= 1,
		.tolerated_failures = 0,
		.devs_increment	= 1,
		.ncopies	= 1,
		.raid_name	= "single",
		.bg_flag	= 0,
		.mindev_error	= 0,
	},
	[BTRFS_RAID_RAID5] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 0,
		.devs_min	= 2,
		.tolerated_failures = 1,
		.devs_increment	= 1,
		.ncopies	= 1,
		.raid_name	= "raid5",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID5,
		.mindev_error	= BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
	},
	[BTRFS_RAID_RAID6] = {
		.sub_stripes	= 1,
		.dev_stripes	= 1,
		.devs_max	= 0,
		.devs_min	= 3,
		.tolerated_failures = 2,
		.devs_increment	= 1,
		.ncopies	= 1,
		.raid_name	= "raid6",
		.bg_flag	= BTRFS_BLOCK_GROUP_RAID6,
		.mindev_error	= BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
	},
};

const char *get_raid_name(enum btrfs_raid_types type)
{
	if (type >= BTRFS_NR_RAID_TYPES)
		return NULL;

	return btrfs_raid_array[type].raid_name;
}
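
/*
 * Illustrative sketch, not part of the original file: a userspace imitation
 * of the bounds-checked table lookup done by get_raid_name(). All demo_ and
 * DEMO_ names are hypothetical stand-ins, and the table holds only three
 * entries with a subset of the fields, assuming the same designated
 * initializer layout as btrfs_raid_array.
 */

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel enum and table; not the real types. */
enum demo_raid_type {
	DEMO_RAID_RAID10,
	DEMO_RAID_RAID1,
	DEMO_RAID_SINGLE,
	DEMO_NR_RAID_TYPES,
};

struct demo_raid_attr {
	int devs_min;
	int ncopies;
	const char *raid_name;
};

static const struct demo_raid_attr demo_raid_array[DEMO_NR_RAID_TYPES] = {
	[DEMO_RAID_RAID10] = { .devs_min = 4, .ncopies = 2, .raid_name = "raid10" },
	[DEMO_RAID_RAID1]  = { .devs_min = 2, .ncopies = 2, .raid_name = "raid1" },
	[DEMO_RAID_SINGLE] = { .devs_min = 1, .ncopies = 1, .raid_name = "single" },
};

/* Return the profile name, or NULL when the index is out of range. */
static const char *demo_get_raid_name(unsigned int type)
{
	if (type >= DEMO_NR_RAID_TYPES)
		return NULL;

	return demo_raid_array[type].raid_name;
}
```

/*
 * Callers are expected to check for NULL before printing the name, since an
 * out-of-range index is reported that way rather than by a fallback string.
 */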

static int init_first_rw_device(struct btrfs_trans_handle *trans,
				struct btrfs_fs_info *fs_info);
static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
static void __btrfs_reset_dev_stats(struct btrfs_device *dev);
static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev);
static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
			     enum btrfs_map_op op,
			     u64 logical, u64 *length,
			     struct btrfs_bio **bbio_ret,
			     int mirror_num, int need_raid_map);

/*
 * Device locking
 * ==============
 *
 * There are several mutexes that protect manipulation of devices and low-level
 * structures like chunks but not block groups, extents or files
 *
 * uuid_mutex (global lock)
 * ------------------------
 * protects the fs_uuids list that tracks all per-fs fs_devices, resulting from
 * the SCAN_DEV ioctl registration or from mount either implicitly (the first
 * device) or requested by the device= mount option
 *
 * the mutex can be very coarse and can cover long-running operations
 *
 * protects: updates to fs_devices counters like missing devices, rw devices,
 * seeding, structure cloning, opening/closing devices at mount/umount time
 *
 * global::fs_devs - add, remove, updates to the global list
 *
 * does not protect: manipulation of the fs_devices::devices list!
 *
 * btrfs_device::name - renames (write side), read is RCU
 *
 * fs_devices::device_list_mutex (per-fs, with RCU)
 * ------------------------------------------------
 * protects updates to fs_devices::devices, i.e. adding and deleting
 *
 * simple list traversal with read-only actions can be done with RCU protection
 *
 * may be used to exclude some operations from running concurrently without any
 * modifications to the list (see write_all_supers)
 *
 * balance_mutex
 * -------------
 * protects balance structures (status, state) and context accessed from
 * several places (internally, ioctl)
 *
 * chunk_mutex
 * -----------
 * protects chunks, adding or removing during allocation, trim or when a new
 * device is added/removed
 *
 * cleaner_mutex
 * -------------
 * a big lock that is held by the cleaner thread and prevents running subvolume
 * cleaning together with relocation or delayed iputs
 *
 *
 * Lock nesting
 * ============
 *
 * uuid_mutex
 *   volume_mutex
 *     device_list_mutex
 *       chunk_mutex
 *     balance_mutex
 *
 *
 * Exclusive operations, BTRFS_FS_EXCL_OP
 * ======================================
 *
 * Maintains the exclusivity of the following operations that apply to the
 * whole filesystem and cannot run in parallel.
 *
 * - Balance (*)
 * - Device add
 * - Device remove
 * - Device replace (*)
 * - Resize
 *
 * The device operations (as above) can be in one of the following states:
 *
 * - Running state
 * - Paused state
 * - Completed state
 *
 * Only device operations marked with (*) can go into the Paused state for the
 * following reasons:
 *
 * - ioctl (only Balance can be Paused through ioctl)
 * - filesystem remounted as read-only
 * - filesystem unmounted and mounted as read-only
 * - system power-cycle and filesystem mounted as read-only
 * - filesystem or device errors leading to forced read-only
 *
 * BTRFS_FS_EXCL_OP flag is set and cleared using atomic operations.
 * During the course of Paused state, the BTRFS_FS_EXCL_OP remains set.
 * A device operation in Paused or Running state can be canceled or resumed
 * either by ioctl (Balance only) or when remounted as read-write.
 * BTRFS_FS_EXCL_OP flag is cleared when the device operation is canceled or
 * completed.
 */
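
/*
 * Illustrative sketch, not part of the original file: how a single flag that
 * is set and cleared atomically can serialize whole-filesystem operations, as
 * the BTRFS_FS_EXCL_OP section above describes. This is userspace C11 code.
 * struct demo_fs_info and the demo_excl_op_*() helpers are hypothetical, and
 * atomic_flag is swapped in here for the kernel's bit operations.
 */

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for fs_info->flags and BTRFS_FS_EXCL_OP. */
struct demo_fs_info {
	atomic_flag excl_op;
};

#define DEMO_FS_INFO_INIT { ATOMIC_FLAG_INIT }

/* Try to claim the exclusive-operation slot; true means the caller owns it. */
static bool demo_excl_op_begin(struct demo_fs_info *fs_info)
{
	/* test-and-set returns the previous value: false means it was free */
	return !atomic_flag_test_and_set(&fs_info->excl_op);
}

/* Release the slot once the operation is completed or canceled. */
static void demo_excl_op_end(struct demo_fs_info *fs_info)
{
	atomic_flag_clear(&fs_info->excl_op);
}
```

/*
 * A paused operation would simply not call demo_excl_op_end(), which keeps
 * competing operations out until it is resumed or canceled.
 */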

DEFINE_MUTEX(uuid_mutex);
static LIST_HEAD(fs_uuids);
struct list_head *btrfs_get_fs_uuids(void)
{
	return &fs_uuids;
}

/*
 * alloc_fs_devices - allocate struct btrfs_fs_devices
 * @fsid:	if not NULL, copy the uuid to fs_devices::fsid
 *
 * Return a pointer to a new struct btrfs_fs_devices on success, or ERR_PTR().
 * The returned struct is not linked onto any lists and can be destroyed with
 * kfree() right away.
 */
static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid)
{
	struct btrfs_fs_devices *fs_devs;

	fs_devs = kzalloc(sizeof(*fs_devs), GFP_KERNEL);
	if (!fs_devs)
		return ERR_PTR(-ENOMEM);

	mutex_init(&fs_devs->device_list_mutex);

	INIT_LIST_HEAD(&fs_devs->devices);
	INIT_LIST_HEAD(&fs_devs->resized_devices);
	INIT_LIST_HEAD(&fs_devs->alloc_list);
	INIT_LIST_HEAD(&fs_devs->fs_list);
	if (fsid)
		memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);

	return fs_devs;
}
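
/*
 * Illustrative sketch, not part of the original file: the "return a pointer
 * or an encoded errno" convention used by alloc_fs_devices(), imitated in
 * userspace. The demo_err_ptr(), demo_is_err() and demo_ptr_err() helpers
 * are hypothetical re-creations of the idea behind ERR_PTR(), IS_ERR() and
 * PTR_ERR(), assuming errno values fit in the top 4095 addresses. They are
 * not the kernel's definitions.
 */

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_ERRNO	4095
#define DEMO_FSID_SIZE	16

/* Encode a negative errno value in a pointer. */
static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True when the pointer lies in the range reserved for encoded errors. */
static inline int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

/* Recover the negative errno value from an encoded pointer. */
static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

struct demo_fs_devices {
	unsigned char fsid[DEMO_FSID_SIZE];
	int opened;
};

/* Allocate a zeroed structure; copy the fsid only when one is supplied. */
static struct demo_fs_devices *demo_alloc_fs_devices(const unsigned char *fsid)
{
	struct demo_fs_devices *fs_devs = calloc(1, sizeof(*fs_devs));

	if (!fs_devs)
		return demo_err_ptr(-ENOMEM);
	if (fsid)
		memcpy(fs_devs->fsid, fsid, DEMO_FSID_SIZE);
	return fs_devs;
}
```

/*
 * The caller tests the result with demo_is_err() rather than comparing it
 * against NULL, and may release a successful result with free() right away.
 */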

void btrfs_free_device(struct btrfs_device *device)
{
	rcu_string_free(device->name);
	bio_put(device->flush_bio);
	kfree(device);
}

static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
{
	struct btrfs_device *device;
	WARN_ON(fs_devices->opened);
	while (!list_empty(&fs_devices->devices)) {
		device = list_entry(fs_devices->devices.next,
				    struct btrfs_device, dev_list);
		list_del(&device->dev_list);
		btrfs_free_device(device);
	}
	kfree(fs_devices);
}

static void btrfs_kobject_uevent(struct block_device *bdev,
				 enum kobject_action action)
{
	int ret;

	ret = kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, action);
	if (ret)
		pr_warn("BTRFS: Sending event '%d' to kobject: '%s' (%p): failed\n",
			action,
			kobject_name(&disk_to_dev(bdev->bd_disk)->kobj),
			&disk_to_dev(bdev->bd_disk)->kobj);
}

void __exit btrfs_cleanup_fs_uuids(void)
{
	struct btrfs_fs_devices *fs_devices;

	while (!list_empty(&fs_uuids)) {
		fs_devices = list_entry(fs_uuids.next,
					struct btrfs_fs_devices, fs_list);
		list_del(&fs_devices->fs_list);
		free_fs_devices(fs_devices);
	}
}

/*
 * Returns a pointer to a new btrfs_device on success; ERR_PTR() on error.
 * Returned struct is not linked onto any lists and must be destroyed using
 * btrfs_free_device.
 */
static struct btrfs_device *__alloc_device(void)
{
	struct btrfs_device *dev;

	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
	if (!dev)
		return ERR_PTR(-ENOMEM);

	/*
	 * Preallocate a bio that's always going to be used for flushing device
	 * barriers and matches the device lifespan
	 */
	dev->flush_bio = bio_alloc_bioset(GFP_KERNEL, 0, NULL);
	if (!dev->flush_bio) {
		kfree(dev);
		return ERR_PTR(-ENOMEM);
	}

	INIT_LIST_HEAD(&dev->dev_list);
	INIT_LIST_HEAD(&dev->dev_alloc_list);
	INIT_LIST_HEAD(&dev->resized_list);

	spin_lock_init(&dev->io_lock);

	atomic_set(&dev->reada_in_flight, 0);
	atomic_set(&dev->dev_stats_ccnt, 0);
	btrfs_device_data_ordered_init(dev);
	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);

	return dev;
}

/*
 * Find a device specified by @devid or @uuid in the list of @fs_devices, or
 * return NULL.
 *
 * If devid and uuid are both specified, the match must be exact, otherwise
 * only devid is used.
 */
static struct btrfs_device *find_device(struct btrfs_fs_devices *fs_devices,
		u64 devid, const u8 *uuid)
{
	struct btrfs_device *dev;

	list_for_each_entry(dev, &fs_devices->devices, dev_list) {
		if (dev->devid == devid &&
		    (!uuid || !memcmp(dev->uuid, uuid, BTRFS_UUID_SIZE))) {
			return dev;
		}
	}
	return NULL;
}
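
/*
 * Illustrative sketch, not part of the original file: the matching rule of
 * find_device() in userspace, where the devid must always match and the uuid
 * is compared only when the caller passes one. struct demo_device and
 * demo_find_device() are hypothetical, and a plain array is used here in
 * place of the kernel's linked list.
 */

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_UUID_SIZE 16

struct demo_device {
	uint64_t devid;
	unsigned char uuid[DEMO_UUID_SIZE];
};

/* Linear scan: devid must match; uuid is compared only when non-NULL. */
static struct demo_device *demo_find_device(struct demo_device *devs, size_t n,
					    uint64_t devid,
					    const unsigned char *uuid)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i].devid == devid &&
		    (!uuid || !memcmp(devs[i].uuid, uuid, DEMO_UUID_SIZE)))
			return &devs[i];
	}
	return NULL;
}
```

/*
 * Passing a NULL uuid therefore acts as a wildcard, while a non-NULL uuid
 * that differs from the stored one makes the lookup fail even when the devid
 * matches.
 */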

static noinline struct btrfs_fs_devices *find_fsid(u8 *fsid)
{
	struct btrfs_fs_devices *fs_devices;

	list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
		if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) == 0)
			return fs_devices;
	}
	return NULL;
}

static int
btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder,
		      int flush, struct block_device **bdev,
		      struct buffer_head **bh)
{
	int ret;

	*bdev = blkdev_get_by_path(device_path, flags, holder);

	if (IS_ERR(*bdev)) {
		ret = PTR_ERR(*bdev);
		goto error;
	}

	if (flush)
		filemap_write_and_wait((*bdev)->bd_inode->i_mapping);
	ret = set_blocksize(*bdev, BTRFS_BDEV_BLOCKSIZE);
	if (ret) {
		blkdev_put(*bdev, flags);
		goto error;
	}
	invalidate_bdev(*bdev);
	*bh = btrfs_read_dev_super(*bdev);
	if (IS_ERR(*bh)) {
		ret = PTR_ERR(*bh);
		blkdev_put(*bdev, flags);
		goto error;
	}

	return 0;

error:
	*bdev = NULL;
	*bh = NULL;
	return ret;
}

static void requeue_list(struct btrfs_pending_bios *pending_bios,
			struct bio *head, struct bio *tail)
{
	struct bio *old_head;

	old_head = pending_bios->head;
	pending_bios->head = head;
	if (pending_bios->tail)
		tail->bi_next = old_head;
	else
		pending_bios->tail = tail;
}
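
/*
 * Illustrative sketch, not part of the original file: the splice performed by
 * requeue_list(), shown on a minimal singly linked list in userspace. struct
 * demo_bio, struct demo_pending and demo_requeue_list() are hypothetical
 * stand-ins for struct bio, struct btrfs_pending_bios and the function above.
 */

```c
#include <stddef.h>

struct demo_bio {
	int id;
	struct demo_bio *next;
};

struct demo_pending {
	struct demo_bio *head;
	struct demo_bio *tail;
};

/*
 * Put the unprocessed chain head..tail back in front of whatever was queued
 * in the meantime, so the original submission order is preserved.
 */
static void demo_requeue_list(struct demo_pending *pending,
			      struct demo_bio *head, struct demo_bio *tail)
{
	struct demo_bio *old_head = pending->head;

	pending->head = head;
	if (pending->tail)
		tail->next = old_head;	/* list was non-empty: splice in front */
	else
		pending->tail = tail;	/* list was empty: chain becomes the list */
}
```

/*
 * As in the kernel function, the caller is assumed to hold the lock that
 * protects the pending list while the splice runs.
 */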

/*
 * we try to collect pending bios for a device so we don't get a large
 * number of procs sending bios down to the same device. This greatly
 * improves the scheduler's ability to collect and merge the bios.
 *
 * But, it also turns into a long list of bios to process and that is sure
 * to eventually make the worker thread block. The solution here is to
 * make some progress and then put this work struct back at the end of
 * the list if the block device is congested. This way, multiple devices
 * can make progress from a single worker thread.
 */
static noinline void run_scheduled_bios(struct btrfs_device *device)
{
	struct btrfs_fs_info *fs_info = device->fs_info;
	struct bio *pending;
	struct backing_dev_info *bdi;
	struct btrfs_pending_bios *pending_bios;
	struct bio *tail;
	struct bio *cur;
	int again = 0;
	unsigned long num_run;
	unsigned long batch_run = 0;
	unsigned long last_waited = 0;
	int force_reg = 0;
	int sync_pending = 0;
	struct blk_plug plug;

	/*
	 * this function runs all the bios we've collected for
	 * a particular device. We don't want to wander off to
	 * another device without first sending all of these down.
	 * So, setup a plug here and finish it off before we return
	 */
	blk_start_plug(&plug);

	bdi = device->bdev->bd_bdi;

loop:
	spin_lock(&device->io_lock);

loop_lock:
	num_run = 0;

	/* take all the bios off the list at once and process them
	 * later on (without the lock held). But, remember the
	 * tail and other pointers so the bios can be properly reinserted
	 * into the list if we hit congestion
	 */
	if (!force_reg && device->pending_sync_bios.head) {
		pending_bios = &device->pending_sync_bios;
		force_reg = 1;
	} else {
		pending_bios = &device->pending_bios;
		force_reg = 0;
	}

	pending = pending_bios->head;
	tail = pending_bios->tail;
	WARN_ON(pending && !tail);

	/*
	 * if pending was null this time around, no bios need processing
	 * at all and we can stop. Otherwise it'll loop back up again
	 * and do an additional check so no bios are missed.
	 *
	 * device->running_pending is used to synchronize with the
	 * schedule_bio code.
	 */
	if (device->pending_sync_bios.head == NULL &&
	    device->pending_bios.head == NULL) {
		again = 0;
		device->running_pending = 0;
	} else {
		again = 1;
		device->running_pending = 1;
	}

	pending_bios->head = NULL;
	pending_bios->tail = NULL;

	spin_unlock(&device->io_lock);

	while (pending) {

		rmb();
		/* we want to work on both lists, but do more bios on the
		 * sync list than the regular list
		 */
		if ((num_run > 32 &&
		    pending_bios != &device->pending_sync_bios &&
		    device->pending_sync_bios.head) ||
		   (num_run > 64 && pending_bios == &device->pending_sync_bios &&
		    device->pending_bios.head)) {
			spin_lock(&device->io_lock);
			requeue_list(pending_bios, pending, tail);
			goto loop_lock;
		}

		cur = pending;
		pending = pending->bi_next;
		cur->bi_next = NULL;

		BUG_ON(atomic_read(&cur->__bi_cnt) == 0);

		/*
		 * if we're doing the sync list, record that our
		 * plug has some sync requests on it
		 *
		 * If we're doing the regular list and there are
		 * sync requests sitting around, unplug before
		 * we add more
		 */
		if (pending_bios == &device->pending_sync_bios) {
			sync_pending = 1;
		} else if (sync_pending) {
			blk_finish_plug(&plug);
			blk_start_plug(&plug);
			sync_pending = 0;
		}

		btrfsic_submit_bio(cur);
		num_run++;
		batch_run++;

		cond_resched();

		/*
		 * we made progress, there is more work to do and the bdi
		 * is now congested. Back off and let other work structs
		 * run instead
		 */
		if (pending && bdi_write_congested(bdi) && batch_run > 8 &&
		    fs_info->fs_devices->open_devices > 1) {
			struct io_context *ioc;

			ioc = current->io_context;

			/*
			 * the main goal here is that we don't want to
			 * block if we're going to be able to submit
			 * more requests without blocking.
			 *
			 * This code does two great things, it pokes into
			 * the elevator code from a filesystem _and_
			 * it makes assumptions about how batching works.
			 */
			if (ioc && ioc->nr_batch_requests > 0 &&
			    time_before(jiffies, ioc->last_waited + HZ/50UL) &&
			    (last_waited == 0 ||
			     ioc->last_waited == last_waited)) {
				/*
				 * we want to go through our batch of
				 * requests and stop. So, we copy out
				 * the ioc->last_waited time and test
				 * against it before looping
				 */
				last_waited = ioc->last_waited;
				cond_resched();
				continue;
			}
			spin_lock(&device->io_lock);
			requeue_list(pending_bios, pending, tail);
			device->running_pending = 1;

			spin_unlock(&device->io_lock);
			btrfs_queue_work(fs_info->submit_workers,
					 &device->work);
			goto done;
		}
	}

	cond_resched();
	if (again)
		goto loop;

	spin_lock(&device->io_lock);
	if (device->pending_bios.head || device->pending_sync_bios.head)
		goto loop_lock;
	spin_unlock(&device->io_lock);

done:
	blk_finish_plug(&plug);
}

static void pending_bios_fn(struct btrfs_work *work)
{
	struct btrfs_device *device;

	device = container_of(work, struct btrfs_device, work);
	run_scheduled_bios(device);
}
|
|
|
|
|
2018-01-18 22:00:37 +08:00
|
|
|
/*
|
|
|
|
* Search and remove all stale (devices which are not mounted) devices.
|
|
|
|
* When both inputs are NULL, it will search and release all stale devices.
|
|
|
|
* path: Optional. When provided will it release all unmounted devices
|
|
|
|
* matching this path only.
|
|
|
|
* skip_dev: Optional. Will skip this device when searching for the stale
|
|
|
|
* devices.
|
|
|
|
*/
|
|
|
|
static void btrfs_free_stale_devices(const char *path,
|
2018-05-29 15:33:08 +08:00
|
|
|
struct btrfs_device *skip_device)
|
2015-06-17 21:10:48 +08:00
|
|
|
{
|
2018-05-29 15:33:08 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices, *tmp_fs_devices;
|
|
|
|
struct btrfs_device *device, *tmp_device;
|
2015-06-17 21:10:48 +08:00
|
|
|
|
2018-05-29 15:33:08 +08:00
|
|
|
list_for_each_entry_safe(fs_devices, tmp_fs_devices, &fs_uuids, fs_list) {
|
2018-05-29 17:23:20 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
|
|
|
if (fs_devices->opened) {
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2015-06-17 21:10:48 +08:00
|
|
|
continue;
|
2018-05-29 17:23:20 +08:00
|
|
|
}
|
2015-06-17 21:10:48 +08:00
|
|
|
|
2018-05-29 15:33:08 +08:00
|
|
|
list_for_each_entry_safe(device, tmp_device,
|
|
|
|
&fs_devices->devices, dev_list) {
|
2018-01-18 22:00:35 +08:00
|
|
|
int not_found = 0;
|
2015-06-17 21:10:48 +08:00
|
|
|
|
2018-05-29 15:33:08 +08:00
|
|
|
if (skip_device && skip_device == device)
|
2018-01-18 22:00:37 +08:00
|
|
|
continue;
|
2018-05-29 15:33:08 +08:00
|
|
|
if (path && !device->name)
|
2015-06-17 21:10:48 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2018-01-18 22:00:37 +08:00
|
|
|
if (path)
|
2018-05-29 15:33:08 +08:00
|
|
|
not_found = strcmp(rcu_str_deref(device->name),
|
2018-01-18 22:00:37 +08:00
|
|
|
path);
|
2015-06-17 21:10:48 +08:00
|
|
|
rcu_read_unlock();
|
2018-01-18 22:00:34 +08:00
|
|
|
if (not_found)
|
|
|
|
continue;
|
2015-06-17 21:10:48 +08:00
|
|
|
|
|
|
|
/* delete the stale device */
|
2018-05-29 17:23:20 +08:00
|
|
|
fs_devices->num_devices--;
|
|
|
|
list_del(&device->dev_list);
|
|
|
|
btrfs_free_device(device);
|
|
|
|
|
|
|
|
if (fs_devices->num_devices == 0)
|
2018-01-30 22:07:37 +08:00
|
|
|
break;
|
2018-05-29 17:23:20 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
if (fs_devices->num_devices == 0) {
|
|
|
|
btrfs_sysfs_remove_fsid(fs_devices);
|
|
|
|
list_del(&fs_devices->fs_list);
|
|
|
|
free_fs_devices(fs_devices);
|
2015-06-17 21:10:48 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
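
The filtering rules of btrfs_free_stale_devices() can be tried out in userspace. The following is a minimal sketch, not the kernel implementation: it uses a plain singly linked list instead of the kernel list API, takes no locks, and every name ending in _sketch is hypothetical. It only mirrors the decisions visible above: skip the device passed as skip_device, skip unnamed or non-matching devices when a path is given, and unlink an entry before freeing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct btrfs_device: only a name and a link. */
struct stale_dev_sketch {
	const char *name;			/* may be NULL, like device->name */
	struct stale_dev_sketch *next;
};

/* Prepend a device; returns the new head (or the old head if malloc fails). */
static struct stale_dev_sketch *push_dev_sketch(struct stale_dev_sketch *head,
						const char *name)
{
	struct stale_dev_sketch *dev = malloc(sizeof(*dev));

	if (!dev)
		return head;
	dev->name = name;
	dev->next = head;
	return dev;
}

/*
 * Free every entry that matches 'path' (or every entry when path is NULL),
 * except 'skip'. Returns the number of entries freed.
 */
static size_t free_stale_sketch(struct stale_dev_sketch **head,
				const char *path,
				const struct stale_dev_sketch *skip)
{
	struct stale_dev_sketch **link = head;
	size_t freed = 0;

	while (*link) {
		struct stale_dev_sketch *dev = *link;

		/* Same three "continue" conditions as the kernel loop. */
		if (dev == skip ||
		    (path && (!dev->name || strcmp(dev->name, path) != 0))) {
			link = &dev->next;
			continue;
		}

		/* Unlink first, then free, so the walk never touches freed memory. */
		*link = dev->next;
		free(dev);
		freed++;
	}
	return freed;
}
```

The pointer-to-pointer walk plays the role that list_for_each_entry_safe() plays in the kernel code: the next position is known before the current entry is released.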
|
|
|
|
|
2017-11-09 23:45:24 +08:00
|
|
|
static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
|
|
|
|
struct btrfs_device *device, fmode_t flags,
|
|
|
|
void *holder)
|
|
|
|
{
|
|
|
|
struct request_queue *q;
|
|
|
|
struct block_device *bdev;
|
|
|
|
struct buffer_head *bh;
|
|
|
|
struct btrfs_super_block *disk_super;
|
|
|
|
u64 devid;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (device->bdev)
|
|
|
|
return -EINVAL;
|
|
|
|
if (!device->name)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
ret = btrfs_get_bdev_and_sb(device->name->str, flags, holder, 1,
|
|
|
|
&bdev, &bh);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
disk_super = (struct btrfs_super_block *)bh->b_data;
|
|
|
|
devid = btrfs_stack_device_id(&disk_super->dev_item);
|
|
|
|
if (devid != device->devid)
|
|
|
|
goto error_brelse;
|
|
|
|
|
|
|
|
if (memcmp(device->uuid, disk_super->dev_item.uuid, BTRFS_UUID_SIZE))
|
|
|
|
goto error_brelse;
|
|
|
|
|
|
|
|
device->generation = btrfs_super_generation(disk_super);
|
|
|
|
|
|
|
|
if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
|
2017-12-04 12:54:52 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
2017-11-09 23:45:24 +08:00
|
|
|
fs_devices->seeding = 1;
|
|
|
|
} else {
|
2017-12-04 12:54:52 +08:00
|
|
|
if (bdev_read_only(bdev))
|
|
|
|
clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
|
|
|
else
|
|
|
|
set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
2017-11-09 23:45:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
q = bdev_get_queue(bdev);
|
|
|
|
if (!blk_queue_nonrot(q))
|
|
|
|
fs_devices->rotating = 1;
|
|
|
|
|
|
|
|
device->bdev = bdev;
|
2017-12-04 12:54:53 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
|
2017-11-09 23:45:24 +08:00
|
|
|
device->mode = flags;
|
|
|
|
|
|
|
|
fs_devices->open_devices++;
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
|
|
|
|
device->devid != BTRFS_DEV_REPLACE_DEVID) {
|
2017-11-09 23:45:24 +08:00
|
|
|
fs_devices->rw_devices++;
|
2018-01-23 06:49:37 +08:00
|
|
|
list_add_tail(&device->dev_alloc_list, &fs_devices->alloc_list);
|
2017-11-09 23:45:24 +08:00
|
|
|
}
|
|
|
|
brelse(bh);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
error_brelse:
|
|
|
|
brelse(bh);
|
|
|
|
blkdev_put(bdev, flags);
|
|
|
|
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2014-03-27 01:26:36 +08:00
|
|
|
/*
|
|
|
|
* Add new device to list of registered devices
|
|
|
|
*
|
|
|
|
* Returns:
|
2018-01-18 22:02:35 +08:00
|
|
|
* device pointer which was just added or updated when successful
|
|
|
|
* error pointer when failed
|
2014-03-27 01:26:36 +08:00
|
|
|
*/
|
2018-01-18 22:02:35 +08:00
|
|
|
static noinline struct btrfs_device *device_list_add(const char *path,
|
2018-05-29 12:28:37 +08:00
|
|
|
struct btrfs_super_block *disk_super,
|
|
|
|
bool *new_device_added)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_fs_devices *fs_devices;
|
2012-06-05 02:03:51 +08:00
|
|
|
struct rcu_string *name;
|
2008-03-25 03:02:07 +08:00
|
|
|
u64 found_transid = btrfs_super_generation(disk_super);
|
2018-01-18 22:02:36 +08:00
|
|
|
u64 devid = btrfs_stack_device_id(&disk_super->dev_item);
|
2008-03-25 03:02:07 +08:00
|
|
|
|
|
|
|
fs_devices = find_fsid(disk_super->fsid);
|
|
|
|
if (!fs_devices) {
|
2013-08-12 19:33:03 +08:00
|
|
|
fs_devices = alloc_fs_devices(disk_super->fsid);
|
|
|
|
if (IS_ERR(fs_devices))
|
2018-01-18 22:02:35 +08:00
|
|
|
return ERR_CAST(fs_devices);
|
2013-08-12 19:33:03 +08:00
|
|
|
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2018-04-12 10:29:25 +08:00
|
|
|
list_add(&fs_devices->fs_list, &fs_uuids);
|
2013-08-12 19:33:03 +08:00
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
device = NULL;
|
|
|
|
} else {
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2017-06-16 01:51:51 +08:00
|
|
|
device = find_device(fs_devices, devid,
|
|
|
|
disk_super->dev_item.uuid);
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2014-07-24 11:37:15 +08:00
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
if (!device) {
|
2018-05-29 14:10:20 +08:00
|
|
|
if (fs_devices->opened) {
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2018-01-18 22:02:35 +08:00
|
|
|
return ERR_PTR(-EBUSY);
|
2018-05-29 14:10:20 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2013-08-23 18:20:17 +08:00
|
|
|
device = btrfs_alloc_device(NULL, &devid,
|
|
|
|
disk_super->dev_item.uuid);
|
|
|
|
if (IS_ERR(device)) {
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2008-03-25 03:02:07 +08:00
|
|
|
/* we can safely leave the fs_devices entry around */
|
2018-01-18 22:02:35 +08:00
|
|
|
return device;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2012-06-05 02:03:51 +08:00
|
|
|
|
|
|
|
name = rcu_string_strdup(path, GFP_NOFS);
|
|
|
|
if (!name) {
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(device);
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2018-01-18 22:02:35 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2012-06-05 02:03:51 +08:00
|
|
|
rcu_assign_pointer(device->name, name);
|
2011-05-23 20:30:00 +08:00
|
|
|
|
2011-04-20 18:09:16 +08:00
|
|
|
list_add_rcu(&device->dev_list, &fs_devices->devices);
|
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
number of devices before acquiring the device list mutex.
This could lead to inconsistent results because the update of
the device list and the number of devices counter (amongst other
counters related to the device list) are updated in volumes.c
while holding the device list mutex - except for 2 places, one
was volumes.c:btrfs_prepare_sprout() and the other was
volumes.c:device_list_add().
For example, if we have 2 devices, with IDs 1 and 2 and then add
a new device, with ID 3, and while adding the device is in progress
a BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
devices of 2 and a max dev id of 3. This would be incorrect.
Also, this ioctl handler was reading the fsid while it can be
updated concurrently. This can happen while a new device is
being added and the current filesystem is in seeding mode.
Example:
$ mkfs.btrfs -f /dev/sdb1
$ mkfs.btrfs -f /dev/sdb2
$ btrfstune -S 1 /dev/sdb1
$ mount /dev/sdb1 /mnt/test
$ btrfs device add /dev/sdb2 /mnt/test
If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
could read an fsid that was never valid (some bits part of the old
fsid and others part of the new fsid). Also, it could read a number
of devices that doesn't match the number of devices in the list and
the max device id, as explained before.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-13 03:56:58 +08:00
|
|
|
fs_devices->num_devices++;
|
2009-06-11 03:17:02 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
device->fs_devices = fs_devices;
|
2018-05-29 12:28:37 +08:00
|
|
|
*new_device_added = true;
|
2018-01-18 22:02:33 +08:00
|
|
|
|
|
|
|
if (disk_super->label[0])
|
|
|
|
pr_info("BTRFS: device label %s devid %llu transid %llu %s\n",
|
|
|
|
disk_super->label, devid, found_transid, path);
|
|
|
|
else
|
|
|
|
pr_info("BTRFS: device fsid %pU devid %llu transid %llu %s\n",
|
|
|
|
disk_super->fsid, devid, found_transid, path);
|
|
|
|
|
2012-06-05 02:03:51 +08:00
|
|
|
} else if (!device->name || strcmp(device->name->str, path)) {
|
2014-07-03 18:22:05 +08:00
|
|
|
/*
|
|
|
|
* When FS is already mounted.
|
|
|
|
* 1. If you are here and if the device->name is NULL that
|
|
|
|
* means this device was missing at time of FS mount.
|
|
|
|
* 2. If you are here and if the device->name is different
|
|
|
|
* from 'path' that means either
|
|
|
|
* a. The same device disappeared and reappeared with
|
|
|
|
* different name. or
|
|
|
|
* b. The missing-disk-which-was-replaced, has
|
|
|
|
* reappeared now.
|
|
|
|
*
|
|
|
|
* We must allow 1 and 2a above. But 2b would be spurious
|
|
|
|
* and unintentional.
|
|
|
|
*
|
|
|
|
* Further in case of 1 and 2a above, the disk at 'path'
|
|
|
|
* would have missed some transaction when it was away and
|
|
|
|
* in case of 2a the stale bdev has to be updated as well.
|
|
|
|
* 2b must never be allowed.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
2014-09-18 22:49:05 +08:00
|
|
|
* For now, we do allow update to btrfs_fs_device through the
|
|
|
|
* btrfs dev scan cli after FS has been mounted. We're still
|
|
|
|
* tracking a problem where systems fail mount by subvolume id
|
|
|
|
* when we reject replacement on a mounted FS.
|
2014-07-03 18:22:05 +08:00
|
|
|
*/
|
2014-09-18 22:49:05 +08:00
|
|
|
if (!fs_devices->opened && found_transid < device->generation) {
|
2014-07-03 18:22:06 +08:00
|
|
|
/*
|
|
|
|
* That is if the FS is _not_ mounted and if you
|
|
|
|
* are here, that means there is more than one
|
|
|
|
* disk with same uuid and devid. We keep the one
|
|
|
|
* with larger generation number or the last-in if
|
|
|
|
* generations are equal.
|
|
|
|
*/
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2018-01-18 22:02:35 +08:00
|
|
|
return ERR_PTR(-EEXIST);
|
2014-07-03 18:22:06 +08:00
|
|
|
}
|
2014-07-03 18:22:05 +08:00
|
|
|
|
2012-06-05 02:03:51 +08:00
|
|
|
name = rcu_string_strdup(path, GFP_NOFS);
|
2018-05-29 14:10:20 +08:00
|
|
|
if (!name) {
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2018-01-18 22:02:35 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2018-05-29 14:10:20 +08:00
|
|
|
}
|
2012-06-05 02:03:51 +08:00
|
|
|
rcu_string_free(device->name);
|
|
|
|
rcu_assign_pointer(device->name, name);
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
|
2010-12-14 03:56:23 +08:00
|
|
|
fs_devices->missing_devices--;
|
2017-12-04 12:54:54 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
|
2010-12-14 03:56:23 +08:00
|
|
|
}
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
|
|
|
|
2014-07-03 18:22:06 +08:00
|
|
|
/*
|
|
|
|
* Unmount does not free the btrfs_device struct but would zero
|
|
|
|
* generation along with most of the other members. So just update
|
|
|
|
* it back. We need it to pick the disk with largest generation
|
|
|
|
* (as above).
|
|
|
|
*/
|
|
|
|
if (!fs_devices->opened)
|
|
|
|
device->generation = found_transid;
|
|
|
|
|
2018-01-18 22:02:34 +08:00
|
|
|
fs_devices->total_devices = btrfs_super_num_devices(disk_super);
|
|
|
|
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2018-01-18 22:02:35 +08:00
|
|
|
return device;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
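
The generation check in device_list_add() for an already known device whose path changed can be reduced to a small predicate. This is a hedged sketch with hypothetical names (scan_accepts_path_sketch and its parameters are not kernel symbols); it only restates the condition that leads to the -EEXIST return above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a rescanned path may replace the recorded one.
 *
 * fs_opened        stands in for fs_devices->opened being non-zero
 * found_transid    generation read from the scanned super block
 * known_generation generation already stored in the device
 */
static bool scan_accepts_path_sketch(bool fs_opened, uint64_t found_transid,
				     uint64_t known_generation)
{
	/* Unmounted and strictly older: the kernel code returns -EEXIST. */
	if (!fs_opened && found_transid < known_generation)
		return false;

	/* Equal generations are accepted, so the last device scanned wins. */
	return true;
}
```

Because the comparison is strict, two disks with the same uuid, devid and generation resolve to whichever one was scanned last, which matches the "last-in" wording of the comment in the function.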
|
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_device *orig_dev;
|
|
|
|
|
2013-08-12 19:33:03 +08:00
|
|
|
fs_devices = alloc_fs_devices(orig->fsid);
|
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return fs_devices;
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2014-09-03 21:35:42 +08:00
|
|
|
mutex_lock(&orig->device_list_mutex);
|
2012-06-22 04:03:58 +08:00
|
|
|
fs_devices->total_devices = orig->total_devices;
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2011-04-20 18:08:47 +08:00
|
|
|
/* We have held the volume lock, it is safe to get the devices. */
|
2008-12-12 23:03:26 +08:00
|
|
|
list_for_each_entry(orig_dev, &orig->devices, dev_list) {
|
2012-06-05 02:03:51 +08:00
|
|
|
struct rcu_string *name;
|
|
|
|
|
2013-08-23 18:20:17 +08:00
|
|
|
device = btrfs_alloc_device(NULL, &orig_dev->devid,
|
|
|
|
orig_dev->uuid);
|
|
|
|
if (IS_ERR(device))
|
2008-12-12 23:03:26 +08:00
|
|
|
goto error;
|
|
|
|
|
2012-06-05 02:03:51 +08:00
|
|
|
/*
|
|
|
|
* This is ok to do without rcu read locked because we hold the
|
|
|
|
* uuid mutex so nothing we touch in here is going to disappear.
|
|
|
|
*/
|
2014-06-30 17:12:47 +08:00
|
|
|
if (orig_dev->name) {
|
2016-02-11 21:25:38 +08:00
|
|
|
name = rcu_string_strdup(orig_dev->name->str,
|
|
|
|
GFP_KERNEL);
|
2014-06-30 17:12:47 +08:00
|
|
|
if (!name) {
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(device);
|
2014-06-30 17:12:47 +08:00
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
rcu_assign_pointer(device->name, name);
|
2009-09-30 01:51:04 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
|
|
|
|
list_add(&device->dev_list, &fs_devices->devices);
|
|
|
|
device->fs_devices = fs_devices;
|
|
|
|
fs_devices->num_devices++;
|
|
|
|
}
|
2014-09-03 21:35:42 +08:00
|
|
|
mutex_unlock(&orig->device_list_mutex);
|
2008-12-12 23:03:26 +08:00
|
|
|
return fs_devices;
|
|
|
|
error:
|
2014-09-03 21:35:42 +08:00
|
|
|
mutex_unlock(&orig->device_list_mutex);
|
2008-12-12 23:03:26 +08:00
|
|
|
free_fs_devices(fs_devices);
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
|
2018-02-27 12:41:59 +08:00
|
|
|
/*
|
|
|
|
* After we have read the system tree and know devids belonging to
|
|
|
|
* this filesystem, remove the device which does not belong there.
|
|
|
|
*/
|
|
|
|
void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step)
|
2008-05-14 01:46:40 +08:00
|
|
|
{
|
2009-01-21 23:59:08 +08:00
|
|
|
struct btrfs_device *device, *next;
|
2014-07-24 11:37:15 +08:00
|
|
|
struct btrfs_device *latest_dev = NULL;
|
2012-02-21 09:53:43 +08:00
|
|
|
|
2008-05-14 01:46:40 +08:00
|
|
|
mutex_lock(&uuid_mutex);
|
|
|
|
again:
|
2011-04-20 18:08:47 +08:00
|
|
|
/* This is the initialized path, it is safe to release the devices. */
|
2009-01-21 23:59:08 +08:00
|
|
|
list_for_each_entry_safe(device, next, &fs_devices->devices, dev_list) {
|
2017-12-04 12:54:53 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
|
|
|
|
&device->dev_state)) {
|
2017-12-04 12:54:55 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
|
|
|
|
&device->dev_state) &&
|
|
|
|
(!latest_dev ||
|
|
|
|
device->generation > latest_dev->generation)) {
|
2014-07-24 11:37:15 +08:00
|
|
|
latest_dev = device;
|
2012-02-21 09:53:43 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
continue;
|
2012-02-21 09:53:43 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2012-11-06 20:15:27 +08:00
|
|
|
if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
|
|
|
|
/*
|
|
|
|
* In the first step, keep the device which has
|
|
|
|
* the correct fsid and the devid that is used
|
|
|
|
* for the dev_replace procedure.
|
|
|
|
* In the second step, the dev_replace state is
|
|
|
|
* read from the device tree and it is known
|
|
|
|
* whether the procedure is really active or
|
|
|
|
* not, which means whether this device is
|
|
|
|
* used or whether it should be removed.
|
|
|
|
*/
|
2017-12-04 12:54:55 +08:00
|
|
|
if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
|
|
|
|
&device->dev_state)) {
|
2012-11-06 20:15:27 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
if (device->bdev) {
|
2010-11-13 18:55:18 +08:00
|
|
|
blkdev_put(device->bdev, device->mode);
|
2008-11-18 10:11:30 +08:00
|
|
|
device->bdev = NULL;
|
|
|
|
fs_devices->open_devices--;
|
|
|
|
}
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2008-11-18 10:11:30 +08:00
|
|
|
list_del_init(&device->dev_alloc_list);
|
2017-12-04 12:54:52 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
2017-12-04 12:54:55 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
|
|
|
|
&device->dev_state))
|
2012-11-06 20:15:27 +08:00
|
|
|
fs_devices->rw_devices--;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
list_del_init(&device->dev_list);
|
|
|
|
fs_devices->num_devices--;
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(device);
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
if (fs_devices->seed) {
|
|
|
|
fs_devices = fs_devices->seed;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
2014-07-24 11:37:15 +08:00
|
|
|
fs_devices->latest_bdev = latest_dev->bdev;
|
2012-02-21 09:53:43 +08:00
|
|
|
|
2008-05-14 01:46:40 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
}
|
2008-05-14 04:03:06 +08:00
|
|
|
|
2017-06-06 23:08:23 +08:00
|
|
|
static void free_device_rcu(struct rcu_head *head)
|
2011-04-20 18:09:16 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
|
|
|
|
2017-10-24 13:02:54 +08:00
|
|
|
device = container_of(head, struct btrfs_device, rcu);
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(device);
|
2011-04-20 18:09:16 +08:00
|
|
|
}
|
|
|
|
|
2016-07-22 06:04:53 +08:00
|
|
|
static void btrfs_close_bdev(struct btrfs_device *device)
|
|
|
|
{
|
2017-06-19 22:55:35 +08:00
|
|
|
if (!device->bdev)
|
|
|
|
return;
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2016-07-22 06:04:53 +08:00
|
|
|
sync_blockdev(device->bdev);
|
|
|
|
invalidate_bdev(device->bdev);
|
|
|
|
}
|
|
|
|
|
2017-06-19 22:55:35 +08:00
|
|
|
blkdev_put(device->bdev, device->mode);
|
2016-07-22 06:04:53 +08:00
|
|
|
}
|
|
|
|
|
2018-06-29 13:26:05 +08:00
|
|
|
static void btrfs_close_one_device(struct btrfs_device *device)
|
2016-06-14 18:55:25 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = device->fs_devices;
|
|
|
|
struct btrfs_device *new_device;
|
|
|
|
struct rcu_string *name;
|
|
|
|
|
|
|
|
if (device->bdev)
|
|
|
|
fs_devices->open_devices--;
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
|
2016-06-14 18:55:25 +08:00
|
|
|
device->devid != BTRFS_DEV_REPLACE_DEVID) {
|
|
|
|
list_del_init(&device->dev_alloc_list);
|
|
|
|
fs_devices->rw_devices--;
|
|
|
|
}
|
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
|
2016-06-14 18:55:25 +08:00
|
|
|
fs_devices->missing_devices--;
|
|
|
|
|
2018-06-29 13:26:05 +08:00
|
|
|
btrfs_close_bdev(device);
|
|
|
|
|
2016-06-14 18:55:25 +08:00
|
|
|
new_device = btrfs_alloc_device(NULL, &device->devid,
|
|
|
|
device->uuid);
|
|
|
|
BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
|
|
|
|
|
|
|
|
/* Safe because we are under uuid_mutex */
|
|
|
|
if (device->name) {
|
|
|
|
name = rcu_string_strdup(device->name->str, GFP_NOFS);
|
|
|
|
BUG_ON(!name); /* -ENOMEM */
|
|
|
|
rcu_assign_pointer(new_device->name, name);
|
|
|
|
}
|
|
|
|
|
|
|
|
list_replace_rcu(&device->dev_list, &new_device->dev_list);
|
|
|
|
new_device->fs_devices = device->fs_devices;
|
2018-06-29 13:26:05 +08:00
|
|
|
|
|
|
|
call_rcu(&device->rcu, free_device_rcu);
|
2016-06-14 18:55:25 +08:00
|
|
|
}
|
|
|
|
|
2018-04-12 10:29:27 +08:00
|
|
|
static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
2015-05-13 07:31:37 +08:00
|
|
|
struct btrfs_device *device, *tmp;
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (--fs_devices->opened > 0)
|
|
|
|
return 0;
|
2008-03-25 03:02:07 +08:00
|
|
|
|
2011-04-20 18:07:30 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2015-05-13 07:31:37 +08:00
|
|
|
list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
|
2018-06-29 13:26:05 +08:00
|
|
|
btrfs_close_one_device(device);
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2011-04-20 18:07:30 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
WARN_ON(fs_devices->open_devices);
|
|
|
|
WARN_ON(fs_devices->rw_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->opened = 0;
|
|
|
|
fs_devices->seeding = 0;
|
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
|
|
|
|
{
|
2008-12-12 23:03:26 +08:00
|
|
|
struct btrfs_fs_devices *seed_devices = NULL;
|
2008-11-18 10:11:30 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
mutex_lock(&uuid_mutex);
|
2018-04-12 10:29:27 +08:00
|
|
|
ret = close_fs_devices(fs_devices);
|
2008-12-12 23:03:26 +08:00
|
|
|
if (!fs_devices->opened) {
|
|
|
|
seed_devices = fs_devices->seed;
|
|
|
|
fs_devices->seed = NULL;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
2008-12-12 23:03:26 +08:00
|
|
|
|
|
|
|
while (seed_devices) {
|
|
|
|
fs_devices = seed_devices;
|
|
|
|
seed_devices = fs_devices->seed;
|
2018-04-12 10:29:27 +08:00
|
|
|
close_fs_devices(fs_devices);
|
2008-12-12 23:03:26 +08:00
|
|
|
free_fs_devices(fs_devices);
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-04-12 10:29:28 +08:00
|
|
|
static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
|
2008-12-12 23:03:26 +08:00
|
|
|
fmode_t flags, void *holder)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
2014-07-24 11:37:15 +08:00
|
|
|
struct btrfs_device *latest_dev = NULL;
|
2008-05-14 04:03:06 +08:00
|
|
|
int ret = 0;
|
2008-03-25 03:02:07 +08:00
|
|
|
|
2010-11-13 18:55:18 +08:00
|
|
|
flags |= FMODE_EXCL;
|
|
|
|
|
2018-04-12 10:29:26 +08:00
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list) {
|
2013-04-05 04:45:08 +08:00
|
|
|
/* Just open everything we can; ignore failures here */
|
2017-11-09 23:45:24 +08:00
|
|
|
if (btrfs_open_one_device(fs_devices, device, flags, holder))
|
2012-11-12 21:03:45 +08:00
|
|
|
continue;
|
2008-05-14 04:03:06 +08:00
|
|
|
|
2017-11-09 23:45:25 +08:00
|
|
|
if (!latest_dev ||
|
|
|
|
device->generation > latest_dev->generation)
|
|
|
|
latest_dev = device;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2008-05-14 04:03:06 +08:00
|
|
|
if (fs_devices->open_devices == 0) {
|
2011-10-20 05:06:20 +08:00
|
|
|
ret = -EINVAL;
|
2008-05-14 04:03:06 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->opened = 1;
|
2014-07-24 11:37:15 +08:00
|
|
|
fs_devices->latest_bdev = latest_dev->bdev;
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->total_rw_bytes = 0;
|
2008-05-14 04:03:06 +08:00
|
|
|
out:
|
2008-11-18 10:11:30 +08:00
|
|
|
return ret;
|
|
|
|
}
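
The way open_fs_devices() selects latest_dev can be illustrated in isolation. The sketch below is a userspace approximation under stated assumptions: dev_sketch and pick_latest_sketch are hypothetical, and the open_ok flag stands in for btrfs_open_one_device() returning 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device record: a generation and whether opening succeeded. */
struct dev_sketch {
	uint64_t generation;
	bool open_ok;
};

/*
 * Returns the index of the successfully opened device with the highest
 * generation, or -1 when no device could be opened (the -EINVAL case).
 */
static int pick_latest_sketch(const struct dev_sketch *devs, size_t n)
{
	int latest = -1;

	for (size_t i = 0; i < n; i++) {
		/* Failed opens are ignored, as in the kernel loop. */
		if (!devs[i].open_ok)
			continue;

		/* Strict '>' keeps the earliest device when generations tie. */
		if (latest < 0 || devs[i].generation > devs[latest].generation)
			latest = (int)i;
	}
	return latest;
}
```

Since btrfs_open_devices() sorts the list by devid before calling open_fs_devices(), a tie in this scheme would go to the device with the lowest devid.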
|
|
|
|
|
2018-01-23 06:49:36 +08:00
|
|
|
static int devid_cmp(void *priv, struct list_head *a, struct list_head *b)
|
|
|
|
{
|
|
|
|
struct btrfs_device *dev1, *dev2;
|
|
|
|
|
|
|
|
dev1 = list_entry(a, struct btrfs_device, dev_list);
|
|
|
|
dev2 = list_entry(b, struct btrfs_device, dev_list);
|
|
|
|
|
|
|
|
if (dev1->devid < dev2->devid)
|
|
|
|
return -1;
|
|
|
|
else if (dev1->devid > dev2->devid)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
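
The three-way comparison in devid_cmp() can be exercised in userspace with qsort(). This is only a sketch: devid_cmp_sketch and sort_devids_sketch are hypothetical names, and qsort() replaces the kernel's list_sort().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace counterpart of devid_cmp() for plain u64 arrays. */
static int devid_cmp_sketch(const void *a, const void *b)
{
	uint64_t id1 = *(const uint64_t *)a;
	uint64_t id2 = *(const uint64_t *)b;

	/*
	 * Compare explicitly instead of returning id1 - id2: a 64-bit
	 * difference converted to int could lose its high bits and
	 * report the wrong order.
	 */
	if (id1 < id2)
		return -1;
	if (id1 > id2)
		return 1;
	return 0;
}

/* Sort an array of device ids in ascending order. */
static void sort_devids_sketch(uint64_t *ids, size_t n)
{
	qsort(ids, n, sizeof(ids[0]), devid_cmp_sketch);
}
```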
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
|
2008-12-02 19:36:09 +08:00
|
|
|
fmode_t flags, void *holder)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2018-06-19 23:09:47 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
|
|
|
|
2018-04-12 10:29:34 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (fs_devices->opened) {
|
2008-12-12 23:03:26 +08:00
|
|
|
fs_devices->opened++;
|
|
|
|
ret = 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
} else {
|
2018-01-23 06:49:36 +08:00
|
|
|
list_sort(NULL, &fs_devices->devices, devid_cmp);
|
2018-04-12 10:29:28 +08:00
|
|
|
ret = open_fs_devices(fs_devices, flags, holder);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2018-04-12 10:29:34 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
return ret;
|
|
|
|
}
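
The 'opened' field acts as a reference count across btrfs_open_devices() and close_fs_devices(). The following sketch models only that counter, assuming the real open always succeeds; the struct and function names are hypothetical and no locking is shown.

```c
#include <assert.h>

/* Hypothetical model of the 'opened' counter in struct btrfs_fs_devices. */
struct fs_devices_sketch {
	int opened;		/* number of users holding the devices open */
	int real_opens;		/* how often devices were actually opened */
	int real_closes;	/* how often devices were actually closed */
};

/* Mirrors btrfs_open_devices(): only the first caller opens for real. */
static void open_devices_sketch(struct fs_devices_sketch *fs)
{
	if (fs->opened) {
		fs->opened++;
		return;
	}
	fs->real_opens++;	/* assumption: the real open cannot fail here */
	fs->opened = 1;
}

/* Mirrors close_fs_devices(): only the last caller closes for real. */
static void close_devices_sketch(struct fs_devices_sketch *fs)
{
	if (--fs->opened > 0)
		return;
	fs->real_closes++;
	fs->opened = 0;
}
```

In the kernel code the first open can fail, in which case 'opened' stays at zero; the sketch leaves that path out to keep the counting logic visible.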
|
|
|
|
|
2017-08-23 14:46:04 +08:00
|
|
|
static void btrfs_release_disk_super(struct page *page)
|
2016-02-13 10:01:29 +08:00
|
|
|
{
|
|
|
|
kunmap(page);
|
|
|
|
put_page(page);
|
|
|
|
}
|
|
|
|
|
2017-08-23 14:46:04 +08:00
|
|
|
static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr,
|
|
|
|
struct page **page,
|
|
|
|
struct btrfs_super_block **disk_super)
|
2016-02-13 10:01:29 +08:00
|
|
|
{
|
|
|
|
void *p;
|
|
|
|
pgoff_t index;
|
|
|
|
|
|
|
|
/* make sure our super fits in the device */
|
|
|
|
if (bytenr + PAGE_SIZE >= i_size_read(bdev->bd_inode))
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* make sure our super fits in the page */
|
|
|
|
if (sizeof(**disk_super) > PAGE_SIZE)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* make sure our super doesn't straddle pages on disk */
|
|
|
|
index = bytenr >> PAGE_SHIFT;
|
|
|
|
if ((bytenr + sizeof(**disk_super) - 1) >> PAGE_SHIFT != index)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* pull in the page with our super */
|
|
|
|
*page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
|
|
|
|
index, GFP_KERNEL);
|
|
|
|
|
|
|
|
if (IS_ERR_OR_NULL(*page))
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
p = kmap(*page);
|
|
|
|
|
|
|
|
/* align our pointer to the offset of the super block */
|
|
|
|
*disk_super = p + (bytenr & ~PAGE_MASK);
|
|
|
|
|
|
|
|
if (btrfs_super_bytenr(*disk_super) != bytenr ||
|
|
|
|
btrfs_super_magic(*disk_super) != BTRFS_MAGIC) {
|
|
|
|
btrfs_release_disk_super(*page);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((*disk_super)->label[0] &&
|
|
|
|
(*disk_super)->label[BTRFS_LABEL_SIZE - 1])
|
|
|
|
(*disk_super)->label[BTRFS_LABEL_SIZE - 1] = '\0';
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
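
The bounds and page arithmetic in btrfs_read_disk_super() can be checked with ordinary integers. This sketch assumes 4 KiB pages and a 4 KiB super block purely for illustration; the SKETCH_* constants and sb_* helpers are hypothetical, and 65536 is used in the note after the code only as an example offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for illustration only: 4 KiB pages, 4 KiB super block. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_SIZE	(UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_SUPER_SIZE	UINT64_C(4096)

/* Page cache index that holds the first byte of the super block. */
static uint64_t sb_page_index(uint64_t bytenr)
{
	return bytenr >> SKETCH_PAGE_SHIFT;
}

/* Offset inside that page; same value as 'bytenr & ~PAGE_MASK'. */
static uint64_t sb_offset_in_page(uint64_t bytenr)
{
	return bytenr & (SKETCH_PAGE_SIZE - 1);
}

/* True when the last byte of the super block lands on a different page. */
static bool sb_straddles_pages(uint64_t bytenr)
{
	return ((bytenr + SKETCH_SUPER_SIZE - 1) >> SKETCH_PAGE_SHIFT) !=
	       sb_page_index(bytenr);
}

/* Inverse of the 'bytenr + PAGE_SIZE >= device size' rejection. */
static bool sb_fits_device(uint64_t bytenr, uint64_t dev_size)
{
	return bytenr + SKETCH_PAGE_SIZE < dev_size;
}
```

With these assumptions an offset of 65536 maps to page index 16 at offset 0 and does not straddle a page, while any offset that is not page aligned would be rejected by the straddle check.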
|
|
|
|
|
2013-02-16 02:31:02 +08:00
|
|
|
/*
|
|
|
|
* Look for a btrfs signature on a device. This may be called out of the mount path
|
|
|
|
* and we are not allowed to call set_blocksize during the scan. The superblock
|
|
|
|
* is read via pagecache
|
|
|
|
*/
|
2018-07-12 14:23:16 +08:00
|
|
|
struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
|
|
|
|
void *holder)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
|
|
|
struct btrfs_super_block *disk_super;
|
2018-05-29 12:28:37 +08:00
|
|
|
bool new_device_added = false;
|
2018-07-12 14:23:16 +08:00
|
|
|
struct btrfs_device *device = NULL;
|
2008-03-25 03:02:07 +08:00
|
|
|
struct block_device *bdev;
|
2013-02-16 02:31:02 +08:00
|
|
|
struct page *page;
|
|
|
|
u64 bytenr;
|
2008-03-25 03:02:07 +08:00
|
|
|
|
2018-06-19 22:37:36 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
|
|
|
|
2013-02-16 02:31:02 +08:00
|
|
|
/*
|
|
|
|
* we would like to check all the supers, but that would make
|
|
|
|
* a btrfs mount succeed after a mkfs from a different FS.
|
|
|
|
* So, we need to add a special mount option to scan for
|
|
|
|
* later supers, using BTRFS_SUPER_MIRROR_MAX instead
|
|
|
|
*/
|
|
|
|
bytenr = btrfs_sb_offset(0);
|
2010-11-13 18:55:18 +08:00
|
|
|
flags |= FMODE_EXCL;
|
2013-02-16 02:31:02 +08:00
|
|
|
|
|
|
|
bdev = blkdev_get_by_path(path, flags, holder);
|
2018-04-12 10:29:24 +08:00
|
|
|
if (IS_ERR(bdev))
|
2018-07-12 14:23:16 +08:00
|
|
|
return ERR_CAST(bdev);
|
2013-02-16 02:31:02 +08:00
|
|
|
|
2017-12-15 15:40:16 +08:00
|
|
|
if (btrfs_read_disk_super(bdev, bytenr, &page, &disk_super)) {
|
2018-07-12 14:23:16 +08:00
|
|
|
device = ERR_PTR(-EINVAL);
|
2013-02-16 02:31:02 +08:00
|
|
|
goto error_bdev_put;
|
2017-12-15 15:40:16 +08:00
|
|
|
}
|
2013-02-16 02:31:02 +08:00
|
|
|
|
2018-05-29 12:28:37 +08:00
|
|
|
device = device_list_add(path, disk_super, &new_device_added);
|
2018-07-12 14:23:16 +08:00
|
|
|
if (!IS_ERR(device)) {
|
2018-05-29 12:28:37 +08:00
|
|
|
if (new_device_added)
|
|
|
|
btrfs_free_stale_devices(path, device);
|
|
|
|
}
|
2013-02-16 02:31:02 +08:00
|
|
|
|
2016-02-13 10:01:29 +08:00
|
|
|
btrfs_release_disk_super(page);
|
2013-02-16 02:31:02 +08:00
|
|
|
|
|
|
|
error_bdev_put:
|
2010-11-13 18:55:18 +08:00
|
|
|
blkdev_put(bdev, flags);
|
2018-04-12 10:29:24 +08:00
|
|
|
|
2018-07-12 14:23:16 +08:00
|
|
|
return device;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2015-06-15 21:41:17 +08:00
|
|
|
static int contains_pending_extent(struct btrfs_transaction *transaction,
|
2013-06-28 01:22:46 +08:00
|
|
|
struct btrfs_device *device,
|
|
|
|
u64 *start, u64 len)
|
|
|
|
{
|
2016-06-23 06:54:56 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
2013-06-28 01:22:46 +08:00
|
|
|
struct extent_map *em;
|
2015-06-15 21:41:17 +08:00
|
|
|
struct list_head *search_list = &fs_info->pinned_chunks;
|
2013-06-28 01:22:46 +08:00
|
|
|
int ret = 0;
|
2015-02-09 17:30:47 +08:00
|
|
|
u64 physical_start = *start;
|
2013-06-28 01:22:46 +08:00
|
|
|
|
2015-06-15 21:41:17 +08:00
|
|
|
if (transaction)
|
|
|
|
search_list = &transaction->pending_chunks;
|
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or join an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-28 05:14:15 +08:00
|
|
|
again:
|
|
|
|
list_for_each_entry(em, search_list, list) {
|
2013-06-28 01:22:46 +08:00
|
|
|
struct map_lookup *map;
|
|
|
|
int i;
|
|
|
|
|
2015-06-03 22:55:48 +08:00
|
|
|
map = em->map_lookup;
|
2013-06-28 01:22:46 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
2015-05-14 17:46:03 +08:00
|
|
|
u64 end;
|
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
if (map->stripes[i].dev != device)
|
|
|
|
continue;
|
2015-02-09 17:30:47 +08:00
|
|
|
if (map->stripes[i].physical >= physical_start + len ||
|
2013-06-28 01:22:46 +08:00
|
|
|
map->stripes[i].physical + em->orig_block_len <=
|
2015-02-09 17:30:47 +08:00
|
|
|
physical_start)
|
2013-06-28 01:22:46 +08:00
|
|
|
continue;
|
2015-05-14 17:46:03 +08:00
|
|
|
/*
|
|
|
|
* Make sure that while processing the pinned list we do
|
|
|
|
* not override our *start with a lower value, because
|
|
|
|
* we can have pinned chunks that fall within this
|
|
|
|
* device hole and that have lower physical addresses
|
|
|
|
* than the pending chunks we processed before. If we
|
|
|
|
* do not take this special care we can end up getting
|
|
|
|
* 2 pending chunks that start at the same physical
|
|
|
|
* device offsets because the end offset of a pinned
|
|
|
|
* chunk can be equal to the start offset of some
|
|
|
|
* pending chunk.
|
|
|
|
*/
|
|
|
|
end = map->stripes[i].physical + em->orig_block_len;
|
|
|
|
if (end > *start) {
|
|
|
|
*start = end;
|
|
|
|
ret = 1;
|
|
|
|
}
|
2013-06-28 01:22:46 +08:00
|
|
|
}
|
|
|
|
}
|
2015-06-15 21:41:17 +08:00
|
|
|
if (search_list != &fs_info->pinned_chunks) {
|
|
|
|
search_list = &fs_info->pinned_chunks;
|
2014-11-28 05:14:15 +08:00
|
|
|
goto again;
|
|
|
|
}
|
2013-06-28 01:22:46 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
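The overlap test used by contains_pending_extent() above can be sketched in userspace as follows. This is a minimal illustration, not the kernel code: `struct stripe_range` and `bump_past_overlaps()` are hypothetical names, and a flat array stands in for the pending and pinned chunk lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one stripe of a pending or pinned chunk. */
struct stripe_range {
	uint64_t physical;	/* start offset on the device */
	uint64_t length;	/* bytes occupied on the device */
};

/*
 * Raise *start past every range that overlaps the window
 * [*start, *start + len) as it was on entry. Returns true if at least
 * one overlapping range moved *start. *start is never lowered.
 */
static bool bump_past_overlaps(const struct stripe_range *ranges, size_t n,
			       uint64_t *start, uint64_t len)
{
	assert(start != NULL);

	const uint64_t physical_start = *start;
	bool found = false;

	for (size_t i = 0; i < n; i++) {
		uint64_t end = ranges[i].physical + ranges[i].length;

		/* Skip ranges entirely after or entirely before the window. */
		if (ranges[i].physical >= physical_start + len ||
		    end <= physical_start)
			continue;

		/* Only move forward, so an earlier bump is never undone. */
		if (end > *start) {
			*start = end;
			found = true;
		}
	}
	return found;
}
```

Comparing against the entry value of the window while only ever raising `*start` is what keeps a range with a lower physical address from overriding a bump made by an earlier range.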
|
|
|
|
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
/*
|
2015-06-15 21:41:17 +08:00
|
|
|
* find_free_dev_extent_start - find free space in the specified device
|
|
|
|
* @device: the device which we search the free space in
|
|
|
|
* @num_bytes: the size of the free space that we need
|
|
|
|
* @search_start: the position from which to begin the search
|
|
|
|
* @start: store the start of the free space.
|
|
|
|
* @len: the size of the free space that we find, or the size
|
|
|
|
* of the max free space if we don't find suitable free space
|
2011-01-05 18:07:26 +08:00
|
|
|
*
|
2008-03-25 03:01:56 +08:00
|
|
|
* this uses a pretty simple search, the expectation is that it is
|
|
|
|
* called very infrequently and that a given device has a small number
|
|
|
|
* of extents
|
2011-01-05 18:07:26 +08:00
|
|
|
*
|
|
|
|
* @start is used to store the start of the free space if we find one. But if we
|
|
|
|
* don't find suitable free space, it will be used to store the start position
|
|
|
|
* of the max free space.
|
|
|
|
*
|
|
|
|
* @len is used to store the size of the free space that we find.
|
|
|
|
* But if we don't find suitable free space, it is used to store the size of
|
|
|
|
* the max free space.
|
2008-03-25 03:01:56 +08:00
|
|
|
*/
|
2015-06-15 21:41:17 +08:00
|
|
|
int find_free_dev_extent_start(struct btrfs_transaction *transaction,
|
|
|
|
struct btrfs_device *device, u64 num_bytes,
|
|
|
|
u64 search_start, u64 *start, u64 *len)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_key key;
|
2011-01-05 18:07:26 +08:00
|
|
|
struct btrfs_dev_extent *dev_extent;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_path *path;
|
2011-01-05 18:07:26 +08:00
|
|
|
u64 hole_size;
|
|
|
|
u64 max_hole_start;
|
|
|
|
u64 max_hole_size;
|
|
|
|
u64 extent_end;
|
2008-03-25 03:01:56 +08:00
|
|
|
u64 search_end = device->total_bytes;
|
|
|
|
int ret;
|
2011-01-05 18:07:26 +08:00
|
|
|
int slot;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct extent_buffer *l;
|
Btrfs: fix fitrim discarding device area reserved for boot loader's use
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).
Fix this by not allowing trim/discard of any region in the device that starts
at an offset not greater than min(alloc_start_mount_option, 1Mb), just
as was the case before 4.3.
A reproducer test case for xfstests follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
# Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
# reserved for a boot loader to use (GRUB for example) and btrfs should never
# use them - neither for allocating metadata/data nor should trim/discard them.
# The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
$XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
$XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
# Now mount the filesystem and perform a fitrim against it.
_scratch_mount
_require_batched_discard $SCRATCH_MNT
$FSTRIM_PROG $SCRATCH_MNT
# Now unmount the filesystem and verify the content of the ranges was not
# modified (no trim/discard happened on them).
_scratch_unmount
echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
status=0
exit
Reported-by: Vincent Petry <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-01-07 06:42:35 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want to overwrite the superblock on the drive nor any area
|
|
|
|
* used by the boot loader (grub for example), so we make sure to start
|
|
|
|
* at an offset of at least 1MB.
|
|
|
|
*/
|
2017-06-15 07:30:06 +08:00
|
|
|
search_start = max_t(u64, search_start, SZ_1M);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2015-02-16 18:52:17 +08:00
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
max_hole_start = search_start;
|
|
|
|
max_hole_size = 0;
|
|
|
|
|
2015-02-16 18:52:17 +08:00
|
|
|
again:
|
2017-12-04 12:54:55 +08:00
|
|
|
if (search_start >= search_end ||
|
|
|
|
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
|
2011-01-05 18:07:26 +08:00
|
|
|
ret = -ENOSPC;
|
2013-06-28 01:22:46 +08:00
|
|
|
goto out;
|
2011-01-05 18:07:26 +08:00
|
|
|
}
|
|
|
|
|
2015-11-27 23:31:35 +08:00
|
|
|
path->reada = READA_FORWARD;
|
2013-06-28 01:22:46 +08:00
|
|
|
path->search_commit_root = 1;
|
|
|
|
path->skip_locking = 1;
|
2011-01-05 18:07:26 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
key.objectid = device->devid;
|
|
|
|
key.offset = search_start;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
2011-01-05 18:07:26 +08:00
|
|
|
|
2011-12-08 15:07:24 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
2008-03-25 03:01:56 +08:00
|
|
|
if (ret < 0)
|
2011-01-05 18:07:26 +08:00
|
|
|
goto out;
|
2009-07-24 23:06:53 +08:00
|
|
|
if (ret > 0) {
|
|
|
|
ret = btrfs_previous_item(root, path, key.objectid, key.type);
|
|
|
|
if (ret < 0)
|
2011-01-05 18:07:26 +08:00
|
|
|
goto out;
|
2009-07-24 23:06:53 +08:00
|
|
|
}
|
2011-01-05 18:07:26 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
while (1) {
|
|
|
|
l = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
if (slot >= btrfs_header_nritems(l)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret == 0)
|
|
|
|
continue;
|
|
|
|
if (ret < 0)
|
2011-01-05 18:07:26 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
btrfs_item_key_to_cpu(l, &key, slot);
|
|
|
|
|
|
|
|
if (key.objectid < device->devid)
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
if (key.objectid > device->devid)
|
2011-01-05 18:07:26 +08:00
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2014-06-05 00:41:45 +08:00
|
|
|
if (key.type != BTRFS_DEV_EXTENT_KEY)
|
2011-01-05 18:07:26 +08:00
|
|
|
goto next;
|
2009-07-25 04:41:41 +08:00
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
if (key.offset > search_start) {
|
|
|
|
hole_size = key.offset - search_start;
|
2009-07-25 04:41:41 +08:00
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
/*
|
|
|
|
* Have to check before we set max_hole_start, otherwise
|
|
|
|
* we could end up sending back this offset anyway.
|
|
|
|
*/
|
2015-06-15 21:41:17 +08:00
|
|
|
if (contains_pending_extent(transaction, device,
|
2013-06-28 01:22:46 +08:00
|
|
|
&search_start,
|
2015-02-09 17:30:47 +08:00
|
|
|
hole_size)) {
|
|
|
|
if (key.offset >= search_start) {
|
|
|
|
hole_size = key.offset - search_start;
|
|
|
|
} else {
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
hole_size = 0;
|
|
|
|
}
|
|
|
|
}
|
2013-06-28 01:22:46 +08:00
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
if (hole_size > max_hole_size) {
|
|
|
|
max_hole_start = search_start;
|
|
|
|
max_hole_size = hole_size;
|
|
|
|
}
|
2009-07-25 04:41:41 +08:00
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
/*
|
|
|
|
* If this free space is at least as large as what we need,
|
|
|
|
* it must be the max free space that we have found
|
|
|
|
* until now, so max_hole_start must point to the start
|
|
|
|
* of this free space and the length of this free space
|
|
|
|
* is stored in max_hole_size. Thus, we return
|
|
|
|
* max_hole_start and max_hole_size and go back to the
|
|
|
|
* caller.
|
|
|
|
*/
|
|
|
|
if (hole_size >= num_bytes) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
|
2011-01-05 18:07:26 +08:00
|
|
|
extent_end = key.offset + btrfs_dev_extent_length(l,
|
|
|
|
dev_extent);
|
|
|
|
if (extent_end > search_start)
|
|
|
|
search_start = extent_end;
|
2008-03-25 03:01:56 +08:00
|
|
|
next:
|
|
|
|
path->slots[0]++;
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
|
Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data stored at the end
of the device, and we get the following:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-02 10:39:03 +08:00
|
|
|
/*
|
|
|
|
* At this point, search_start should be the end of
|
|
|
|
* allocated dev extents, and when shrinking the device,
|
|
|
|
* search_end may be smaller than search_start.
|
|
|
|
*/
|
2015-02-16 18:52:17 +08:00
|
|
|
if (search_end > search_start) {
|
2011-08-02 10:39:03 +08:00
|
|
|
hole_size = search_end - search_start;
|
|
|
|
|
2015-06-15 21:41:17 +08:00
|
|
|
if (contains_pending_extent(transaction, device, &search_start,
|
2015-02-16 18:52:17 +08:00
|
|
|
hole_size)) {
|
|
|
|
btrfs_release_path(path);
|
|
|
|
goto again;
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2015-02-16 18:52:17 +08:00
|
|
|
if (hole_size > max_hole_size) {
|
|
|
|
max_hole_start = search_start;
|
|
|
|
max_hole_size = hole_size;
|
|
|
|
}
|
2013-06-28 01:22:46 +08:00
|
|
|
}
|
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
/* See above. */
|
2015-02-16 18:52:17 +08:00
|
|
|
if (max_hole_size < num_bytes)
|
2011-01-05 18:07:26 +08:00
|
|
|
ret = -ENOSPC;
|
|
|
|
else
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_free_path(path);
|
2011-01-05 18:07:26 +08:00
|
|
|
*start = max_hole_start;
|
2011-01-05 18:07:28 +08:00
|
|
|
if (len)
|
2011-01-05 18:07:26 +08:00
|
|
|
*len = max_hole_size;
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
|
|
|
}
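The hole search in find_free_dev_extent_start() can be sketched in userspace as follows. This is a simplified illustration under stated assumptions: `struct dev_extent`, `find_hole()` and `MIB` are hypothetical names, a sorted array replaces the device tree walk, and the pending-extent checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Hypothetical stand-in for a dev extent item: [offset, offset + length). */
struct dev_extent {
	uint64_t offset;
	uint64_t length;
};

/*
 * Scan extents (sorted by offset) for the first hole of at least num_bytes.
 * On success returns 0 with that hole in *start / *len. Otherwise returns
 * -ENOSPC with the largest hole seen in *start / *len.
 */
static int find_hole(const struct dev_extent *extents, size_t n,
		     uint64_t dev_size, uint64_t num_bytes,
		     uint64_t search_start, uint64_t *start, uint64_t *len)
{
	uint64_t max_hole_start, max_hole_size = 0, hole_size;
	int ret = -ENOSPC;

	assert(start != NULL);

	/* Never hand out the first 1MiB (superblock and boot loader area). */
	if (search_start < MIB)
		search_start = MIB;
	max_hole_start = search_start;

	if (search_start >= dev_size)
		goto out;

	for (size_t i = 0; i < n; i++) {
		uint64_t extent_end = extents[i].offset + extents[i].length;

		if (extents[i].offset > search_start) {
			hole_size = extents[i].offset - search_start;
			if (hole_size > max_hole_size) {
				max_hole_start = search_start;
				max_hole_size = hole_size;
			}
			/* First fit: stop at the first hole that is big enough. */
			if (hole_size >= num_bytes) {
				ret = 0;
				goto out;
			}
		}
		if (extent_end > search_start)
			search_start = extent_end;
	}

	/* Hole between the last extent and the end of the device, if any. */
	if (dev_size > search_start) {
		hole_size = dev_size - search_start;
		if (hole_size > max_hole_size) {
			max_hole_start = search_start;
			max_hole_size = hole_size;
		}
	}
	ret = (max_hole_size < num_bytes) ? -ENOSPC : 0;
out:
	*start = max_hole_start;
	if (len)
		*len = max_hole_size;
	return ret;
}
```

As in the kernel function, the output parameters are filled in on failure too, so a caller that receives -ENOSPC can still see the largest hole that was found.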
|
|
|
|
|
2015-06-15 21:41:17 +08:00
|
|
|
int find_free_dev_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_device *device, u64 num_bytes,
|
|
|
|
u64 *start, u64 *len)
|
|
|
|
{
|
|
|
|
/* FIXME use last free of some kind */
|
|
|
|
return find_free_dev_extent_start(trans->transaction, device,
|
2016-01-07 06:42:35 +08:00
|
|
|
num_bytes, 0, start, len);
|
2015-06-15 21:41:17 +08:00
|
|
|
}
|
|
|
|
|
2008-12-02 22:54:17 +08:00
|
|
|
static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
|
2008-04-26 04:53:30 +08:00
|
|
|
struct btrfs_device *device,
|
2014-09-03 21:35:41 +08:00
|
|
|
u64 start, u64 *dev_extent_len)
|
2008-04-26 04:53:30 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2008-04-26 04:53:30 +08:00
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
2008-05-07 23:43:44 +08:00
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct extent_buffer *leaf = NULL;
|
|
|
|
struct btrfs_dev_extent *extent = NULL;
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = device->devid;
|
|
|
|
key.offset = start;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
2011-11-11 09:45:04 +08:00
|
|
|
again:
|
2008-04-26 04:53:30 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
2008-05-07 23:43:44 +08:00
|
|
|
if (ret > 0) {
|
|
|
|
ret = btrfs_previous_item(root, path, key.objectid,
|
|
|
|
BTRFS_DEV_EXTENT_KEY);
|
2011-05-19 15:03:42 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
2008-05-07 23:43:44 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
|
|
|
extent = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_dev_extent);
|
|
|
|
BUG_ON(found_key.offset > start || found_key.offset +
|
|
|
|
btrfs_dev_extent_length(leaf, extent) < start);
|
2011-11-11 09:45:04 +08:00
|
|
|
key = found_key;
|
|
|
|
btrfs_release_path(path);
|
|
|
|
goto again;
|
2008-05-07 23:43:44 +08:00
|
|
|
} else if (ret == 0) {
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
extent = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_dev_extent);
|
2012-03-12 23:03:00 +08:00
|
|
|
} else {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_handle_fs_error(fs_info, ret, "Slot search failed");
|
2012-03-12 23:03:00 +08:00
|
|
|
goto out;
|
2008-05-07 23:43:44 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2014-09-03 21:35:41 +08:00
|
|
|
*dev_extent_len = btrfs_dev_extent_length(leaf, extent);
|
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
ret = btrfs_del_item(trans, root, path);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_handle_fs_error(fs_info, ret,
|
|
|
|
"Failed to remove dev extent item");
|
btrfs: Fix out-of-space bug
Btrfs reports NO_SPACE after files are created and removed several times,
and the filesystem cannot be written to until it is mounted again.
Steps to reproduce:
1: Create a single-device btrfs filesystem with default options
2: Write a file to it that takes up most of the filesystem's space
3: Delete that file
4: Wait about 100s to let the chunks be removed
5: Go to step 2
A script like the following reproduces it:
#!/bin/bash
# Recommend 1.2G space, too large disk will make test slow
DEV="/dev/sda16"
MNT="/mnt/tmp"
dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))
echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"
for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
echo "mkfs $DEV"
mkfs.btrfs -f "$DEV" >/dev/null || exit 2
echo "mount $DEV $MNT"
mount "$DEV" "$MNT" || exit 2
for ((loop_i = 0; loop_i < 20; loop_i++)); do
echo
echo "loop $loop_i"
echo "dd file..."
cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
"${cmd[@]}" 2>/dev/null || {
# NO_SPACE error triggered
echo "dd failed: ${cmd[*]}"
exit 1
}
echo "rm file..."
rm -f "$MNT"/file0 || exit 2
for ((i = 0; i < 10; i++)); do
df "$MNT" | tail -1
sleep 10
done
done
Reason:
It is triggered by commit 47ab2a6c689913db23ccae38349714edf8365e0a,
which removes empty block groups automatically, but the root cause
is not in that patch. The earlier code worked well because btrfs
did not need to create and delete chunks so many times with such high
complexity.
The bug has several causes, and any one of them can trigger it.
Reason1:
When we remove some contiguous chunks but leave other chunks after them,
the freed disk space should be reused when chunks are recreated, but in
the current code only the first creation succeeds.
Fixed by Forrest Liu <forrestl@synology.com> in:
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Reason2:
contains_pending_extent() returns a wrong value from its calculation.
Fixed by Forrest Liu <forrestl@synology.com> in:
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Reason3:
btrfs_check_data_free_space() tries to commit the transaction and retry
the chunk allocation when the first allocation fails, but space_info->full
is set by the first allocation and prevents the second allocation in the retry.
Fixed in this patch by clearing space_info->full when committing the transaction.
Tested several times with the above script.
Changelog v3->v4:
Use a lightweight int instead of atomic_t to record have_remove_bgs in
the transaction, suggested by:
Josef Bacik <jbacik@fb.com>
Changelog v2->v3:
v2 fixed the bug by adding more transaction commits, but we
only need to reclaim space when we really have no space for
a new chunk, noticed by:
Filipe David Manana <fdmanana@gmail.com>
Actually, our code already has this type of commit-and-retry;
we only need to make it work with removed block groups.
v3 fixes the bug this way.
Changelog v1->v2:
v1 introduced a new bug when deleting and creating a chunk in the same disk
space in the same transaction, noticed by:
Filipe David Manana <fdmanana@gmail.com>
v2 fixes this bug by committing the transaction after removing block groups.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Suggested-by: Filipe David Manana <fdmanana@gmail.com>
Suggested-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-12 14:18:17 +08:00
|
|
|
} else {
|
2015-09-24 22:46:10 +08:00
|
|
|
set_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags);
|
2012-03-12 23:03:00 +08:00
|
|
|
}
|
2011-05-19 15:03:42 +08:00
|
|
|
out:
|
2008-04-26 04:53:30 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-04-26 04:41:01 +08:00
|
|
|
static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_device *device,
|
|
|
|
u64 chunk_offset, u64 start, u64 num_bytes)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_dev_extent *extent;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
2017-12-04 12:54:53 +08:00
|
|
|
WARN_ON(!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state));
|
2017-12-04 12:54:55 +08:00
|
|
|
WARN_ON(test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state));
|
2008-03-25 03:01:56 +08:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = device->devid;
|
2008-11-18 10:11:30 +08:00
|
|
|
key.offset = start;
|
2008-03-25 03:01:56 +08:00
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key,
|
|
|
|
sizeof(*extent));
|
2011-09-09 08:14:32 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
extent = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_dev_extent);
|
2017-08-18 22:58:23 +08:00
|
|
|
btrfs_set_dev_extent_chunk_tree(leaf, extent,
|
|
|
|
BTRFS_CHUNK_TREE_OBJECTID);
|
2017-08-18 22:58:22 +08:00
|
|
|
btrfs_set_dev_extent_chunk_objectid(leaf, extent,
|
|
|
|
BTRFS_FIRST_CHUNK_TREE_OBJECTID);
|
2008-04-16 03:41:47 +08:00
|
|
|
btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_set_dev_extent_length(leaf, extent, num_bytes);
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
2011-09-09 08:14:32 +08:00
|
|
|
out:
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
static u64 find_next_chunk(struct btrfs_fs_info *fs_info)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2013-06-28 01:22:46 +08:00
|
|
|
struct extent_map_tree *em_tree;
|
|
|
|
struct extent_map *em;
|
|
|
|
struct rb_node *n;
|
|
|
|
u64 ret = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
em_tree = &fs_info->mapping_tree.map_tree;
|
|
|
|
read_lock(&em_tree->lock);
|
2018-08-23 03:51:52 +08:00
|
|
|
n = rb_last(&em_tree->map.rb_root);
|
2013-06-28 01:22:46 +08:00
|
|
|
if (n) {
|
|
|
|
em = rb_entry(n, struct extent_map, rb_node);
|
|
|
|
ret = em->start + em->len;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
2013-06-28 01:22:46 +08:00
|
|
|
read_unlock(&em_tree->lock);
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-08-12 19:33:01 +08:00
|
|
|
static noinline int find_next_devid(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 *devid_ret)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
|
2013-08-12 19:33:01 +08:00
|
|
|
ret = btrfs_search_slot(NULL, fs_info->chunk_root, &key, path, 0, 0);
|
2008-03-25 03:01:56 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret == 0); /* Corruption */
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-08-12 19:33:01 +08:00
|
|
|
ret = btrfs_previous_item(fs_info->chunk_root, path,
|
|
|
|
BTRFS_DEV_ITEMS_OBJECTID,
|
2008-03-25 03:01:56 +08:00
|
|
|
BTRFS_DEV_ITEM_KEY);
|
|
|
|
if (ret) {
|
2013-08-12 19:33:01 +08:00
|
|
|
*devid_ret = 1;
|
2008-03-25 03:01:56 +08:00
|
|
|
} else {
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &found_key,
|
|
|
|
path->slots[0]);
|
2013-08-12 19:33:01 +08:00
|
|
|
*devid_ret = found_key.offset + 1;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
error:
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_free_path(path);
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
|
|
|
}
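The devid selection in find_next_devid() reduces to "highest existing devid plus one, or 1 if there are no device items". A minimal userspace sketch, assuming a sorted array in place of the chunk tree lookup (`next_devid()` is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * devids must be sorted in ascending order, mirroring how dev items are
 * keyed by devid in the chunk tree. Returns the next devid to assign.
 */
static uint64_t next_devid(const uint64_t *devids, size_t n)
{
	assert(devids != NULL || n == 0);

	if (n == 0)
		return 1;		/* no device items yet */
	return devids[n - 1] + 1;	/* one past the highest devid in use */
}
```

Because only the last item is consulted, gaps left by removed devices are not reused in this scheme.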
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the device information is stored in the chunk root
|
|
|
|
* the btrfs_device struct should be fully filled in
|
|
|
|
*/
|
2017-11-06 16:36:15 +08:00
|
|
|
static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
|
2013-04-26 04:41:01 +08:00
|
|
|
struct btrfs_device *device)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_dev_item *dev_item;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
unsigned long ptr;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
2008-11-18 10:11:30 +08:00
|
|
|
key.offset = device->devid;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2018-07-21 00:37:47 +08:00
|
|
|
ret = btrfs_insert_empty_item(trans, trans->fs_info->chunk_root, path,
|
|
|
|
&key, sizeof(*dev_item));
|
2008-03-25 03:01:56 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
|
|
|
|
|
|
|
|
btrfs_set_device_id(leaf, dev_item, device->devid);
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_set_device_generation(leaf, dev_item, 0);
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_set_device_type(leaf, dev_item, device->type);
|
|
|
|
btrfs_set_device_io_align(leaf, dev_item, device->io_align);
|
|
|
|
btrfs_set_device_io_width(leaf, dev_item, device->io_width);
|
|
|
|
btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_set_device_total_bytes(leaf, dev_item,
|
|
|
|
btrfs_device_get_disk_total_bytes(device));
|
|
|
|
btrfs_set_device_bytes_used(leaf, dev_item,
|
|
|
|
btrfs_device_get_bytes_used(device));
|
2008-04-16 03:41:47 +08:00
|
|
|
btrfs_set_device_group(leaf, dev_item, 0);
|
|
|
|
btrfs_set_device_seek_speed(leaf, dev_item, 0);
|
|
|
|
btrfs_set_device_bandwidth(leaf, dev_item, 0);
|
2008-12-09 05:40:21 +08:00
|
|
|
btrfs_set_device_start_offset(leaf, dev_item, 0);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-08-20 19:20:11 +08:00
|
|
|
ptr = btrfs_device_uuid(dev_item);
|
2008-04-16 03:41:47 +08:00
|
|
|
write_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
|
2013-08-20 19:20:12 +08:00
|
|
|
ptr = btrfs_device_fsid(dev_item);
|
2018-07-21 00:37:47 +08:00
|
|
|
write_extent_buffer(leaf, trans->fs_info->fsid, ptr, BTRFS_FSID_SIZE);
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2014-04-16 17:02:32 +08:00
|
|
|
/*
|
|
|
|
* Function to update ctime/mtime for a given device path.
|
|
|
|
* Mainly used by ctime/mtime-based probes such as libblkid.
|
|
|
|
*/
|
2017-02-15 00:55:53 +08:00
|
|
|
static void update_dev_time(const char *path_name)
|
2014-04-16 17:02:32 +08:00
|
|
|
{
|
|
|
|
struct file *filp;
|
|
|
|
|
|
|
|
filp = filp_open(path_name, O_RDWR, 0);
|
2014-12-14 15:59:17 +08:00
|
|
|
if (IS_ERR(filp))
|
2014-04-16 17:02:32 +08:00
|
|
|
return;
|
|
|
|
file_update_time(filp);
|
|
|
|
filp_close(filp, NULL);
|
|
|
|
}
|
|
|
|
|
2016-06-21 22:40:19 +08:00
|
|
|
static int btrfs_rm_dev_item(struct btrfs_fs_info *fs_info,
|
2008-05-07 23:43:44 +08:00
|
|
|
struct btrfs_device *device)
|
|
|
|
{
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2008-05-07 23:43:44 +08:00
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
2011-01-20 14:19:37 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
}
|
2008-05-07 23:43:44 +08:00
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
|
|
|
key.offset = device->devid;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
2017-10-23 14:58:46 +08:00
|
|
|
if (ret) {
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
btrfs_end_transaction(trans);
|
2008-05-07 23:43:44 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
2017-10-23 14:58:46 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
}
|
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
2017-10-23 14:58:46 +08:00
|
|
|
if (!ret)
|
|
|
|
ret = btrfs_commit_transaction(trans);
|
2008-05-07 23:43:44 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-02-15 23:00:26 +08:00
|
|
|
/*
|
|
|
|
* Verify that @num_devices satisfies the RAID profile constraints in the whole
|
|
|
|
* filesystem. It's up to the caller to adjust that number, e.g. for device
|
|
|
|
* replace.
|
|
|
|
*/
|
|
|
|
static int btrfs_check_raid_min_devices(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 num_devices)
|
2008-05-07 23:43:44 +08:00
|
|
|
{
|
|
|
|
u64 all_avail;
|
2013-01-29 18:13:12 +08:00
|
|
|
unsigned seq;
|
2016-02-15 23:28:14 +08:00
|
|
|
int i;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2013-01-29 18:13:12 +08:00
|
|
|
do {
|
2016-02-13 10:01:34 +08:00
|
|
|
seq = read_seqbegin(&fs_info->profiles_lock);
|
2013-01-29 18:13:12 +08:00
|
|
|
|
2016-02-13 10:01:34 +08:00
|
|
|
all_avail = fs_info->avail_data_alloc_bits |
|
|
|
|
fs_info->avail_system_alloc_bits |
|
|
|
|
fs_info->avail_metadata_alloc_bits;
|
|
|
|
} while (read_seqretry(&fs_info->profiles_lock, seq));
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2016-02-15 23:28:14 +08:00
|
|
|
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
|
2018-04-25 19:01:43 +08:00
|
|
|
if (!(all_avail & btrfs_raid_array[i].bg_flag))
|
2016-02-15 23:28:14 +08:00
|
|
|
continue;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2016-02-15 23:28:14 +08:00
|
|
|
if (num_devices < btrfs_raid_array[i].devs_min) {
|
2018-04-25 19:01:44 +08:00
|
|
|
int ret = btrfs_raid_array[i].mindev_error;
|
2016-02-13 10:01:34 +08:00
|
|
|
|
2016-02-15 23:28:14 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
2013-01-30 07:40:14 +08:00
|
|
|
}
|
|
|
|
|
2016-02-13 10:01:34 +08:00
|
|
|
return 0;
|
2016-02-13 10:01:33 +08:00
|
|
|
}
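The table-driven check in btrfs_check_raid_min_devices() can be tried out in userspace. The following is a minimal sketch, not the kernel implementation: the PROFILE_* bits, the ERR_* codes, the raid_table contents and the name check_min_devices() are all hypothetical stand-ins for the BTRFS_BLOCK_GROUP_* flags, the BTRFS_ERROR_DEV_* codes and btrfs_raid_array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical profile bits; the real BTRFS_BLOCK_GROUP_* values differ. */
#define PROFILE_RAID0  (1ULL << 0)
#define PROFILE_RAID1  (1ULL << 1)
#define PROFILE_RAID10 (1ULL << 2)

/* Hypothetical error codes standing in for the BTRFS_ERROR_DEV_* values. */
enum { ERR_NONE = 0, ERR_RAID1_MIN = 1, ERR_RAID10_MIN = 2 };

struct raid_attr {
	uint64_t bg_flag;	/* profile bit this row describes */
	uint64_t devs_min;	/* minimum number of devices for the profile */
	int mindev_error;	/* error to report; 0 means none is defined */
};

/* Illustrative table; the values are assumptions made for this sketch. */
const struct raid_attr raid_table[] = {
	{ PROFILE_RAID0,  2, ERR_NONE },
	{ PROFILE_RAID1,  2, ERR_RAID1_MIN },
	{ PROFILE_RAID10, 4, ERR_RAID10_MIN },
};

/*
 * Walk every profile that is in use (bit set in all_avail) and report the
 * first defined error for a profile whose minimum is not met.
 */
int check_min_devices(uint64_t all_avail, uint64_t num_devices)
{
	size_t i;

	for (i = 0; i < sizeof(raid_table) / sizeof(raid_table[0]); i++) {
		if (!(all_avail & raid_table[i].bg_flag))
			continue;

		if (num_devices < raid_table[i].devs_min) {
			int ret = raid_table[i].mindev_error;

			if (ret)
				return ret;
		}
	}

	return 0;
}
```

As in the kernel loop, a profile whose table row has no error code never fails the check, even when the device count is below its minimum.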
|
|
|
|
|
2017-08-23 14:46:04 +08:00
|
|
|
static struct btrfs_device *btrfs_find_next_active_device(
|
|
|
|
struct btrfs_fs_devices *fs_devs, struct btrfs_device *device)
|
2008-05-07 23:43:44 +08:00
|
|
|
{
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_device *next_device;
|
2016-05-03 17:44:43 +08:00
|
|
|
|
|
|
|
list_for_each_entry(next_device, &fs_devs->devices, dev_list) {
|
|
|
|
if (next_device != device &&
|
2017-12-04 12:54:54 +08:00
|
|
|
!test_bit(BTRFS_DEV_STATE_MISSING, &next_device->dev_state)
|
|
|
|
&& next_device->bdev)
|
2016-05-03 17:44:43 +08:00
|
|
|
return next_device;
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
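The selection rule used by btrfs_find_next_active_device() (skip the device being removed, skip missing devices, require an open block device) can be illustrated outside the kernel. This is a minimal sketch under stated assumptions: struct dev, its boolean fields and find_next_active() are hypothetical, and a plain array replaces the kernel's list_for_each_entry() walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a device entry. */
struct dev {
	int id;
	bool missing;	/* models BTRFS_DEV_STATE_MISSING being set */
	bool has_bdev;	/* models a non-NULL ->bdev */
};

/*
 * Return the first device that is not 'skip', is not missing and has an
 * open block device, or NULL when no such device exists.
 */
const struct dev *find_next_active(const struct dev *devs, size_t n,
				   const struct dev *skip)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct dev *d = &devs[i];

		if (d != skip && !d->missing && d->has_bdev)
			return d;
	}

	return NULL;
}
```

A NULL result corresponds to the case that the caller in the original code rules out with ASSERT(next_device).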
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper function to check if the given device is part of s_bdev / latest_bdev
|
|
|
|
* and replace it with the provided or the next active device, in the context
|
|
|
|
* where this function is called, there should always be another device (or
|
|
|
|
* this_dev) which is active.
|
|
|
|
*/
|
2018-07-21 00:37:50 +08:00
|
|
|
void btrfs_assign_next_active_device(struct btrfs_device *device,
|
|
|
|
struct btrfs_device *this_dev)
|
2016-05-03 17:44:43 +08:00
|
|
|
{
|
2018-07-21 00:37:50 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
2016-05-03 17:44:43 +08:00
|
|
|
struct btrfs_device *next_device;
|
|
|
|
|
|
|
|
if (this_dev)
|
|
|
|
next_device = this_dev;
|
|
|
|
else
|
|
|
|
next_device = btrfs_find_next_active_device(fs_info->fs_devices,
|
|
|
|
device);
|
|
|
|
ASSERT(next_device);
|
|
|
|
|
|
|
|
if (fs_info->sb->s_bdev &&
|
|
|
|
(fs_info->sb->s_bdev == device->bdev))
|
|
|
|
fs_info->sb->s_bdev = next_device->bdev;
|
|
|
|
|
|
|
|
if (fs_info->fs_devices->latest_bdev == device->bdev)
|
|
|
|
fs_info->fs_devices->latest_bdev = next_device->bdev;
|
|
|
|
}
|
|
|
|
|
2018-08-10 13:53:21 +08:00
|
|
|
/*
|
|
|
|
* Return btrfs_fs_devices::num_devices excluding the device that is
|
|
|
|
* currently being replaced.
|
|
|
|
*/
|
|
|
|
static u64 btrfs_num_devices(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
u64 num_devices = fs_info->fs_devices->num_devices;
|
|
|
|
|
|
|
|
btrfs_dev_replace_read_lock(&fs_info->dev_replace);
|
|
|
|
if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
|
|
|
|
ASSERT(num_devices > 1);
|
|
|
|
num_devices--;
|
|
|
|
}
|
|
|
|
btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
|
|
|
|
|
|
|
|
return num_devices;
|
|
|
|
}
|
|
|
|
|
2017-02-15 00:55:53 +08:00
|
|
|
int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
|
|
|
|
u64 devid)
|
2016-02-13 10:01:33 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
2011-04-20 18:09:16 +08:00
|
|
|
struct btrfs_fs_devices *cur_devices;
|
2018-04-12 10:29:30 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
u64 num_devices;
|
2008-05-07 23:43:44 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
mutex_lock(&uuid_mutex);
|
|
|
|
|
2018-08-10 13:53:21 +08:00
|
|
|
num_devices = btrfs_num_devices(fs_info);
|
2012-11-06 20:15:27 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
|
2016-02-13 10:01:33 +08:00
|
|
|
if (ret)
|
2008-05-07 23:43:44 +08:00
|
|
|
goto out;
|
|
|
|
|
2018-09-03 17:46:14 +08:00
|
|
|
device = btrfs_find_device_by_devspec(fs_info, devid, device_path);
|
|
|
|
|
|
|
|
if (IS_ERR(device)) {
|
|
|
|
if (PTR_ERR(device) == -ENOENT &&
|
|
|
|
strcmp(device_path, "missing") == 0)
|
|
|
|
ret = BTRFS_ERROR_DEV_MISSING_NOT_FOUND;
|
|
|
|
else
|
|
|
|
ret = PTR_ERR(device);
|
2013-01-30 07:40:14 +08:00
|
|
|
goto out;
|
2018-09-03 17:46:14 +08:00
|
|
|
}
|
2008-05-14 01:46:40 +08:00
|
|
|
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
if (btrfs_pinned_by_swapfile(fs_info, device)) {
|
|
|
|
btrfs_warn_in_rcu(fs_info,
|
|
|
|
"cannot remove device %s (devid %llu) due to active swapfile",
|
|
|
|
rcu_str_deref(device->name), device->devid);
|
|
|
|
ret = -ETXTBSY;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2017-12-04 12:54:55 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
|
2013-05-17 18:52:45 +08:00
|
|
|
ret = BTRFS_ERROR_DEV_TGT_REPLACE;
|
2016-02-13 10:01:36 +08:00
|
|
|
goto out;
|
2012-11-06 01:29:28 +08:00
|
|
|
}
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
|
|
|
|
fs_info->fs_devices->rw_devices == 1) {
|
2013-05-17 18:52:45 +08:00
|
|
|
ret = BTRFS_ERROR_DEV_ONLY_WRITABLE;
|
2016-02-13 10:01:36 +08:00
|
|
|
goto out;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
list_del_init(&device->dev_alloc_list);
|
2014-09-03 21:35:47 +08:00
|
|
|
device->fs_devices->rw_devices--;
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2013-03-05 06:37:06 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
2008-05-07 23:43:44 +08:00
|
|
|
ret = btrfs_shrink_device(device, 0);
|
2013-03-05 06:37:06 +08:00
|
|
|
mutex_lock(&uuid_mutex);
|
2008-05-07 23:43:44 +08:00
|
|
|
if (ret)
|
2011-02-16 02:14:25 +08:00
|
|
|
goto error_undo;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2012-11-06 01:29:28 +08:00
|
|
|
/*
|
|
|
|
* TODO: the superblock still includes this device in its num_devices
|
|
|
|
* counter although write_all_supers() is not locked out. This
|
|
|
|
* could give a filesystem state which requires a degraded mount.
|
|
|
|
*/
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_rm_dev_item(fs_info, device);
|
2008-05-07 23:43:44 +08:00
|
|
|
if (ret)
|
2011-02-16 02:14:25 +08:00
|
|
|
goto error_undo;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2017-12-04 12:54:53 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_scrub_cancel_dev(fs_info, device);
|
2009-06-11 03:17:02 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* the device list mutex makes sure that we don't change
|
|
|
|
* the device list while someone else is writing out all
|
Btrfs: fix race between removing a dev and writing sbs
This change fixes an issue when removing a device and writing
all super blocks run simultaneously. Here are the steps necessary
for the issue to happen:
1) disk-io.c:write_all_supers() gets a number of N devices from the
super_copy, so it will not panic if it fails to write super blocks
for N - 1 devices;
2) Then it tries to acquire the device_list_mutex, but blocks because
volumes.c:btrfs_rm_device() got it first;
3) btrfs_rm_device() removes the device from the list, then unlocks the
mutex and after the unlock it updates the number of devices in
super_copy to N - 1.
4) write_all_supers() finally acquires the mutex, iterates over all the
devices in the list and gets N - 1 errors, that is, it failed to write
super blocks to all the devices;
5) Because write_all_supers() thinks there are a total of N devices, it
considers N - 1 errors to be ok, and therefore won't panic.
So this change just makes sure that write_all_supers() reads the number
of devices from super_copy after it acquires the device_list_mutex.
Conversely, it changes btrfs_rm_device() to update the number of devices
in super_copy before it releases the device list mutex.
The code path to add a new device (volumes.c:btrfs_init_new_device),
already has the right behaviour: it updates the number of devices in
super_copy while holding the device_list_mutex.
The only code path that doesn't lock the device list mutex
before updating the number of devices in the super copy is
disk-io.c:next_root_backup(), called by open_ctree() during
mount time where concurrency issues can't happen.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 22:41:36 +08:00
|
|
|
* the device supers. Whoever is writing all supers, should
|
|
|
|
* lock the device list mutex before getting the number of
|
|
|
|
* devices in the super block (super_copy). Conversely,
|
|
|
|
* whoever updates the number of devices in the super block
|
|
|
|
* (super_copy) should hold the device list mutex.
|
2009-06-11 03:17:02 +08:00
|
|
|
*/
|
2011-04-20 18:09:16 +08:00
|
|
|
|
2018-04-12 10:29:31 +08:00
|
|
|
/*
|
|
|
|
* In normal cases the cur_devices == fs_devices. But in case
|
|
|
|
* of deleting a seed device, the cur_devices should point to
|
|
|
|
* its own fs_devices listed under the fs_devices->seed.
|
|
|
|
*/
|
2011-04-20 18:09:16 +08:00
|
|
|
cur_devices = device->fs_devices;
|
2018-04-12 10:29:30 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2011-04-20 18:09:16 +08:00
|
|
|
list_del_rcu(&device->dev_list);
|
2009-06-11 03:17:02 +08:00
|
|
|
|
2018-04-12 10:29:31 +08:00
|
|
|
cur_devices->num_devices--;
|
|
|
|
cur_devices->total_devices--;
|
2018-07-03 17:07:23 +08:00
|
|
|
/* Update total_devices of the parent fs_devices if it's seed */
|
|
|
|
if (cur_devices != fs_devices)
|
|
|
|
fs_devices->total_devices--;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
|
2018-04-12 10:29:31 +08:00
|
|
|
cur_devices->missing_devices--;
|
2010-12-14 03:56:23 +08:00
|
|
|
|
2018-07-21 00:37:50 +08:00
|
|
|
btrfs_assign_next_active_device(device, NULL);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2014-07-08 01:34:49 +08:00
|
|
|
if (device->bdev) {
|
2018-04-12 10:29:31 +08:00
|
|
|
cur_devices->open_devices--;
|
2014-07-08 01:34:49 +08:00
|
|
|
/* remove sysfs entry */
|
2018-04-12 10:29:30 +08:00
|
|
|
btrfs_sysfs_rm_device_link(fs_devices, device);
|
2014-07-08 01:34:49 +08:00
|
|
|
}
|
2014-06-03 11:36:00 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
num_devices = btrfs_super_num_devices(fs_info->super_copy) - 1;
|
|
|
|
btrfs_set_super_num_devices(fs_info->super_copy, num_devices);
|
2018-04-12 10:29:30 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2016-09-20 20:50:21 +08:00
|
|
|
/*
|
|
|
|
* at this point, the device is zero sized and detached from
|
|
|
|
* the devices list. All that's left is to zero out the old
|
|
|
|
* supers and free the device.
|
|
|
|
*/
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
|
2016-09-20 20:50:21 +08:00
|
|
|
btrfs_scratch_superblocks(device->bdev, device->name->str);
|
|
|
|
|
|
|
|
btrfs_close_bdev(device);
|
2017-06-06 23:08:23 +08:00
|
|
|
call_rcu(&device->rcu, free_device_rcu);
|
2016-09-20 20:50:21 +08:00
|
|
|
|
2011-04-20 18:09:16 +08:00
|
|
|
if (cur_devices->open_devices == 0) {
|
2008-12-12 23:03:26 +08:00
|
|
|
while (fs_devices) {
|
2014-05-23 04:43:43 +08:00
|
|
|
if (fs_devices->seed == cur_devices) {
|
|
|
|
fs_devices->seed = cur_devices->seed;
|
2008-12-12 23:03:26 +08:00
|
|
|
break;
|
2014-05-23 04:43:43 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
fs_devices = fs_devices->seed;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2011-04-20 18:09:16 +08:00
|
|
|
cur_devices->seed = NULL;
|
2018-04-12 10:29:27 +08:00
|
|
|
close_fs_devices(cur_devices);
|
2011-04-20 18:09:16 +08:00
|
|
|
free_fs_devices(cur_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
out:
|
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
return ret;
|
2016-02-13 10:01:36 +08:00
|
|
|
|
2011-02-16 02:14:25 +08:00
|
|
|
error_undo:
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2011-02-16 02:14:25 +08:00
|
|
|
list_add(&device->dev_alloc_list,
|
2018-04-12 10:29:30 +08:00
|
|
|
&fs_devices->alloc_list);
|
2014-09-03 21:35:47 +08:00
|
|
|
device->fs_devices->rw_devices++;
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2011-02-16 02:14:25 +08:00
|
|
|
}
|
2016-02-13 10:01:36 +08:00
|
|
|
goto out;
|
2008-05-07 23:43:44 +08:00
|
|
|
}
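The locking rule stated in the comment inside btrfs_rm_device() (whoever reads or updates the device count in super_copy must hold the device list mutex) can be sketched with a userspace mutex. This is only an illustration under assumptions: struct dev_table, table_init(), table_remove() and table_snapshot_consistent() are hypothetical names, a pthread mutex stands in for device_list_mutex, and a fixed array stands in for the device list.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_DEVS 8

/* Hypothetical model: a device "list" plus a separately recorded count. */
struct dev_table {
	pthread_mutex_t list_mutex;	/* stands in for device_list_mutex */
	int present[MAX_DEVS];		/* 1 if the slot holds a device */
	int recorded_count;		/* stands in for num_devices in super_copy */
};

void table_init(struct dev_table *t, int ndevs)
{
	int i;

	pthread_mutex_init(&t->list_mutex, NULL);
	for (i = 0; i < MAX_DEVS; i++)
		t->present[i] = i < ndevs;
	t->recorded_count = ndevs;
}

/*
 * Remover side: unlink the device and update the recorded count inside the
 * same critical section, so no reader can see one without the other.
 */
void table_remove(struct dev_table *t, int slot)
{
	pthread_mutex_lock(&t->list_mutex);
	if (t->present[slot]) {
		t->present[slot] = 0;
		t->recorded_count--;
	}
	pthread_mutex_unlock(&t->list_mutex);
}

/*
 * Writer side: take the lock first, then read the count and walk the list.
 * Returns 1 when the walked length matches the recorded count.
 */
int table_snapshot_consistent(struct dev_table *t)
{
	int i, walked = 0, expected;

	pthread_mutex_lock(&t->list_mutex);
	expected = t->recorded_count;
	for (i = 0; i < MAX_DEVS; i++)
		walked += t->present[i];
	pthread_mutex_unlock(&t->list_mutex);

	return walked == expected;
}
```

Reading recorded_count before taking the lock would reintroduce the window described in the commit message: the count could be sampled as N while the list walked afterwards holds only N - 1 entries.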
|
|
|
|
|
2018-07-21 00:37:48 +08:00
|
|
|
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev)
|
2012-11-06 00:33:06 +08:00
|
|
|
{
|
2014-08-13 14:24:19 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
|
2018-07-21 00:37:48 +08:00
|
|
|
lockdep_assert_held(&srcdev->fs_info->fs_devices->device_list_mutex);
|
2013-10-03 01:41:01 +08:00
|
|
|
|
2014-08-20 10:56:56 +08:00
|
|
|
/*
|
|
|
|
* in case of fs with no seed, srcdev->fs_devices will point
|
|
|
|
* to fs_devices of fs_info. However when the dev being replaced is
|
|
|
|
* a seed dev it will point to the seed's local fs_devices. In short
|
|
|
|
* srcdev will have its correct fs_devices in both the cases.
|
|
|
|
*/
|
|
|
|
fs_devices = srcdev->fs_devices;
|
2014-08-13 14:24:19 +08:00
|
|
|
|
2012-11-06 00:33:06 +08:00
|
|
|
list_del_rcu(&srcdev->dev_list);
|
2017-06-19 20:14:22 +08:00
|
|
|
list_del(&srcdev->dev_alloc_list);
|
2014-08-13 14:24:19 +08:00
|
|
|
fs_devices->num_devices--;
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state))
|
2014-08-13 14:24:19 +08:00
|
|
|
fs_devices->missing_devices--;
|
2012-11-06 00:33:06 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state))
|
2014-09-03 21:35:44 +08:00
|
|
|
fs_devices->rw_devices--;
|
2013-10-03 01:41:01 +08:00
|
|
|
|
2014-09-03 21:35:44 +08:00
|
|
|
if (srcdev->bdev)
|
2014-08-13 14:24:19 +08:00
|
|
|
fs_devices->open_devices--;
|
2014-10-30 16:52:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_rm_dev_replace_free_srcdev(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_device *srcdev)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = srcdev->fs_devices;
|
2012-11-06 00:33:06 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state)) {
|
2016-04-12 21:36:16 +08:00
|
|
|
/* zero out the old super if it is writable */
|
|
|
|
btrfs_scratch_superblocks(srcdev->bdev, srcdev->name->str);
|
|
|
|
}
|
2016-07-22 06:04:53 +08:00
|
|
|
|
|
|
|
btrfs_close_bdev(srcdev);
|
2017-06-06 23:08:23 +08:00
|
|
|
call_rcu(&srcdev->rcu, free_device_rcu);
|
2014-08-13 14:24:22 +08:00
|
|
|
|
|
|
|
/* if there are no devices left, delete the fs_devices */
|
|
|
|
if (!fs_devices->num_devices) {
|
|
|
|
struct btrfs_fs_devices *tmp_fs_devices;
|
|
|
|
|
2017-10-17 06:53:50 +08:00
|
|
|
/*
|
|
|
|
* On a mounted FS, num_devices can't be zero unless it's a
|
|
|
|
* seed. In case of a seed device being replaced, the replace
|
|
|
|
* target is added to the sprout FS, so there will be no more
|
|
|
|
* devices left under the seed FS.
|
|
|
|
*/
|
|
|
|
ASSERT(fs_devices->seeding);
|
|
|
|
|
2014-08-13 14:24:22 +08:00
|
|
|
tmp_fs_devices = fs_info->fs_devices;
|
|
|
|
while (tmp_fs_devices) {
|
|
|
|
if (tmp_fs_devices->seed == fs_devices) {
|
|
|
|
tmp_fs_devices->seed = fs_devices->seed;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
tmp_fs_devices = tmp_fs_devices->seed;
|
|
|
|
}
|
|
|
|
fs_devices->seed = NULL;
|
2018-04-12 10:29:27 +08:00
|
|
|
close_fs_devices(fs_devices);
|
2014-08-13 14:24:23 +08:00
|
|
|
free_fs_devices(fs_devices);
|
2014-08-13 14:24:22 +08:00
|
|
|
}
|
2012-11-06 00:33:06 +08:00
|
|
|
}
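Both btrfs_rm_device() and btrfs_rm_dev_replace_free_srcdev() unlink an emptied fs_devices from the singly linked ->seed chain with the same loop. The following is a minimal sketch of that unlink step, assuming a hypothetical struct fs_devs and helper unlink_seed(); unlike the kernel code, it reports whether the victim was found and clears the victim's link only in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct btrfs_fs_devices. */
struct fs_devs {
	int id;
	struct fs_devs *seed;	/* next element in the seed chain */
};

/*
 * Find the element whose ->seed points at 'victim' and splice the victim
 * out of the chain. Returns 1 if the victim was unlinked, 0 otherwise.
 */
int unlink_seed(struct fs_devs *head, struct fs_devs *victim)
{
	struct fs_devs *cur = head;

	while (cur) {
		if (cur->seed == victim) {
			cur->seed = victim->seed;
			victim->seed = NULL;
			return 1;
		}
		cur = cur->seed;
	}

	return 0;
}
```

The walk starts at the mounted filesystem's own entry, so the head itself is never a candidate for removal; only entries reachable through ->seed can be spliced out.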
|
|
|
|
|
2018-07-21 00:37:51 +08:00
|
|
|
void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev)
|
2012-11-06 00:33:06 +08:00
|
|
|
{
|
2018-07-21 00:37:51 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = tgtdev->fs_info->fs_devices;
|
2018-04-12 10:29:38 +08:00
|
|
|
|
2012-11-06 00:33:06 +08:00
|
|
|
WARN_ON(!tgtdev);
|
2018-04-12 10:29:38 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2015-03-10 06:38:42 +08:00
|
|
|
|
2018-04-12 10:29:38 +08:00
|
|
|
btrfs_sysfs_rm_device_link(fs_devices, tgtdev);
|
2015-03-10 06:38:42 +08:00
|
|
|
|
btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device is taken out of the
fs device list, scratched (scratch + update_dev_time) and freed. However,
the scratch + update_dev_time and free steps can be done after the
device has been taken out of the device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks.
Reported issue:
[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861] (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907] (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940] [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945] [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948] [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951] [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955] [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959] [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962] [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974] [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977] [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979] [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982] [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985] [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001] [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007] [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010] [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233] sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234] Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234] CPU0 CPU1
[ 5375.719235] ---- ----
[ 5375.719236] lock(&fs_devs->device_list_mutex);
[ 5375.719238] lock(namespace_sem);
[ 5375.719239] lock(&fs_devs->device_list_mutex);
[ 5375.719241] lock(sb_writers);
[ 5375.719241]
[ 5375.719241] *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-18 16:51:23 +08:00
|
|
|
if (tgtdev->bdev)
|
2018-04-12 10:29:38 +08:00
|
|
|
fs_devices->open_devices--;
|
[ 5375.719241]
[ 5375.719241] *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-18 16:51:23 +08:00
|
|
|
|
2018-04-12 10:29:38 +08:00
|
|
|
fs_devices->num_devices--;
|
2012-11-06 00:33:06 +08:00
|
|
|
|
2018-07-21 00:37:50 +08:00
|
|
|
btrfs_assign_next_active_device(tgtdev, NULL);
|
2012-11-06 00:33:06 +08:00
|
|
|
|
|
|
|
list_del_rcu(&tgtdev->dev_list);
|
|
|
|
|
2018-04-12 10:29:38 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2016-04-18 16:51:23 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The update_dev_time() within btrfs_scratch_superblocks()
|
|
|
|
* may lead to a call to btrfs_show_devname() which will try
|
|
|
|
* to hold device_list_mutex. And here this device
|
|
|
|
* is already out of device list, so we don't have to hold
|
|
|
|
* the device_list_mutex lock.
|
|
|
|
*/
|
|
|
|
btrfs_scratch_superblocks(tgtdev->bdev, tgtdev->name->str);
|
2016-07-22 06:04:53 +08:00
|
|
|
|
|
|
|
btrfs_close_bdev(tgtdev);
|
2017-06-06 23:08:23 +08:00
|
|
|
call_rcu(&tgtdev->rcu, free_device_rcu);
|
2012-11-06 00:33:06 +08:00
|
|
|
}
|
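The teardown path above first unlinks the target device while holding device_list_mutex and only afterwards scratches its superblocks, so the slow step never runs under the list lock. A minimal userspace sketch of that ordering follows. It is an illustration only: struct dev, struct dev_list, dev_list_unlink(), dev_scratch() and dev_remove() are hypothetical names, and a pthread mutex stands in for the kernel mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the btrfs device list; not kernel types. */
struct dev {
	struct dev *next;
	int scratched;		/* set once the slow cleanup has run */
};

struct dev_list {
	pthread_mutex_t lock;	/* plays the role of device_list_mutex */
	struct dev *head;
	int num_devices;
};

/* Unlink @victim while holding the lock. Returns 1 if it was on the list. */
static int dev_list_unlink(struct dev_list *list, struct dev *victim)
{
	struct dev **pp;
	int found = 0;

	pthread_mutex_lock(&list->lock);
	for (pp = &list->head; *pp; pp = &(*pp)->next) {
		if (*pp == victim) {
			*pp = victim->next;
			victim->next = NULL;
			list->num_devices--;
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&list->lock);
	return found;
}

/* Slow cleanup that may take unrelated locks; called with the lock released. */
static void dev_scratch(struct dev *d)
{
	d->scratched = 1;
}

/* Returns 0 on success, -1 if @victim was not on the list. */
static int dev_remove(struct dev_list *list, struct dev *victim)
{
	if (!dev_list_unlink(list, victim))
		return -1;
	/* The device is no longer reachable through the list: no lock needed. */
	dev_scratch(victim);
	return 0;
}
```

Because dev_scratch() runs only after the unlock, any lock it takes internally cannot be ordered after the list lock, which is the kind of dependency the lockdep report complains about.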
|
|
|
|
2018-09-03 17:46:12 +08:00
|
|
|
static struct btrfs_device *btrfs_find_device_by_path(
|
|
|
|
struct btrfs_fs_info *fs_info, const char *device_path)
|
2012-11-05 21:42:30 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_super_block *disk_super;
|
|
|
|
u64 devid;
|
|
|
|
u8 *dev_uuid;
|
|
|
|
struct block_device *bdev;
|
|
|
|
struct buffer_head *bh;
|
2018-09-03 17:46:12 +08:00
|
|
|
struct btrfs_device *device;
|
2012-11-05 21:42:30 +08:00
|
|
|
|
|
|
|
ret = btrfs_get_bdev_and_sb(device_path, FMODE_READ,
|
2016-06-23 06:54:23 +08:00
|
|
|
fs_info->bdev_holder, 0, &bdev, &bh);
|
2012-11-05 21:42:30 +08:00
|
|
|
if (ret)
|
2018-09-03 17:46:12 +08:00
|
|
|
return ERR_PTR(ret);
|
2012-11-05 21:42:30 +08:00
|
|
|
disk_super = (struct btrfs_super_block *)bh->b_data;
|
|
|
|
devid = btrfs_stack_device_id(&disk_super->dev_item);
|
|
|
|
dev_uuid = disk_super->dev_item.uuid;
|
2018-09-03 17:46:12 +08:00
|
|
|
device = btrfs_find_device(fs_info, devid, dev_uuid, disk_super->fsid);
|
2012-11-05 21:42:30 +08:00
|
|
|
brelse(bh);
|
2018-09-03 17:46:12 +08:00
|
|
|
if (!device)
|
|
|
|
device = ERR_PTR(-ENOENT);
|
2012-11-05 21:42:30 +08:00
|
|
|
blkdev_put(bdev, FMODE_READ);
|
2018-09-03 17:46:12 +08:00
|
|
|
return device;
|
2012-11-05 21:42:30 +08:00
|
|
|
}
|
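btrfs_find_device_by_path() reports failure by encoding a negative errno in the returned pointer instead of returning NULL. The sketch below shows how that convention can work in userspace. It is a simplified re-implementation for illustration, not the kernel's err.h, and the device table and find_device_by_path() helper are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Simplified userspace versions of the kernel helpers (assumption). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO addresses of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical device table used only for this example. */
struct device {
	const char *path;
};

static struct device devices[] = { { "/dev/sda" }, { "/dev/sdb" } };

/* Returns a valid pointer on success or ERR_PTR(-errno) on failure. */
static struct device *find_device_by_path(const char *path)
{
	size_t i;

	if (!path || !path[0])
		return ERR_PTR(-EINVAL);
	for (i = 0; i < sizeof(devices) / sizeof(devices[0]); i++)
		if (strcmp(devices[i].path, path) == 0)
			return &devices[i];
	return ERR_PTR(-ENOENT);
}
```

A caller checks IS_ERR() on the result and, on failure, recovers the errno with PTR_ERR(), so one return value carries both the object and the reason for a failed lookup.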
|
|
|
|
2018-09-03 17:46:13 +08:00
|
|
|
static struct btrfs_device *btrfs_find_device_missing_or_by_path(
|
|
|
|
struct btrfs_fs_info *fs_info, const char *device_path)
|
2012-11-05 21:42:30 +08:00
|
|
|
{
|
2018-09-03 17:46:13 +08:00
|
|
|
struct btrfs_device *device = NULL;
|
2012-11-05 21:42:30 +08:00
|
|
|
if (strcmp(device_path, "missing") == 0) {
|
|
|
|
struct list_head *devices;
|
|
|
|
struct btrfs_device *tmp;
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
devices = &fs_info->fs_devices->devices;
|
2012-11-05 21:42:30 +08:00
|
|
|
list_for_each_entry(tmp, devices, dev_list) {
|
2017-12-04 12:54:53 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
|
|
|
|
&tmp->dev_state) && !tmp->bdev) {
|
2018-09-03 17:46:13 +08:00
|
|
|
device = tmp;
|
2012-11-05 21:42:30 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-09-03 17:46:13 +08:00
|
|
|
if (!device)
|
|
|
|
return ERR_PTR(-ENOENT);
|
2012-11-05 21:42:30 +08:00
|
|
|
} else {
|
2018-09-03 17:46:13 +08:00
|
|
|
device = btrfs_find_device_by_path(fs_info, device_path);
|
2012-11-05 21:42:30 +08:00
|
|
|
}
|
2018-09-03 17:46:12 +08:00
|
|
|
|
2018-09-03 17:46:13 +08:00
|
|
|
return device;
|
2012-11-05 21:42:30 +08:00
|
|
|
}
|
|
|
|
|
2016-02-15 23:39:55 +08:00
|
|
|
/*
|
|
|
|
* Lookup a device given by device id, or the path if the id is 0.
|
|
|
|
*/
|
2018-09-03 17:46:14 +08:00
|
|
|
struct btrfs_device *btrfs_find_device_by_devspec(
|
|
|
|
struct btrfs_fs_info *fs_info, u64 devid, const char *devpath)
|
2016-02-13 10:01:35 +08:00
|
|
|
{
|
2018-09-03 17:46:14 +08:00
|
|
|
struct btrfs_device *device;
|
2016-02-13 10:01:35 +08:00
|
|
|
|
2016-02-15 23:39:55 +08:00
|
|
|
if (devid) {
|
2018-09-03 17:46:14 +08:00
|
|
|
device = btrfs_find_device(fs_info, devid, NULL, NULL);
|
|
|
|
if (!device)
|
|
|
|
return ERR_PTR(-ENOENT);
|
2016-02-13 10:01:35 +08:00
|
|
|
} else {
|
2016-02-15 23:39:55 +08:00
|
|
|
if (!devpath || !devpath[0])
|
2018-09-03 17:46:14 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
device = btrfs_find_device_missing_or_by_path(fs_info, devpath);
|
2016-02-13 10:01:35 +08:00
|
|
|
}
|
2018-09-03 17:46:14 +08:00
|
|
|
return device;
|
2016-02-13 10:01:35 +08:00
|
|
|
}
|
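The lookup above dispatches on the device spec: a non-zero devid selects by id, otherwise the path is used, with the literal string "missing" selecting the first device that is still part of the filesystem metadata but has no backing block device. The sketch below mirrors that dispatch in userspace under loud simplifications: struct device and find_by_devspec() are hypothetical, errors are returned as an int with an out parameter, and paths are compared as strings, whereas the code above reads the superblock from the given device and matches on devid and UUID.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device record; not the kernel's struct btrfs_device. */
struct device {
	unsigned long long devid;
	const char *path;	/* NULL when the backing device is absent */
	int in_fs_metadata;	/* nonzero if the filesystem still references it */
};

/*
 * Look up a device by id, or by path when @devid is 0.
 * Returns 0 and sets *out on success, -EINVAL for an empty spec,
 * -ENOENT when nothing matches.
 */
static int find_by_devspec(struct device *devs, size_t n,
			   unsigned long long devid, const char *devpath,
			   struct device **out)
{
	size_t i;

	*out = NULL;
	if (!devid && (!devpath || !devpath[0]))
		return -EINVAL;

	for (i = 0; i < n; i++) {
		struct device *d = &devs[i];
		int match;

		if (devid)
			match = d->devid == devid;
		else if (strcmp(devpath, "missing") == 0)
			match = d->in_fs_metadata && !d->path;
		else
			match = d->path && strcmp(d->path, devpath) == 0;

		if (match) {
			*out = d;
			return 0;
		}
	}
	return -ENOENT;
}
```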
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
/*
|
|
|
|
* Does all the dirty work required for changing the file system's UUID.
|
|
|
|
*/
|
2016-06-23 06:54:24 +08:00
|
|
|
static int btrfs_prepare_sprout(struct btrfs_fs_info *fs_info)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_fs_devices *old_devices;
|
2008-12-12 23:03:26 +08:00
|
|
|
struct btrfs_fs_devices *seed_devices;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_super_block *disk_super = fs_info->super_copy;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_device *device;
|
|
|
|
u64 super_flags;
|
|
|
|
|
2018-03-16 09:21:22 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
2008-12-12 23:03:26 +08:00
|
|
|
if (!fs_devices->seeding)
|
2008-11-18 10:11:30 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2017-06-14 08:48:07 +08:00
|
|
|
seed_devices = alloc_fs_devices(NULL);
|
2013-08-12 19:33:03 +08:00
|
|
|
if (IS_ERR(seed_devices))
|
|
|
|
return PTR_ERR(seed_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
old_devices = clone_fs_devices(fs_devices);
|
|
|
|
if (IS_ERR(old_devices)) {
|
|
|
|
kfree(seed_devices);
|
|
|
|
return PTR_ERR(old_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2018-04-12 10:29:25 +08:00
|
|
|
list_add(&old_devices->fs_list, &fs_uuids);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
memcpy(seed_devices, fs_devices, sizeof(*seed_devices));
|
|
|
|
seed_devices->opened = 1;
|
|
|
|
INIT_LIST_HEAD(&seed_devices->devices);
|
|
|
|
INIT_LIST_HEAD(&seed_devices->alloc_list);
|
2009-06-11 03:17:02 +08:00
|
|
|
mutex_init(&seed_devices->device_list_mutex);
|
2011-04-20 18:07:30 +08:00
|
|
|
|
2018-07-16 22:58:09 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2011-04-20 18:09:16 +08:00
|
|
|
list_splice_init_rcu(&fs_devices->devices, &seed_devices->devices,
|
|
|
|
synchronize_rcu);
|
2014-09-03 21:35:41 +08:00
|
|
|
list_for_each_entry(device, &seed_devices->devices, dev_list)
|
|
|
|
device->fs_devices = seed_devices;
|
2011-04-20 18:07:30 +08:00
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2008-12-12 23:03:26 +08:00
|
|
|
list_splice_init(&fs_devices->alloc_list, &seed_devices->alloc_list);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->seeding = 0;
|
|
|
|
fs_devices->num_devices = 0;
|
|
|
|
fs_devices->open_devices = 0;
|
2014-07-03 18:22:12 +08:00
|
|
|
fs_devices->missing_devices = 0;
|
|
|
|
fs_devices->rotating = 0;
|
2008-12-12 23:03:26 +08:00
|
|
|
fs_devices->seed = seed_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
generate_random_uuid(fs_devices->fsid);
|
2016-06-23 06:54:23 +08:00
|
|
|
memcpy(fs_info->fsid, fs_devices->fsid, BTRFS_FSID_SIZE);
|
2008-11-18 10:11:30 +08:00
|
|
|
memcpy(disk_super->fsid, fs_devices->fsid, BTRFS_FSID_SIZE);
|
2018-07-16 22:58:09 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
number of devices before acquiring the device list mutex.
This could lead to inconsistent results because the device list
and the number of devices counter (amongst other counters related
to the device list) are updated in volumes.c
while holding the device list mutex - except for 2 places, one
was volumes.c:btrfs_prepare_sprout() and the other was
volumes.c:device_list_add().
For example, if we have 2 devices, with IDs 1 and 2 and then add
a new device, with ID 3, and while adding the device is in progress
a BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
devices of 2 and a max dev id of 3. This would be incorrect.
Also, this ioctl handler was reading the fsid while it can be
updated concurrently. This can happen while a new device is
being added and the current filesystem is in seeding mode.
Example:
$ mkfs.btrfs -f /dev/sdb1
$ mkfs.btrfs -f /dev/sdb2
$ btrfstune -S 1 /dev/sdb1
$ mount /dev/sdb1 /mnt/test
$ btrfs device add /dev/sdb2 /mnt/test
If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
could read an fsid that was never valid (some bits part of the old
fsid and others part of the new fsid). Also, it could read a number
of devices that doesn't match the number of devices in the list and
the max device id, as explained before.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
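The race described above goes away when the reader copies every related field inside the same critical section the writers use. A minimal userspace sketch of that idea follows. It assumes a pthread mutex in place of the kernel mutex, and struct fs_devices, struct fs_info_snapshot, fs_add_device() and fs_get_info() are hypothetical simplifications, not the ioctl handler itself.

```c
#include <pthread.h>
#include <string.h>

#define FSID_SIZE 16

/* Hypothetical, trimmed-down device set; not the kernel structure. */
struct fs_devices {
	pthread_mutex_t device_list_mutex;
	unsigned char fsid[FSID_SIZE];
	unsigned long long num_devices;
	unsigned long long max_devid;
};

/* What the reader hands back to its caller. */
struct fs_info_snapshot {
	unsigned char fsid[FSID_SIZE];
	unsigned long long num_devices;
	unsigned long long max_devid;
};

/* Writer: the counter and the maximum id change together under the mutex. */
static void fs_add_device(struct fs_devices *fs, unsigned long long devid)
{
	pthread_mutex_lock(&fs->device_list_mutex);
	fs->num_devices++;
	if (devid > fs->max_devid)
		fs->max_devid = devid;
	pthread_mutex_unlock(&fs->device_list_mutex);
}

/* Reader: all fields are copied in one critical section. */
static void fs_get_info(struct fs_devices *fs, struct fs_info_snapshot *out)
{
	pthread_mutex_lock(&fs->device_list_mutex);
	memcpy(out->fsid, fs->fsid, FSID_SIZE);
	out->num_devices = fs->num_devices;
	out->max_devid = fs->max_devid;
	pthread_mutex_unlock(&fs->device_list_mutex);
}
```

With both sides under the mutex, a reader sees either the state before a device was added or the state after it, never a count of 2 paired with a maximum id of 3, and never an fsid that is half old and half new.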
2013-08-13 03:56:58 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
super_flags = btrfs_super_flags(disk_super) &
|
|
|
|
~BTRFS_SUPER_FLAG_SEEDING;
|
|
|
|
btrfs_set_super_flags(disk_super, super_flags);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
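The last step of btrfs_prepare_sprout() clears the seeding bit in the superblock flags with a mask-and-complement. The sketch below isolates that operation. SUPER_FLAG_SEEDING is a local constant assumed to mirror BTRFS_SUPER_FLAG_SEEDING (bit 32), and clear_seeding() is a hypothetical helper.

```c
#include <stdint.h>

/* Local stand-in for BTRFS_SUPER_FLAG_SEEDING; the bit position is an assumption. */
#define SUPER_FLAG_SEEDING (1ULL << 32)

/* Return @super_flags with the seeding bit cleared and all other bits kept. */
static uint64_t clear_seeding(uint64_t super_flags)
{
	return super_flags & ~SUPER_FLAG_SEEDING;
}
```

The constant is built from 1ULL so that both the shift and the complement are evaluated in 64 bits, which leaves the remaining flag bits untouched.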
|
|
|
|
|
|
|
/*
|
2016-05-20 09:18:45 +08:00
|
|
|
* Store the expected generation for seed devices in device items.
|
2008-11-18 10:11:30 +08:00
|
|
|
*/
|
|
|
|
static int btrfs_finish_sprout(struct btrfs_trans_handle *trans,
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_fs_info *fs_info)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_dev_item *dev_item;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_key key;
|
2017-07-29 17:50:09 +08:00
|
|
|
u8 fs_uuid[BTRFS_FSID_SIZE];
|
2008-11-18 10:11:30 +08:00
|
|
|
u8 dev_uuid[BTRFS_UUID_SIZE];
|
|
|
|
u64 devid;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.offset = 0;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
next_slot:
|
|
|
|
if (path->slots[0] >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret > 0)
|
|
|
|
break;
|
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-11-18 10:11:30 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (key.objectid != BTRFS_DEV_ITEMS_OBJECTID ||
|
|
|
|
key.type != BTRFS_DEV_ITEM_KEY)
|
|
|
|
break;
|
|
|
|
|
|
|
|
dev_item = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_dev_item);
|
|
|
|
devid = btrfs_device_id(leaf, dev_item);
|
2013-08-20 19:20:11 +08:00
|
|
|
read_extent_buffer(leaf, dev_uuid, btrfs_device_uuid(dev_item),
|
2008-11-18 10:11:30 +08:00
|
|
|
BTRFS_UUID_SIZE);
|
2013-08-20 19:20:12 +08:00
|
|
|
read_extent_buffer(leaf, fs_uuid, btrfs_device_fsid(dev_item),
|
2017-07-29 17:50:09 +08:00
|
|
|
BTRFS_FSID_SIZE);
|
2016-06-23 06:54:23 +08:00
|
|
|
device = btrfs_find_device(fs_info, devid, dev_uuid, fs_uuid);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(!device); /* Logic error */
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
if (device->fs_devices->seeding) {
|
|
|
|
btrfs_set_device_generation(leaf, dev_item,
|
|
|
|
device->generation);
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
}
|
|
|
|
|
|
|
|
path->slots[0]++;
|
|
|
|
goto next_slot;
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
error:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-02-15 00:55:53 +08:00
|
|
|
int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
|
2008-04-29 03:29:42 +08:00
|
|
|
{
|
2016-06-22 08:16:08 +08:00
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2011-08-04 22:52:27 +08:00
|
|
|
struct request_queue *q;
|
2008-04-29 03:29:42 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct block_device *bdev;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct super_block *sb = fs_info->sb;
|
2012-06-05 02:03:51 +08:00
|
|
|
struct rcu_string *name;
|
2018-07-03 13:14:50 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2018-07-27 08:04:55 +08:00
|
|
|
u64 orig_super_total_bytes;
|
|
|
|
u64 orig_super_num_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
int seeding_dev = 0;
|
2008-04-29 03:29:42 +08:00
|
|
|
int ret = 0;
|
2017-09-28 14:51:11 +08:00
|
|
|
bool unlocked = false;
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
if (sb_rdonly(sb) && !fs_devices->seeding)
|
2012-05-10 18:10:38 +08:00
|
|
|
return -EROFS;
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL,
|
2016-06-23 06:54:23 +08:00
|
|
|
fs_info->bdev_holder);
|
2010-01-27 10:09:00 +08:00
|
|
|
if (IS_ERR(bdev))
|
|
|
|
return PTR_ERR(bdev);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
if (fs_devices->seeding) {
|
2008-11-18 10:11:30 +08:00
|
|
|
seeding_dev = 1;
|
|
|
|
down_write(&sb->s_umount);
|
|
|
|
mutex_lock(&uuid_mutex);
|
|
|
|
}
|
|
|
|
|
2008-09-29 23:19:10 +08:00
|
|
|
filemap_write_and_wait(bdev->bd_inode->i_mapping);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2018-07-03 13:14:51 +08:00
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list) {
|
2008-04-29 03:29:42 +08:00
|
|
|
if (device->bdev == bdev) {
|
|
|
|
ret = -EEXIST;
|
2012-11-14 22:35:30 +08:00
|
|
|
mutex_unlock(
|
2018-07-03 13:14:50 +08:00
|
|
|
&fs_devices->device_list_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
goto error;
|
2008-04-29 03:29:42 +08:00
|
|
|
}
|
|
|
|
}
|
2018-07-03 13:14:50 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
device = btrfs_alloc_device(fs_info, NULL, NULL);
|
2013-08-23 18:20:17 +08:00
|
|
|
if (IS_ERR(device)) {
|
2008-04-29 03:29:42 +08:00
|
|
|
/* we can safely leave the fs_devices entry around */
|
2013-08-23 18:20:17 +08:00
|
|
|
ret = PTR_ERR(device);
|
2008-11-18 10:11:30 +08:00
|
|
|
goto error;
|
2008-04-29 03:29:42 +08:00
|
|
|
}
|
|
|
|
|
2016-02-11 21:25:38 +08:00
|
|
|
name = rcu_string_strdup(device_path, GFP_KERNEL);
|
2012-06-05 02:03:51 +08:00
|
|
|
if (!name) {
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = -ENOMEM;
|
2017-10-31 02:29:46 +08:00
|
|
|
goto error_free_device;
|
2008-04-29 03:29:42 +08:00
|
|
|
}
|
2012-06-05 02:03:51 +08:00
|
|
|
rcu_assign_pointer(device->name, name);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
2011-01-20 14:19:37 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
2017-10-31 02:29:46 +08:00
|
|
|
goto error_free_device;
|
2011-01-20 14:19:37 +08:00
|
|
|
}
|
|
|
|
|
2011-08-04 22:52:27 +08:00
|
|
|
q = bdev_get_queue(bdev);
|
2017-12-04 12:54:52 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
2008-11-18 10:11:30 +08:00
|
|
|
device->generation = trans->transid;
|
2016-06-23 06:54:23 +08:00
|
|
|
device->io_width = fs_info->sectorsize;
|
|
|
|
device->io_align = fs_info->sectorsize;
|
|
|
|
device->sector_size = fs_info->sectorsize;
|
2017-06-16 19:39:20 +08:00
|
|
|
device->total_bytes = round_down(i_size_read(bdev->bd_inode),
|
|
|
|
fs_info->sectorsize);
|
2009-06-04 21:23:50 +08:00
|
|
|
device->disk_total_bytes = device->total_bytes;
|
2014-09-03 21:35:33 +08:00
|
|
|
device->commit_total_bytes = device->total_bytes;
|
2016-06-23 06:54:56 +08:00
|
|
|
device->fs_info = fs_info;
|
2008-04-29 03:29:42 +08:00
|
|
|
device->bdev = bdev;
|
2017-12-04 12:54:53 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
|
2017-12-04 12:54:55 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
|
2011-02-16 02:12:57 +08:00
|
|
|
device->mode = FMODE_EXCL;
|
2013-10-11 21:20:42 +08:00
|
|
|
device->dev_stats_valid = 1;
|
2017-06-16 07:48:05 +08:00
|
|
|
set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (seeding_dev) {
|
2017-11-28 05:05:09 +08:00
|
|
|
sb->s_flags &= ~SB_RDONLY;
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_prepare_sprout(fs_info);
|
2017-09-28 14:51:10 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto error_trans;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
device->fs_devices = fs_devices;
|
2009-06-11 03:17:02 +08:00
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2018-07-03 13:14:50 +08:00
|
|
|
list_add_rcu(&device->dev_list, &fs_devices->devices);
|
|
|
|
list_add(&device->dev_alloc_list, &fs_devices->alloc_list);
|
|
|
|
fs_devices->num_devices++;
|
|
|
|
fs_devices->open_devices++;
|
|
|
|
fs_devices->rw_devices++;
|
|
|
|
fs_devices->total_devices++;
|
|
|
|
fs_devices->total_rw_bytes += device->total_bytes;
|
2008-09-06 04:43:54 +08:00
|
|
|
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
|
2011-09-27 05:12:22 +08:00
|
|
|
|
2017-04-04 18:40:19 +08:00
|
|
|
if (!blk_queue_nonrot(q))
|
2018-07-03 13:14:50 +08:00
|
|
|
fs_devices->rotating = 1;
|
2009-06-10 21:51:32 +08:00
|
|
|
|
2018-07-27 08:04:55 +08:00
|
|
|
orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_set_super_total_bytes(fs_info->super_copy,
|
2018-07-27 08:04:55 +08:00
|
|
|
round_down(orig_super_total_bytes + device->total_bytes,
|
|
|
|
fs_info->sectorsize));
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2018-07-27 08:04:55 +08:00
|
|
|
orig_super_num_devices = btrfs_super_num_devices(fs_info->super_copy);
|
|
|
|
btrfs_set_super_num_devices(fs_info->super_copy,
|
|
|
|
orig_super_num_devices + 1);
|
2014-06-03 11:36:01 +08:00
|
|
|
|
|
|
|
/* add sysfs device entry */
|
2018-07-03 13:14:50 +08:00
|
|
|
btrfs_sysfs_add_device_link(fs_devices, device);
|
2014-06-03 11:36:01 +08:00
|
|
|
|
2014-09-03 21:35:41 +08:00
|
|
|
/*
|
|
|
|
* we've got more storage, clear any full flags on the space
|
|
|
|
* infos
|
|
|
|
*/
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_clear_space_info_full(fs_info);
|
2014-09-03 21:35:41 +08:00
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2018-07-03 13:14:50 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (seeding_dev) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2017-02-11 02:49:01 +08:00
|
|
|
ret = init_first_rw_device(trans, fs_info);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2017-09-28 14:51:10 +08:00
|
|
|
goto error_sysfs;
|
2012-09-18 21:52:32 +08:00
|
|
|
}
|
2014-09-03 21:35:41 +08:00
|
|
|
}
|
|
|
|
|
2018-07-21 00:37:47 +08:00
|
|
|
ret = btrfs_add_dev_item(trans, device);
|
2014-09-03 21:35:41 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2017-09-28 14:51:10 +08:00
|
|
|
goto error_sysfs;
|
2014-09-03 21:35:41 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (seeding_dev) {
|
|
|
|
char fsid_buf[BTRFS_UUID_UNPARSED_SIZE];
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_finish_sprout(trans, fs_info);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2017-09-28 14:51:10 +08:00
|
|
|
goto error_sysfs;
|
2012-09-18 21:52:32 +08:00
|
|
|
}
|
2014-06-03 11:36:03 +08:00
|
|
|
|
|
|
|
/* Sprouting would change fsid of the mounted root,
|
|
|
|
* so rename the fsid on the sysfs
|
|
|
|
*/
|
|
|
|
snprintf(fsid_buf, BTRFS_UUID_UNPARSED_SIZE, "%pU",
|
2016-06-23 06:54:23 +08:00
|
|
|
fs_info->fsid);
|
2018-07-03 13:14:50 +08:00
|
|
|
if (kobject_rename(&fs_devices->fsid_kobj, fsid_buf))
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"sysfs: failed to create fsid for sprout");
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-09-10 09:39:03 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (seeding_dev) {
|
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
up_write(&sb->s_umount);
|
2017-09-28 14:51:11 +08:00
|
|
|
unlocked = true;
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) /* transaction commit */
|
|
|
|
return ret;
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_relocate_sys_chunks(fs_info);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0)
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_handle_fs_error(fs_info, ret,
|
2016-09-20 22:05:00 +08:00
|
|
|
"Failed to relocate sys chunks after device initialization. This can be fixed using the \"btrfs balance\" command.");
|
Btrfs: fix deadlock caused by the nested chunk allocation
Steps to reproduce:
# mkfs.btrfs -m raid1 <disk1> <disk2>
# btrfstune -S 1 <disk1>
# mount <disk1> <mnt>
# btrfs device add <disk3> <disk4> <mnt>
# mount -o remount,rw <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
Deadlock happened.
It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.
By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.
So, I fix this problem by committing the transaction at the end of the first
device insertion.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-16 19:26:46 +08:00
|
|
|
trans = btrfs_attach_transaction(root);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
if (PTR_ERR(trans) == -ENOENT)
|
|
|
|
return 0;
|
2017-09-28 14:51:11 +08:00
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
trans = NULL;
|
|
|
|
goto error_sysfs;
|
2012-10-16 19:26:46 +08:00
|
|
|
}
|
2016-09-10 09:39:03 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2014-04-16 17:02:32 +08:00
|
|
|
/* Update ctime/mtime for libblkid */
|
|
|
|
update_dev_time(device_path);
|
2008-11-18 10:11:30 +08:00
|
|
|
return ret;
|
2012-03-12 23:03:00 +08:00
|
|
|
|
2017-09-28 14:51:10 +08:00
|
|
|
error_sysfs:
|
2018-07-03 13:14:50 +08:00
|
|
|
btrfs_sysfs_rm_device_link(fs_devices, device);
|
2018-07-27 08:04:55 +08:00
|
|
|
mutex_lock(&fs_info->fs_devices->device_list_mutex);
|
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
list_del_rcu(&device->dev_list);
|
|
|
|
list_del(&device->dev_alloc_list);
|
|
|
|
fs_info->fs_devices->num_devices--;
|
|
|
|
fs_info->fs_devices->open_devices--;
|
|
|
|
fs_info->fs_devices->rw_devices--;
|
|
|
|
fs_info->fs_devices->total_devices--;
|
|
|
|
fs_info->fs_devices->total_rw_bytes -= device->total_bytes;
|
|
|
|
atomic64_sub(device->total_bytes, &fs_info->free_chunk_space);
|
|
|
|
btrfs_set_super_total_bytes(fs_info->super_copy,
|
|
|
|
orig_super_total_bytes);
|
|
|
|
btrfs_set_super_num_devices(fs_info->super_copy,
|
|
|
|
orig_super_num_devices);
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2012-03-12 23:03:00 +08:00
|
|
|
error_trans:
|
2017-09-28 14:51:09 +08:00
|
|
|
if (seeding_dev)
|
2017-11-28 05:05:09 +08:00
|
|
|
sb->s_flags |= SB_RDONLY;
|
2017-09-28 14:51:11 +08:00
|
|
|
if (trans)
|
|
|
|
btrfs_end_transaction(trans);
|
2017-10-31 02:29:46 +08:00
|
|
|
error_free_device:
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(device);
|
2008-11-18 10:11:30 +08:00
|
|
|
error:
|
block: make blkdev_get/put() handle exclusive access
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open is for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if @FMODE_EXCL is specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 18:55:17 +08:00
|
|
|
blkdev_put(bdev, FMODE_EXCL);
|
2017-09-28 14:51:11 +08:00
|
|
|
if (seeding_dev && !unlocked) {
|
2008-11-18 10:11:30 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
up_write(&sb->s_umount);
|
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
return ret;
|
2008-04-29 03:29:42 +08:00
|
|
|
}
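The error_sysfs path above restores the superblock totals from values saved before the update (orig_super_total_bytes, orig_super_num_devices). A minimal userspace sketch of that save, update, and roll-back pattern follows. The struct, the step callbacks, and the function names are hypothetical stand-ins for illustration, not kernel APIs, and no locking is modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the counters kept in the super block copy. */
struct totals {
	uint64_t total_bytes;
	uint64_t num_devices;
};

/* Hypothetical later step that may fail; nonzero return means error. */
typedef int (*step_fn)(void *ctx);

static int step_ok(void *ctx)   { (void)ctx; return 0; }
static int step_fail(void *ctx) { (void)ctx; return -5; }

/*
 * Account for a new device, then run a step that may fail.
 * On failure, restore the counters saved before the update.
 */
static int add_device_accounting(struct totals *t, uint64_t dev_bytes,
				 step_fn step, void *ctx)
{
	const uint64_t orig_total = t->total_bytes;
	const uint64_t orig_num = t->num_devices;
	int ret;

	t->total_bytes = orig_total + dev_bytes;
	t->num_devices = orig_num + 1;

	ret = step(ctx);
	if (ret)
		goto error_undo;
	return 0;

error_undo:
	/* Undo in one place, using the saved originals. */
	t->total_bytes = orig_total;
	t->num_devices = orig_num;
	return ret;
}
```

Saving the originals up front lets the unwind path restore exact values instead of recomputing them, which matters when the forward update rounded the total.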
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_device *device)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_root *root = device->fs_info->chunk_root;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_dev_item *dev_item;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
|
|
|
key.offset = device->devid;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
|
|
|
|
|
|
|
|
btrfs_set_device_id(leaf, dev_item, device->devid);
|
|
|
|
btrfs_set_device_type(leaf, dev_item, device->type);
|
|
|
|
btrfs_set_device_io_align(leaf, dev_item, device->io_align);
|
|
|
|
btrfs_set_device_io_width(leaf, dev_item, device->io_width);
|
|
|
|
btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_set_device_total_bytes(leaf, dev_item,
|
|
|
|
btrfs_device_get_disk_total_bytes(device));
|
|
|
|
btrfs_set_device_bytes_used(leaf, dev_item,
|
|
|
|
btrfs_device_get_bytes_used(device));
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-09-03 21:35:41 +08:00
|
|
|
int btrfs_grow_device(struct btrfs_trans_handle *trans,
|
2008-04-26 04:53:30 +08:00
|
|
|
struct btrfs_device *device, u64 new_size)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2014-09-03 21:35:33 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices;
|
2014-09-03 21:35:41 +08:00
|
|
|
u64 old_total;
|
|
|
|
u64 diff;
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
|
2008-11-18 10:11:30 +08:00
|
|
|
return -EACCES;
|
2014-09-03 21:35:41 +08:00
|
|
|
|
2017-06-16 19:39:20 +08:00
|
|
|
new_size = round_down(new_size, fs_info->sectorsize);
|
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:41 +08:00
|
|
|
old_total = btrfs_super_total_bytes(super_copy);
|
2017-07-21 16:28:24 +08:00
|
|
|
diff = round_down(new_size - device->total_bytes, fs_info->sectorsize);
|
2014-09-03 21:35:41 +08:00
|
|
|
|
2012-11-06 01:29:28 +08:00
|
|
|
if (new_size <= device->total_bytes ||
|
2017-12-04 12:54:55 +08:00
|
|
|
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
return -EINVAL;
|
2014-09-03 21:35:41 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
fs_devices = fs_info->fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-06-16 19:39:20 +08:00
|
|
|
btrfs_set_super_total_bytes(super_copy,
|
|
|
|
round_down(old_total + diff, fs_info->sectorsize));
|
2008-11-18 10:11:30 +08:00
|
|
|
device->fs_devices->total_rw_bytes += diff;
|
|
|
|
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_device_set_total_bytes(device, new_size);
|
|
|
|
btrfs_device_set_disk_total_bytes(device, new_size);
|
2016-06-23 06:54:56 +08:00
|
|
|
btrfs_clear_space_info_full(device->fs_info);
|
2014-09-03 21:35:33 +08:00
|
|
|
if (list_empty(&device->resized_list))
|
|
|
|
list_add_tail(&device->resized_list,
|
|
|
|
&fs_devices->resized_devices);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2009-03-11 00:39:20 +08:00
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
return btrfs_update_device(trans, device);
|
|
|
|
}
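btrfs_grow_device() above aligns the requested size down to the sector size and rejects a request that does not actually enlarge the device. The arithmetic can be sketched in userspace as below. The helper names are hypothetical, the alignment is assumed to be a power of two, and returning 0 stands in for the -EINVAL path.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align; align must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/*
 * Hypothetical helper: number of bytes a grow request would add,
 * or 0 if the aligned new size is not strictly larger than cur_size.
 */
static uint64_t grow_delta(uint64_t cur_size, uint64_t new_size,
			   uint64_t sectorsize)
{
	new_size = round_down_pow2(new_size, sectorsize);
	if (new_size <= cur_size)
		return 0;
	return round_down_pow2(new_size - cur_size, sectorsize);
}
```

For example, with a 4096-byte sector, growing a 4096-byte device to a requested 10000 bytes aligns the target to 8192 and yields a delta of 4096.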
|
|
|
|
|
2018-07-21 00:37:52 +08:00
|
|
|
static int btrfs_free_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
|
2008-04-26 04:53:30 +08:00
|
|
|
{
|
2018-07-21 00:37:52 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2008-04-26 04:53:30 +08:00
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2017-07-27 19:37:29 +08:00
|
|
|
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
2008-04-26 04:53:30 +08:00
|
|
|
key.offset = chunk_offset;
|
|
|
|
key.type = BTRFS_CHUNK_ITEM_KEY;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
else if (ret > 0) { /* Logic error or corruption */
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_handle_fs_error(fs_info, -ENOENT,
|
|
|
|
"Failed lookup while freeing chunk.");
|
2012-03-12 23:03:00 +08:00
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0)
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_handle_fs_error(fs_info, ret,
|
|
|
|
"Failed to delete chunk item.");
|
2012-03-12 23:03:00 +08:00
|
|
|
out:
|
2008-04-26 04:53:30 +08:00
|
|
|
btrfs_free_path(path);
|
2011-05-19 12:37:44 +08:00
|
|
|
return ret;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
2017-07-27 19:37:29 +08:00
|
|
|
static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
|
2008-04-26 04:53:30 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2008-04-26 04:53:30 +08:00
|
|
|
struct btrfs_disk_key *disk_key;
|
|
|
|
struct btrfs_chunk *chunk;
|
|
|
|
u8 *ptr;
|
|
|
|
int ret = 0;
|
|
|
|
u32 num_stripes;
|
|
|
|
u32 array_size;
|
|
|
|
u32 len = 0;
|
|
|
|
u32 cur;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2008-04-26 04:53:30 +08:00
|
|
|
array_size = btrfs_super_sys_array_size(super_copy);
|
|
|
|
|
|
|
|
ptr = super_copy->sys_chunk_array;
|
|
|
|
cur = 0;
|
|
|
|
|
|
|
|
while (cur < array_size) {
|
|
|
|
disk_key = (struct btrfs_disk_key *)ptr;
|
|
|
|
btrfs_disk_key_to_cpu(&key, disk_key);
|
|
|
|
|
|
|
|
len = sizeof(*disk_key);
|
|
|
|
|
|
|
|
if (key.type == BTRFS_CHUNK_ITEM_KEY) {
|
|
|
|
chunk = (struct btrfs_chunk *)(ptr + len);
|
|
|
|
num_stripes = btrfs_stack_chunk_num_stripes(chunk);
|
|
|
|
len += btrfs_chunk_item_size(num_stripes);
|
|
|
|
} else {
|
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
2017-07-27 19:37:29 +08:00
|
|
|
if (key.objectid == BTRFS_FIRST_CHUNK_TREE_OBJECTID &&
|
2008-04-26 04:53:30 +08:00
|
|
|
key.offset == chunk_offset) {
|
|
|
|
memmove(ptr, ptr + len, array_size - (cur + len));
|
|
|
|
array_size -= len;
|
|
|
|
btrfs_set_super_sys_array_size(super_copy, array_size);
|
|
|
|
} else {
|
|
|
|
ptr += len;
|
|
|
|
cur += len;
|
|
|
|
}
|
|
|
|
}
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2008-04-26 04:53:30 +08:00
|
|
|
return ret;
|
|
|
|
}
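btrfs_del_sys_chunk() above walks a packed array of variable-length records and removes a match in place with memmove(), advancing the cursor only when nothing was removed. A minimal userspace sketch of the same technique is shown here. The record layout (one id byte, one length byte, then the payload) is a hypothetical simplification and not the on-disk sys_chunk_array format; the bounds checks are additions for the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout: 1-byte id, 1-byte payload length, payload. */
#define REC_HDR 2

/*
 * Remove every record whose id matches from a packed byte array.
 * Returns the new array size, or -1 if a record overruns the array.
 */
static int remove_records(uint8_t *array, uint32_t size, uint8_t id)
{
	uint8_t *ptr = array;
	uint32_t cur = 0;

	while (cur < size) {
		uint32_t len;

		if (size - cur < REC_HDR)
			return -1;
		len = REC_HDR + ptr[1];
		if (len > size - cur)
			return -1;

		if (ptr[0] == id) {
			/* Slide the tail down; ptr and cur stay put. */
			memmove(ptr, ptr + len, size - (cur + len));
			size -= len;
		} else {
			ptr += len;
			cur += len;
		}
	}
	return (int)size;
}
```

memmove() is used rather than memcpy() because the source and destination ranges overlap whenever the tail is longer than the removed record.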
|
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
/*
|
|
|
|
* btrfs_get_chunk_map() - Find the mapping containing the given logical extent.
|
|
|
|
* @logical: Logical block offset in bytes.
|
|
|
|
* @length: Length of extent in bytes.
|
|
|
|
*
|
|
|
|
* Return: Chunk mapping or ERR_PTR.
|
|
|
|
*/
|
|
|
|
struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 length)
|
2017-03-15 04:33:55 +08:00
|
|
|
{
|
|
|
|
struct extent_map_tree *em_tree;
|
|
|
|
struct extent_map *em;
|
|
|
|
|
|
|
|
em_tree = &fs_info->mapping_tree.map_tree;
|
|
|
|
read_lock(&em_tree->lock);
|
|
|
|
em = lookup_extent_mapping(em_tree, logical, length);
|
|
|
|
read_unlock(&em_tree->lock);
|
|
|
|
|
|
|
|
if (!em) {
|
|
|
|
btrfs_crit(fs_info, "unable to find logical %llu length %llu",
|
|
|
|
logical, length);
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (em->start > logical || em->start + em->len < logical) {
|
|
|
|
btrfs_crit(fs_info,
|
|
|
|
"found a bad mapping, wanted %llu-%llu, found %llu-%llu",
|
|
|
|
logical, length, em->start, em->start + em->len);
|
|
|
|
free_extent_map(em);
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* callers are responsible for dropping em's ref. */
|
|
|
|
return em;
|
|
|
|
}
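The sanity check in btrfs_get_chunk_map() above rejects a mapping when the requested logical address lies before em->start or past em->start + em->len. The sketch below mirrors that condition literally on a hypothetical range struct; note that, as written, it accepts an address equal to start + len.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the start/len pair of an extent map. */
struct range {
	uint64_t start;
	uint64_t len;
};

/* True unless logical lies before start or past start + len. */
static bool mapping_covers(const struct range *r, uint64_t logical)
{
	return !(r->start > logical || r->start + r->len < logical);
}
```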
|
|
|
|
|
2018-07-21 00:37:53 +08:00
|
|
|
int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
|
2008-04-26 04:53:30 +08:00
|
|
|
{
|
2018-07-21 00:37:53 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2008-04-26 04:53:30 +08:00
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
2014-09-03 21:35:41 +08:00
|
|
|
u64 dev_extent_len = 0;
|
2014-09-18 23:20:02 +08:00
|
|
|
int i, ret = 0;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
|
2017-03-15 04:33:55 +08:00
|
|
|
if (IS_ERR(em)) {
|
2014-09-18 23:20:02 +08:00
|
|
|
/*
|
|
|
|
* This is a logic error, but we don't want to just rely on the
|
2016-03-05 03:23:12 +08:00
|
|
|
* user having built with ASSERT enabled, so if ASSERT doesn't
|
2014-09-18 23:20:02 +08:00
|
|
|
* do anything we still error out.
|
|
|
|
*/
|
|
|
|
ASSERT(0);
|
2017-03-15 04:33:55 +08:00
|
|
|
return PTR_ERR(em);
|
2014-09-18 23:20:02 +08:00
|
|
|
}
|
2015-06-03 22:55:48 +08:00
|
|
|
map = em->map_lookup;
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2018-06-20 20:49:07 +08:00
|
|
|
check_system_chunk(trans, map->type);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2016-05-20 11:34:23 +08:00
|
|
|
/*
|
|
|
|
* Take the device list mutex to prevent races with the final phase of
|
|
|
|
* a device replace operation that replaces the device object associated
|
|
|
|
* with map stripes (dev-replace.c:btrfs_dev_replace_finishing()).
|
|
|
|
*/
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2008-04-26 04:53:30 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
2014-09-18 23:20:02 +08:00
|
|
|
struct btrfs_device *device = map->stripes[i].dev;
|
2014-09-03 21:35:41 +08:00
|
|
|
ret = btrfs_free_dev_extent(trans, device,
|
|
|
|
map->stripes[i].physical,
|
|
|
|
&dev_extent_len);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret) {
|
2016-05-20 11:34:23 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2014-09-03 21:35:41 +08:00
|
|
|
if (device->bytes_used > 0) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:41 +08:00
|
|
|
btrfs_device_set_bytes_used(device,
|
|
|
|
device->bytes_used - dev_extent_len);
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_clear_space_info_full(fs_info);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:41 +08:00
|
|
|
}
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2008-05-14 01:46:40 +08:00
|
|
|
if (map->stripes[i].dev) {
|
|
|
|
ret = btrfs_update_device(trans, map->stripes[i].dev);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret) {
|
2016-05-20 11:34:23 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
2016-05-20 11:34:23 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
2018-07-21 00:37:52 +08:00
|
|
|
ret = btrfs_free_chunk(trans, chunk_offset);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
trace_btrfs_chunk_free(fs_info, map, chunk_offset, em->len);
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g.
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints are helpful for
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between a physical offset and a logical offset and represents space
information in btrfs; these tracepoints are helpful for tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
|
2017-07-27 19:37:29 +08:00
|
|
|
ret = btrfs_del_sys_chunk(fs_info, chunk_offset);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
2018-06-20 20:48:56 +08:00
|
|
|
ret = btrfs_remove_block_group(trans, chunk_offset, em);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2014-09-18 23:20:02 +08:00
|
|
|
out:
|
2008-11-18 10:11:30 +08:00
|
|
|
/* once for us */
|
|
|
|
free_extent_map(em);
|
2014-09-18 23:20:02 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2016-06-21 22:40:19 +08:00
|
|
|
static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
|
2014-09-18 23:20:02 +08:00
|
|
|
{
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2016-10-11 04:43:31 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2014-09-18 23:20:02 +08:00
|
|
|
int ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:
        CPU 0                                       CPU 1
btrfs_balance()
  btrfs_relocate_chunk()
    btrfs_relocate_block_group(bg X)
      btrfs_lookup_block_group(bg X)
                                            cleaner_kthread
                                              locks fs_info->cleaner_mutex
                                              btrfs_delete_unused_bgs()
                                                finds bg X, which became
                                                unused in the previous
                                                transaction
                                                checks bg X ->ro == 0,
                                                so it proceeds
      sets bg X ->ro to 1
      (btrfs_set_block_group_ro(bg X))
      blocks on fs_info->cleaner_mutex
                                                btrfs_remove_chunk(bg X)
                                              unlocks fs_info->cleaner_mutex
      acquires fs_info->cleaner_mutex
      relocate_block_group()
        --> does nothing, no extents found in
            the extent tree from bg X
      unlocks fs_info->cleaner_mutex
  btrfs_relocate_block_group(bg X) returns
  btrfs_remove_chunk(bg X)
    extent map not found
      --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
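The serialization idea can be illustrated outside the kernel. The following is a minimal userspace sketch using POSIX threads, not the kernel change itself: struct demo_bg, demo_remove_chunk(), demo_relocate_or_delete() and demo_run_race() are hypothetical names invented for this illustration, and a pthread mutex stands in for fs_info->delete_unused_bgs_mutex. Two threads race to remove the same chunk, but because the lookup and the removal happen inside one critical section, the second thread sees that the chunk is gone and skips it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block group and its chunk mapping. */
struct demo_bg {
	bool mapped;	/* true while the extent map for the chunk exists */
	int removals;	/* how many times the chunk was removed */
};

/* Plays the role of fs_info->delete_unused_bgs_mutex in this sketch. */
static pthread_mutex_t delete_unused_bgs_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for btrfs_remove_chunk(): the mapping must still exist. */
static void demo_remove_chunk(struct demo_bg *bg)
{
	assert(bg->mapped);	/* mirrors the assertion that fired in the trace */
	bg->mapped = false;
	bg->removals++;
}

/*
 * Both the "balance" and the "cleaner" thread run this. The lookup and the
 * removal share one critical section, so whichever thread loses the race
 * finds the chunk already gone and does nothing.
 */
static void *demo_relocate_or_delete(void *arg)
{
	struct demo_bg *bg = arg;

	pthread_mutex_lock(&delete_unused_bgs_mutex);
	if (bg->mapped)
		demo_remove_chunk(bg);
	pthread_mutex_unlock(&delete_unused_bgs_mutex);
	return NULL;
}

/* Returns the number of removals performed, or -1 if a thread failed to start. */
int demo_run_race(void)
{
	struct demo_bg bg = { .mapped = true, .removals = 0 };
	pthread_t balance, cleaner;

	if (pthread_create(&balance, NULL, demo_relocate_or_delete, &bg))
		return -1;
	if (pthread_create(&cleaner, NULL, demo_relocate_or_delete, &bg)) {
		pthread_join(balance, NULL);
		return -1;
	}
	pthread_join(balance, NULL);
	pthread_join(cleaner, NULL);
	return bg.removals;
}
```

Calling demo_run_race() should report exactly one removal regardless of which thread wins. Without the mutex, both threads could pass the mapped check before either removes the chunk, which is the userspace analogue of the "extent map not found" assertion.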
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
/*
|
|
|
|
* Prevent races with automatic removal of unused block groups.
|
|
|
|
* After we relocate and before we remove the chunk with offset
|
|
|
|
* chunk_offset, automatic removal of the block group can kick in,
|
|
|
|
* resulting in a failure when calling btrfs_remove_chunk() below.
|
|
|
|
*
|
|
|
|
* Make sure to acquire this mutex before doing a tree search (dev
|
|
|
|
* or chunk trees) to find chunks. Otherwise the cleaner kthread might
|
|
|
|
* call btrfs_remove_chunk() (through btrfs_delete_unused_bgs()) after
|
|
|
|
* we release the path used to search the chunk/dev tree and before
|
|
|
|
* the current task acquires this mutex and calls us.
|
|
|
|
*/
|
2018-03-16 09:21:22 +08:00
|
|
|
lockdep_assert_held(&fs_info->delete_unused_bgs_mutex);
|
2015-06-11 07:58:53 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_can_relocate(fs_info, chunk_offset);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
/* step one, relocate all the extents inside this chunk */
|
2016-06-23 06:54:24 +08:00
|
|
|
btrfs_scrub_pause(fs_info);
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_relocate_block_group(fs_info, chunk_offset);
|
2016-06-23 06:54:24 +08:00
|
|
|
btrfs_scrub_continue(fs_info);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2018-03-21 03:25:26 +08:00
|
|
|
/*
|
|
|
|
* We add the kobjects here (and after forcing data chunk creation)
|
|
|
|
* since relocation is the only place we'll create chunks of a new
|
|
|
|
* type at runtime. The only place where we'll remove the last
|
|
|
|
* chunk of a type is the call immediately below this one. Even
|
|
|
|
* so, we're protected against races with the cleaner thread since
|
|
|
|
* we're covered by the delete_unused_bgs_mutex.
|
|
|
|
*/
|
|
|
|
btrfs_add_raid_kobjects(fs_info);
|
|
|
|
|
2016-10-11 04:43:31 +08:00
|
|
|
trans = btrfs_start_trans_remove_block_group(root->fs_info,
|
|
|
|
chunk_offset);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
btrfs_handle_fs_error(root->fs_info, ret, NULL);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-09-18 23:20:02 +08:00
|
|
|
/*
|
2016-10-11 04:43:31 +08:00
|
|
|
* step two, delete the device extents and the
|
|
|
|
* chunk tree entries
|
2014-09-18 23:20:02 +08:00
|
|
|
*/
|
2018-07-21 00:37:53 +08:00
|
|
|
ret = btrfs_remove_chunk(trans, chunk_offset);
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2016-10-11 04:43:31 +08:00
|
|
|
return ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_root *chunk_root = fs_info->chunk_root;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_chunk *chunk;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
u64 chunk_type;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
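The "return -ENOSPC and move on to the next chunk" behaviour, combined with a single retry pass, can be sketched in plain C as follows. This is an illustration under stated assumptions rather than the kernel code: demo_relocate_all(), demo_relocate_fn, struct demo_ctx and demo_relocate() are hypothetical names, chunks are plain integers, and the callback is assumed to return 0 for a chunk that was already moved.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: 0 on success, -ENOSPC if there is no room. */
typedef int (*demo_relocate_fn)(int chunk, void *ctx);

/*
 * Walk the chunks from last to first. A chunk that fails with -ENOSPC is
 * counted and skipped; if any failed, make one more pass, since relocating
 * the other chunks may have freed enough space.
 */
int demo_relocate_all(const int *chunks, size_t n,
		      demo_relocate_fn relocate, void *ctx)
{
	bool retried = false;
	int failed;

again:
	failed = 0;
	for (size_t i = n; i-- > 0; ) {
		int ret = relocate(chunks[i], ctx);

		if (ret == -ENOSPC)
			failed++;
		else if (ret)
			return ret;	/* any other error is fatal */
	}
	if (failed && !retried) {
		retried = true;
		goto again;
	}
	return failed ? -ENOSPC : 0;
}

/* Toy state: chunk 0 is empty, chunk 1 needs one free slot to move. */
struct demo_ctx {
	int free_slots;
	bool moved[2];
};

int demo_relocate(int chunk, void *p)
{
	struct demo_ctx *ctx = p;

	if (ctx->moved[chunk])
		return 0;		/* already relocated in an earlier pass */
	if (chunk == 0) {
		ctx->free_slots++;	/* removing the empty chunk frees a slot */
		ctx->moved[0] = true;
		return 0;
	}
	if (ctx->free_slots == 0)
		return -ENOSPC;
	ctx->free_slots--;
	ctx->moved[chunk] = true;
	return 0;
}
```

With no free slots at the start, the first pass fails on chunk 1 but removes the empty chunk 0, and the second pass then succeeds on chunk 1.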
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't, return -ENOSPC so that if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
bool retried = false;
|
|
|
|
int failed = 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
again:
|
2008-11-18 10:11:30 +08:00
|
|
|
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
key.type = BTRFS_CHUNK_ITEM_KEY;
|
|
|
|
|
|
|
|
while (1) {
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_lock(&fs_info->delete_unused_bgs_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
|
2015-06-11 07:58:53 +08:00
|
|
|
if (ret < 0) {
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
goto error;
|
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret == 0); /* Corruption */
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
ret = btrfs_previous_item(chunk_root, path, key.objectid,
|
|
|
|
key.type);
|
2015-06-11 07:58:53 +08:00
|
|
|
if (ret)
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
if (ret > 0)
|
|
|
|
break;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keep the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
the same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
2008-09-26 22:09:34 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
chunk = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_chunk);
|
|
|
|
chunk_type = btrfs_chunk_type(leaf, chunk);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) {
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_relocate_chunk(fs_info, found_key.offset);
|
2009-09-12 04:11:19 +08:00
|
|
|
if (ret == -ENOSPC)
|
|
|
|
failed++;
|
2014-07-09 06:21:41 +08:00
|
|
|
else
|
|
|
|
BUG_ON(ret);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (found_key.offset == 0)
|
|
|
|
break;
|
|
|
|
key.offset = found_key.offset - 1;
|
|
|
|
}
|
|
|
|
ret = 0;
|
2009-09-12 04:11:19 +08:00
|
|
|
if (failed && !retried) {
|
|
|
|
failed = 0;
|
|
|
|
retried = true;
|
|
|
|
goto again;
|
2013-10-31 13:00:08 +08:00
|
|
|
} else if (WARN_ON(failed && retried)) {
|
2009-09-12 04:11:19 +08:00
|
|
|
ret = -ENOSPC;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
error:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
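The retry logic at the end of the function above (count `-ENOSPC` failures, run one more pass, then give up with `-ENOSPC`) can be sketched in plain userspace C. This is a minimal illustration, not the kernel code: `relocate_with_one_retry`, `relocate_fn` and `flaky_relocate` are hypothetical names, and where the excerpt uses `BUG_ON(ret)` for other errors, the sketch simply returns the error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callback type: relocate item `index`, return 0 or -errno. */
typedef int (*relocate_fn)(int index, void *ctx);

/*
 * Walk all items once; if any of them failed with -ENOSPC, walk them a
 * second time. A failure on the second pass is reported as -ENOSPC.
 */
static int relocate_with_one_retry(relocate_fn relocate, int count, void *ctx)
{
	bool retried = false;
	int failed;

	assert(relocate);
again:
	failed = 0;
	for (int i = 0; i < count; i++) {
		int ret = relocate(i, ctx);

		if (ret == -ENOSPC)
			failed++;
		else if (ret)
			return ret;	/* the excerpt uses BUG_ON(ret) here */
	}
	if (failed && !retried) {
		retried = true;
		goto again;
	}
	return failed ? -ENOSPC : 0;
}

/* Example callback: item 0 returns -ENOSPC while the budget in *ctx lasts. */
static int flaky_relocate(int index, void *ctx)
{
	int *enospc_budget = ctx;

	if (index == 0 && *enospc_budget > 0) {
		(*enospc_budget)--;
		return -ENOSPC;
	}
	return 0;
}
```

With a budget of 1 the second pass succeeds and the result is 0; with a budget of 2 both passes fail and the result is `-ENOSPC`.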
|
|
|
|
|
2017-11-16 07:28:11 +08:00
|
|
|
/*
|
|
|
|
* return 1 : allocate a data chunk successfully,
|
|
|
|
* return <0: errors during allocating a data chunk,
|
|
|
|
* return 0 : no need to allocate a data chunk.
|
|
|
|
*/
|
|
|
|
static int btrfs_may_alloc_data_chunk(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 chunk_offset)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
u64 bytes_used;
|
|
|
|
u64 chunk_type;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
|
|
|
|
ASSERT(cache);
|
|
|
|
chunk_type = cache->flags;
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
|
|
|
|
if (chunk_type & BTRFS_BLOCK_GROUP_DATA) {
|
|
|
|
spin_lock(&fs_info->data_sinfo->lock);
|
|
|
|
bytes_used = fs_info->data_sinfo->bytes_used;
|
|
|
|
spin_unlock(&fs_info->data_sinfo->lock);
|
|
|
|
|
|
|
|
if (!bytes_used) {
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
trans = btrfs_join_transaction(fs_info->tree_root);
|
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
|
2018-06-20 20:49:15 +08:00
|
|
|
ret = btrfs_force_chunk_alloc(trans,
|
2017-11-16 07:28:11 +08:00
|
|
|
BTRFS_BLOCK_GROUP_DATA);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2018-03-21 03:25:26 +08:00
|
|
|
btrfs_add_raid_kobjects(fs_info);
|
|
|
|
|
2017-11-16 07:28:11 +08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
static int insert_balance_item(struct btrfs_fs_info *fs_info,
|
2012-01-17 04:04:48 +08:00
|
|
|
struct btrfs_balance_control *bctl)
|
|
|
|
{
|
2016-06-22 09:16:51 +08:00
|
|
|
struct btrfs_root *root = fs_info->tree_root;
|
2012-01-17 04:04:48 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_balance_item *item;
|
|
|
|
struct btrfs_disk_balance_args disk_bargs;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret, err;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
trans = btrfs_start_transaction(root, 0);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = BTRFS_BALANCE_OBJECTID;
|
2016-01-26 00:51:31 +08:00
|
|
|
key.type = BTRFS_TEMPORARY_ITEM_KEY;
|
2012-01-17 04:04:48 +08:00
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key,
|
|
|
|
sizeof(*item));
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_balance_item);
|
|
|
|
|
2016-11-09 01:09:03 +08:00
|
|
|
memzero_extent_buffer(leaf, (unsigned long)item, sizeof(*item));
|
2012-01-17 04:04:48 +08:00
|
|
|
|
|
|
|
btrfs_cpu_balance_args_to_disk(&disk_bargs, &bctl->data);
|
|
|
|
btrfs_set_balance_data(leaf, item, &disk_bargs);
|
|
|
|
btrfs_cpu_balance_args_to_disk(&disk_bargs, &bctl->meta);
|
|
|
|
btrfs_set_balance_meta(leaf, item, &disk_bargs);
|
|
|
|
btrfs_cpu_balance_args_to_disk(&disk_bargs, &bctl->sys);
|
|
|
|
btrfs_set_balance_sys(leaf, item, &disk_bargs);
|
|
|
|
|
|
|
|
btrfs_set_balance_flags(leaf, item, bctl->flags);
|
|
|
|
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
2016-09-10 09:39:03 +08:00
|
|
|
err = btrfs_commit_transaction(trans);
|
2012-01-17 04:04:48 +08:00
|
|
|
if (err && !ret)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
static int del_balance_item(struct btrfs_fs_info *fs_info)
|
2012-01-17 04:04:48 +08:00
|
|
|
{
|
2016-06-22 09:16:51 +08:00
|
|
|
struct btrfs_root *root = fs_info->tree_root;
|
2012-01-17 04:04:48 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret, err;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
trans = btrfs_start_transaction(root, 0);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = BTRFS_BALANCE_OBJECTID;
|
2016-01-26 00:51:31 +08:00
|
|
|
key.type = BTRFS_TEMPORARY_ITEM_KEY;
|
2012-01-17 04:04:48 +08:00
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
2016-09-10 09:39:03 +08:00
|
|
|
err = btrfs_commit_transaction(trans);
|
2012-01-17 04:04:48 +08:00
|
|
|
if (err && !ret)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
/*
|
|
|
|
* This is a heuristic used to reduce the number of chunks balanced on
|
|
|
|
* resume after balance was interrupted.
|
|
|
|
*/
|
|
|
|
static void update_balance_args(struct btrfs_balance_control *bctl)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Turn on soft mode for chunk types that were being converted.
|
|
|
|
*/
|
|
|
|
if (bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT)
|
|
|
|
bctl->data.flags |= BTRFS_BALANCE_ARGS_SOFT;
|
|
|
|
if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT)
|
|
|
|
bctl->sys.flags |= BTRFS_BALANCE_ARGS_SOFT;
|
|
|
|
if (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT)
|
|
|
|
bctl->meta.flags |= BTRFS_BALANCE_ARGS_SOFT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Turn on usage filter if it is not already used. The idea is
|
|
|
|
* that chunks that we have already balanced should be
|
|
|
|
* reasonably full. Don't do it for chunks that are being
|
|
|
|
* converted - that will keep us from relocating unconverted
|
|
|
|
* (albeit full) chunks.
|
|
|
|
*/
|
|
|
|
if (!(bctl->data.flags & BTRFS_BALANCE_ARGS_USAGE) &&
|
2015-10-21 00:22:13 +08:00
|
|
|
!(bctl->data.flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) &&
|
2012-01-17 04:04:48 +08:00
|
|
|
!(bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT)) {
|
|
|
|
bctl->data.flags |= BTRFS_BALANCE_ARGS_USAGE;
|
|
|
|
bctl->data.usage = 90;
|
|
|
|
}
|
|
|
|
if (!(bctl->sys.flags & BTRFS_BALANCE_ARGS_USAGE) &&
|
2015-10-21 00:22:13 +08:00
|
|
|
!(bctl->sys.flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) &&
|
2012-01-17 04:04:48 +08:00
|
|
|
!(bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT)) {
|
|
|
|
bctl->sys.flags |= BTRFS_BALANCE_ARGS_USAGE;
|
|
|
|
bctl->sys.usage = 90;
|
|
|
|
}
|
|
|
|
if (!(bctl->meta.flags & BTRFS_BALANCE_ARGS_USAGE) &&
|
2015-10-21 00:22:13 +08:00
|
|
|
!(bctl->meta.flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) &&
|
2012-01-17 04:04:48 +08:00
|
|
|
!(bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT)) {
|
|
|
|
bctl->meta.flags |= BTRFS_BALANCE_ARGS_USAGE;
|
|
|
|
bctl->meta.usage = 90;
|
|
|
|
}
|
|
|
|
}
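The per-type adjustment that update_balance_args() repeats for data, sys and meta can be condensed into one helper. The sketch below assumes simplified types: the flag values, `struct args` and `adjust_for_resume` are hypothetical and do not match the kernel's definitions. Because the usage branch in the original is skipped whenever the convert flag is set, an early return gives the same result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define ARGS_USAGE		(1u << 0)
#define ARGS_USAGE_RANGE	(1u << 1)
#define ARGS_CONVERT		(1u << 2)
#define ARGS_SOFT		(1u << 3)

/* Hypothetical, reduced stand-in for struct btrfs_balance_args. */
struct args {
	uint32_t flags;
	uint32_t usage;
};

/* Apply the resume heuristic to one chunk type's arguments. */
static void adjust_for_resume(struct args *a)
{
	assert(a);

	/* A conversion in progress resumes in soft mode and gets no usage filter. */
	if (a->flags & ARGS_CONVERT) {
		a->flags |= ARGS_SOFT;
		return;
	}

	/* No usage filter given by the user: skip chunks that are >= 90% full. */
	if (!(a->flags & (ARGS_USAGE | ARGS_USAGE_RANGE))) {
		a->flags |= ARGS_USAGE;
		a->usage = 90;
	}
}
```

A user-supplied usage or usage-range filter is left untouched, which matches the three conditions tested in the original.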
|
|
|
|
|
2018-03-21 03:23:09 +08:00
|
|
|
/*
|
|
|
|
* Clear the balance status in fs_info and delete the balance item from disk.
|
|
|
|
*/
|
|
|
|
static void reset_balance_state(struct btrfs_fs_info *fs_info)
|
2012-01-17 04:04:47 +08:00
|
|
|
{
|
|
|
|
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
2018-03-21 03:23:09 +08:00
|
|
|
int ret;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
|
|
|
BUG_ON(!fs_info->balance_ctl);
|
|
|
|
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
fs_info->balance_ctl = NULL;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
|
|
|
|
kfree(bctl);
|
2018-03-21 03:23:09 +08:00
|
|
|
ret = del_balance_item(fs_info);
|
|
|
|
if (ret)
|
|
|
|
btrfs_handle_fs_error(fs_info, ret, NULL);
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/*
|
|
|
|
* Balance filters. Return 1 if chunk should be filtered out
|
|
|
|
* (should not be balanced).
|
|
|
|
*/
|
2012-03-27 22:09:16 +08:00
|
|
|
static int chunk_profiles_filter(u64 chunk_type,
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
2012-03-27 22:09:16 +08:00
|
|
|
chunk_type = chunk_to_extended(chunk_type) &
|
|
|
|
BTRFS_EXTENDED_PROFILE_MASK;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
if (bargs->profiles & chunk_type)
|
2012-01-17 04:04:47 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2015-11-17 19:29:32 +08:00
|
|
|
static int chunk_usage_range_filter(struct btrfs_fs_info *fs_info, u64 chunk_offset,
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_balance_args *bargs)
|
2015-10-21 00:22:13 +08:00
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
u64 chunk_used;
|
|
|
|
u64 user_thresh_min;
|
|
|
|
u64 user_thresh_max;
|
|
|
|
int ret = 1;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
|
|
|
|
chunk_used = btrfs_block_group_used(&cache->item);
|
|
|
|
|
|
|
|
if (bargs->usage_min == 0)
|
|
|
|
user_thresh_min = 0;
|
|
|
|
else
|
|
|
|
user_thresh_min = div_factor_fine(cache->key.offset,
|
|
|
|
bargs->usage_min);
|
|
|
|
|
|
|
|
if (bargs->usage_max == 0)
|
|
|
|
user_thresh_max = 1;
|
|
|
|
else if (bargs->usage_max > 100)
|
|
|
|
user_thresh_max = cache->key.offset;
|
|
|
|
else
|
|
|
|
user_thresh_max = div_factor_fine(cache->key.offset,
|
|
|
|
bargs->usage_max);
|
|
|
|
|
|
|
|
if (user_thresh_min <= chunk_used && chunk_used < user_thresh_max)
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return ret;
|
|
|
|
}
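The threshold arithmetic in chunk_usage_range_filter() can be tried outside the kernel with the sketch below. It assumes that div_factor_fine() computes a percentage of a byte count (num * factor / 100); `percent_of` is a hypothetical stand-in for it. Unlike the original, which returns 0 to keep a chunk and 1 to filter it out, `usage_range_selects` returns true when the chunk would be kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for div_factor_fine(): `percent` percent of `num`. */
static uint64_t percent_of(uint64_t num, uint32_t percent)
{
	/* Guard against overflow in the multiplication below. */
	assert(percent == 0 || num <= UINT64_MAX / percent);
	return num * percent / 100;
}

/*
 * True when a chunk of `length` bytes with `used` bytes allocated lies in
 * the half-open usage window [usage_min%, usage_max%).
 */
static bool usage_range_selects(uint64_t length, uint64_t used,
				uint32_t usage_min, uint32_t usage_max)
{
	uint64_t lo = usage_min == 0 ? 0 : percent_of(length, usage_min);
	uint64_t hi;

	if (usage_max == 0)
		hi = 1;		/* only completely empty chunks match */
	else if (usage_max > 100)
		hi = length;	/* cap the upper bound at the chunk length */
	else
		hi = percent_of(length, usage_max);

	return lo <= used && used < hi;
}
```

For a 1000-byte chunk and a 10..50 window, 499 used bytes are selected and 500 are not, because the upper bound is exclusive.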
|
|
|
|
|
2015-11-17 19:29:32 +08:00
|
|
|
static int chunk_usage_filter(struct btrfs_fs_info *fs_info,
|
2015-10-21 00:22:13 +08:00
|
|
|
u64 chunk_offset, struct btrfs_balance_args *bargs)
|
2012-01-17 04:04:47 +08:00
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
u64 chunk_used, user_thresh;
|
|
|
|
int ret = 1;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
|
|
|
|
chunk_used = btrfs_block_group_used(&cache->item);
|
|
|
|
|
2015-10-21 00:22:13 +08:00
|
|
|
if (bargs->usage_min == 0)
|
2013-02-13 00:28:59 +08:00
|
|
|
user_thresh = 1;
|
2013-01-21 21:15:56 +08:00
|
|
|
else if (bargs->usage > 100)
|
|
|
|
user_thresh = cache->key.offset;
|
|
|
|
else
|
|
|
|
user_thresh = div_factor_fine(cache->key.offset,
|
|
|
|
bargs->usage);
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
if (chunk_used < user_thresh)
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
static int chunk_devid_filter(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk,
|
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
struct btrfs_stripe *stripe;
|
|
|
|
int num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
stripe = btrfs_stripe_nr(chunk, i);
|
|
|
|
if (btrfs_stripe_devid(leaf, stripe) == bargs->devid)
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
/* [pstart, pend) */
|
|
|
|
static int chunk_drange_filter(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk,
|
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
struct btrfs_stripe *stripe;
|
|
|
|
int num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
|
|
|
u64 stripe_offset;
|
|
|
|
u64 stripe_length;
|
|
|
|
int factor;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (!(bargs->flags & BTRFS_BALANCE_ARGS_DEVID))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (btrfs_chunk_type(leaf, chunk) & (BTRFS_BLOCK_GROUP_DUP |
|
2013-01-30 07:40:14 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)) {
|
|
|
|
factor = num_stripes / 2;
|
|
|
|
} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID5) {
|
|
|
|
factor = num_stripes - 1;
|
|
|
|
} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID6) {
|
|
|
|
factor = num_stripes - 2;
|
|
|
|
} else {
|
|
|
|
factor = num_stripes;
|
|
|
|
}
|
2012-01-17 04:04:48 +08:00
|
|
|
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
stripe = btrfs_stripe_nr(chunk, i);
|
|
|
|
if (btrfs_stripe_devid(leaf, stripe) != bargs->devid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
stripe_offset = btrfs_stripe_offset(leaf, stripe);
|
|
|
|
stripe_length = btrfs_chunk_length(leaf, chunk);
|
2015-01-17 00:26:13 +08:00
|
|
|
stripe_length = div_u64(stripe_length, factor);
|
2012-01-17 04:04:48 +08:00
|
|
|
|
|
|
|
if (stripe_offset < bargs->pend &&
|
|
|
|
stripe_offset + stripe_length > bargs->pstart)
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
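Two pieces of chunk_drange_filter() are easy to check in isolation: the `factor` that turns a chunk length into a per-device stripe length, and the half-open interval test against [pstart, pend). The sketch below is a userspace approximation; `enum profile`, `data_stripe_factor`, `per_device_length` and `ranges_overlap` are hypothetical names, and the real code tests BTRFS_BLOCK_GROUP_* bits instead of an enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical profile tags standing in for the BTRFS_BLOCK_GROUP_* bits. */
enum profile {
	PROFILE_STRIPED,	/* single or RAID0: every stripe holds data */
	PROFILE_MIRRORED,	/* DUP, RAID1, RAID10: half the stripes are copies */
	PROFILE_RAID5,		/* one stripe's worth of parity */
	PROFILE_RAID6,		/* two stripes' worth of parity */
};

/* Number of stripes that carry distinct data, as in the `factor` logic. */
static int data_stripe_factor(enum profile p, int num_stripes)
{
	switch (p) {
	case PROFILE_MIRRORED:
		return num_stripes / 2;
	case PROFILE_RAID5:
		return num_stripes - 1;
	case PROFILE_RAID6:
		return num_stripes - 2;
	default:
		return num_stripes;
	}
}

/* Bytes one device contributes to a chunk of `chunk_length` logical bytes. */
static uint64_t per_device_length(uint64_t chunk_length, enum profile p,
				  int num_stripes)
{
	int factor = data_stripe_factor(p, num_stripes);

	assert(factor > 0);	/* guard the division; the excerpt has no such check */
	return chunk_length / (uint64_t)factor;
}

/* True when [a_start, a_start + a_len) intersects [b_start, b_end). */
static bool ranges_overlap(uint64_t a_start, uint64_t a_len,
			   uint64_t b_start, uint64_t b_end)
{
	return a_start < b_end && a_start + a_len > b_start;
}
```

The same overlap test, applied to the chunk's logical offset and length, is what chunk_vrange_filter() uses for [vstart, vend).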
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
/* [vstart, vend) */
|
|
|
|
static int chunk_vrange_filter(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk,
|
|
|
|
u64 chunk_offset,
|
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
if (chunk_offset < bargs->vend &&
|
|
|
|
chunk_offset + btrfs_chunk_length(leaf, chunk) > bargs->vstart)
|
|
|
|
/* at least part of the chunk is inside this vrange */
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2015-09-29 06:32:41 +08:00
|
|
|
static int chunk_stripes_range_filter(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk,
|
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
int num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
|
|
|
|
|
|
|
if (bargs->stripes_min <= num_stripes
|
|
|
|
&& num_stripes <= bargs->stripes_max)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
static int chunk_soft_convert_filter(u64 chunk_type,
|
2012-01-17 04:04:48 +08:00
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
if (!(bargs->flags & BTRFS_BALANCE_ARGS_CONVERT))
|
|
|
|
return 0;
|
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
chunk_type = chunk_to_extended(chunk_type) &
|
|
|
|
BTRFS_EXTENDED_PROFILE_MASK;
|
2012-01-17 04:04:48 +08:00
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
if (bargs->target == chunk_type)
|
2012-01-17 04:04:48 +08:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static int should_balance_chunk(struct btrfs_fs_info *fs_info,
|
2012-01-17 04:04:47 +08:00
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk, u64 chunk_offset)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_balance_args *bargs = NULL;
|
|
|
|
u64 chunk_type = btrfs_chunk_type(leaf, chunk);
|
|
|
|
|
|
|
|
/* type filter */
|
|
|
|
if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
|
|
|
|
(bctl->flags & BTRFS_BALANCE_TYPE_MASK))) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (chunk_type & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
bargs = &bctl->data;
|
|
|
|
else if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
bargs = &bctl->sys;
|
|
|
|
else if (chunk_type & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
bargs = &bctl->meta;
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/* profiles filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_PROFILES) &&
|
|
|
|
chunk_profiles_filter(chunk_type, bargs)) {
|
|
|
|
return 0;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* usage filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_USAGE) &&
|
2016-06-23 06:54:23 +08:00
|
|
|
chunk_usage_filter(fs_info, chunk_offset, bargs)) {
|
2012-01-17 04:04:47 +08:00
|
|
|
return 0;
|
2015-10-21 00:22:13 +08:00
|
|
|
} else if ((bargs->flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) &&
|
2016-06-23 06:54:23 +08:00
|
|
|
chunk_usage_range_filter(fs_info, chunk_offset, bargs)) {
|
2015-10-21 00:22:13 +08:00
|
|
|
return 0;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* devid filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_DEVID) &&
|
|
|
|
chunk_devid_filter(leaf, chunk, bargs)) {
|
|
|
|
return 0;
|
2012-01-17 04:04:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* drange filter, makes sense only with devid filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_DRANGE) &&
|
2017-07-19 15:48:42 +08:00
|
|
|
chunk_drange_filter(leaf, chunk, bargs)) {
|
2012-01-17 04:04:48 +08:00
|
|
|
return 0;
|
2012-01-17 04:04:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* vrange filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_VRANGE) &&
|
|
|
|
chunk_vrange_filter(leaf, chunk, chunk_offset, bargs)) {
|
|
|
|
return 0;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
2015-09-29 06:32:41 +08:00
|
|
|
/* stripes filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_STRIPES_RANGE) &&
|
|
|
|
chunk_stripes_range_filter(leaf, chunk, bargs)) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
/* soft profile changing mode */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_SOFT) &&
|
|
|
|
chunk_soft_convert_filter(chunk_type, bargs)) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-05-07 23:37:51 +08:00
|
|
|
/*
|
|
|
|
* limited by count, must be the last filter
|
|
|
|
*/
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_LIMIT)) {
|
|
|
|
if (bargs->limit == 0)
|
|
|
|
return 0;
|
|
|
|
else
|
|
|
|
bargs->limit--;
|
2015-10-10 23:16:50 +08:00
|
|
|
} else if ((bargs->flags & BTRFS_BALANCE_ARGS_LIMIT_RANGE)) {
|
|
|
|
/*
|
|
|
|
* Same logic as the 'limit' filter; the minimum cannot be
|
2016-05-20 09:18:45 +08:00
|
|
|
* determined here because we do not have the global information
|
2015-10-10 23:16:50 +08:00
|
|
|
* about the count of all chunks that satisfy the filters.
|
|
|
|
*/
|
|
|
|
if (bargs->limit_max == 0)
|
|
|
|
return 0;
|
|
|
|
else
|
|
|
|
bargs->limit_max--;
|
2014-05-07 23:37:51 +08:00
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
return 1;
|
|
|
|
}
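The comment in should_balance_chunk() says the limit filter must be the last one. A likely reason is that it has a side effect: each accepted chunk consumes one unit of the limit, so a chunk rejected by an earlier filter must not reach it. The sketch below illustrates that ordering under simplified assumptions; `struct limit_state`, `take_limit_slot` and `select_chunk` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the limit filter's state. */
struct limit_state {
	bool enabled;		/* was a limit requested at all? */
	uint64_t remaining;	/* chunks that may still be selected */
};

/* Consume one slot of the limit; false once the limit is exhausted. */
static bool take_limit_slot(struct limit_state *limit)
{
	assert(limit);

	if (!limit->enabled)
		return true;
	if (limit->remaining == 0)
		return false;
	limit->remaining--;
	return true;
}

/*
 * The limit is consulted only after every side-effect-free filter has
 * accepted the chunk, so rejected chunks never use up a slot.
 */
static bool select_chunk(bool passes_other_filters, struct limit_state *limit)
{
	if (!passes_other_filters)
		return false;
	return take_limit_slot(limit);
}
```

With a limit of 2, a chunk rejected by an earlier filter leaves the counter at 2, the next two accepted chunks bring it to 0, and any further chunk is refused.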
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
static int __btrfs_balance(struct btrfs_fs_info *fs_info)
|
2008-04-29 03:29:52 +08:00
|
|
|
{
|
2012-01-17 04:04:49 +08:00
|
|
|
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_root *chunk_root = fs_info->chunk_root;
|
|
|
|
struct btrfs_root *dev_root = fs_info->dev_root;
|
|
|
|
struct list_head *devices;
|
2008-04-29 03:29:52 +08:00
|
|
|
struct btrfs_device *device;
|
|
|
|
u64 old_size;
|
|
|
|
u64 size_to_free;
|
2015-10-10 23:16:50 +08:00
|
|
|
u64 chunk_type;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_chunk *chunk;
|
2016-07-13 02:24:21 +08:00
|
|
|
struct btrfs_path *path = NULL;
|
2008-04-29 03:29:52 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct extent_buffer *leaf;
|
|
|
|
int slot;
|
2012-01-17 04:04:47 +08:00
|
|
|
int ret;
|
|
|
|
int enospc_errors = 0;
|
2012-01-17 04:04:49 +08:00
|
|
|
bool counting = true;
|
2015-10-10 23:16:50 +08:00
|
|
|
/* The single value limit and min/max limits use the same bytes in the */
|
2014-05-07 23:37:51 +08:00
|
|
|
u64 limit_data = bctl->data.limit;
|
|
|
|
u64 limit_meta = bctl->meta.limit;
|
|
|
|
u64 limit_sys = bctl->sys.limit;
|
2015-10-10 23:16:50 +08:00
|
|
|
u32 count_data = 0;
|
|
|
|
u32 count_meta = 0;
|
|
|
|
u32 count_sys = 0;
|
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance deletes the last data chunk when the filesystem holds
no data, so the "fi usage" command then shows no data chunk.
When we later write to the fs, the only available data profile is
0x0, so all new chunks are allocated with the single profile.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
|
|
|
int chunk_reserved = 0;
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
/* step one, make some room on all the devices */
|
2012-01-17 04:04:47 +08:00
|
|
|
devices = &fs_info->fs_devices->devices;
|
2009-01-21 23:59:08 +08:00
|
|
|
list_for_each_entry(device, devices, dev_list) {
|
2014-09-03 21:35:38 +08:00
|
|
|
old_size = btrfs_device_get_total_bytes(device);
|
2008-04-29 03:29:52 +08:00
|
|
|
size_to_free = div_factor(old_size, 1);
|
2015-12-15 00:42:10 +08:00
|
|
|
size_to_free = min_t(u64, size_to_free, SZ_1M);
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_device_get_total_bytes(device) -
|
|
|
|
btrfs_device_get_bytes_used(device) > size_to_free ||
|
2017-12-04 12:54:55 +08:00
|
|
|
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
|
2008-04-29 03:29:52 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
ret = btrfs_shrink_device(device, old_size - size_to_free);
|
2009-09-12 04:11:19 +08:00
|
|
|
if (ret == -ENOSPC)
|
|
|
|
break;
|
2016-07-13 02:24:21 +08:00
|
|
|
if (ret) {
|
|
|
|
/* btrfs_shrink_device never returns ret > 0 */
|
|
|
|
WARN_ON(ret > 0);
|
|
|
|
goto error;
|
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(dev_root, 0);
|
2016-07-13 02:24:21 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
btrfs_info_in_rcu(fs_info,
|
|
|
|
"resize: unable to start transaction after shrinking device %s (error %d), old size %llu, new size %llu",
|
|
|
|
rcu_str_deref(device->name), ret,
|
|
|
|
old_size, old_size - size_to_free);
|
|
|
|
goto error;
|
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
ret = btrfs_grow_device(trans, device, old_size);
|
2016-07-13 02:24:21 +08:00
|
|
|
if (ret) {
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2016-07-13 02:24:21 +08:00
|
|
|
/* btrfs_grow_device never returns ret > 0 */
|
|
|
|
WARN_ON(ret > 0);
|
|
|
|
btrfs_info_in_rcu(fs_info,
|
|
|
|
"resize: unable to grow device after shrinking device %s (error %d), old size %llu, new size %llu",
|
|
|
|
rcu_str_deref(device->name), ret,
|
|
|
|
old_size, old_size - size_to_free);
|
|
|
|
goto error;
|
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2008-04-29 03:29:52 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* step two, relocate all the chunks */
|
|
|
|
path = btrfs_alloc_path();
|
2011-07-13 02:10:23 +08:00
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto error;
|
|
|
|
}
|
2012-01-17 04:04:49 +08:00
|
|
|
|
|
|
|
/* zero out stat counters */
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
memset(&bctl->stat, 0, sizeof(bctl->stat));
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
again:
|
2014-05-07 23:37:51 +08:00
|
|
|
if (!counting) {
|
2015-10-10 23:16:50 +08:00
|
|
|
/*
|
|
|
|
* The single value limit and min/max limits use the same bytes
|
|
|
|
* in the
|
|
|
|
*/
|
2014-05-07 23:37:51 +08:00
|
|
|
bctl->data.limit = limit_data;
|
|
|
|
bctl->meta.limit = limit_meta;
|
|
|
|
bctl->sys.limit = limit_sys;
|
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
key.type = BTRFS_CHUNK_ITEM_KEY;
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2012-01-17 04:04:49 +08:00
|
|
|
if ((!counting && atomic_read(&fs_info->balance_pause_req)) ||
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_read(&fs_info->balance_cancel_req)) {
|
2012-01-17 04:04:49 +08:00
|
|
|
ret = -ECANCELED;
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is (each line is prefixed with the CPU it runs on):
[CPU 0] btrfs_balance()
[CPU 0]   btrfs_relocate_chunk()
[CPU 0]     btrfs_relocate_block_group(bg X)
[CPU 0]       btrfs_lookup_block_group(bg X)
[CPU 1] cleaner_kthread
[CPU 1]   locks fs_info->cleaner_mutex
[CPU 1]   btrfs_delete_unused_bgs()
[CPU 1]     finds bg X, which became unused in the previous transaction
[CPU 1]     checks bg X ->ro == 0, so it proceeds
[CPU 0]       sets bg X ->ro to 1 (btrfs_set_block_group_ro(bg X))
[CPU 0]       blocks on fs_info->cleaner_mutex
[CPU 1]     btrfs_remove_chunk(bg X)
[CPU 1]   unlocks fs_info->cleaner_mutex
[CPU 0]       acquires fs_info->cleaner_mutex
[CPU 0]       relocate_block_group()
[CPU 0]         --> does nothing, no extents found in the extent tree from bg X
[CPU 0]       unlocks fs_info->cleaner_mutex
[CPU 0]   btrfs_relocate_block_group(bg X) returns
[CPU 0]   btrfs_remove_chunk(bg X)
[CPU 0]     extent map not found
[CPU 0]       --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
mutex_lock(&fs_info->delete_unused_bgs_mutex);
|
2008-04-29 03:29:52 +08:00
|
|
|
ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
        CPU 0                                   CPU 1

 btrfs_balance()
  btrfs_relocate_chunk()
   btrfs_relocate_block_group(bg X)
    btrfs_lookup_block_group(bg X)

                                        cleaner_kthread
                                          locks fs_info->cleaner_mutex

                                          btrfs_delete_unused_bgs()
                                            finds bg X, which became
                                            unused in the previous
                                            transaction

                                            checks bg X ->ro == 0,
                                            so it proceeds
    sets bg X ->ro to 1
    (btrfs_set_block_group_ro(bg X))

    blocks on fs_info->cleaner_mutex
                                          btrfs_remove_chunk(bg X)

                                          unlocks fs_info->cleaner_mutex

    acquires fs_info->cleaner_mutex
    relocate_block_group()
      --> does nothing, no extents found in
          the extent tree from bg X
    unlocks fs_info->cleaner_mutex

   btrfs_relocate_block_group(bg X) returns

   btrfs_remove_chunk(bg X)
     extent map not found
       --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
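The serialization described in the commit message above can be illustrated with a small userspace sketch. This is not the kernel code: it uses POSIX threads instead of kernel mutexes, and every name in it (demo_chunk, demo_remove_mutex, demo_balance, demo_cleaner, demo_run) is hypothetical. It only assumes what the message states, namely that balance and unused block group deletion both take one shared mutex, so whichever runs second sees that the chunk is already gone instead of trying to remove it again.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a block group and its chunk. */
struct demo_chunk {
	bool exists;   /* false once the chunk has been removed */
	int removals;  /* how many times removal actually ran */
};

/* Plays the role that the commit gives to delete_unused_bgs_mutex. */
static pthread_mutex_t demo_remove_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Remove the chunk if it is still there. Caller must hold the mutex. */
static void demo_remove_chunk(struct demo_chunk *c)
{
	if (!c->exists)
		return;               /* the other path already removed it */
	assert(c->removals == 0);     /* removal must never run twice */
	c->exists = false;
	c->removals++;
}

/* Balance side: look up and remove the chunk under the shared mutex. */
static void *demo_balance(void *arg)
{
	struct demo_chunk *c = arg;

	pthread_mutex_lock(&demo_remove_mutex);
	demo_remove_chunk(c);
	pthread_mutex_unlock(&demo_remove_mutex);
	return NULL;
}

/* Cleaner side: delete the unused chunk under the same mutex. */
static void *demo_cleaner(void *arg)
{
	struct demo_chunk *c = arg;

	pthread_mutex_lock(&demo_remove_mutex);
	demo_remove_chunk(c);
	pthread_mutex_unlock(&demo_remove_mutex);
	return NULL;
}

/* Run both paths concurrently; returns the removal count, or -1 on error. */
int demo_run(void)
{
	struct demo_chunk c = { .exists = true, .removals = 0 };
	pthread_t balance, cleaner;

	if (pthread_create(&balance, NULL, demo_balance, &c))
		return -1;
	if (pthread_create(&cleaner, NULL, demo_cleaner, &c)) {
		pthread_join(balance, NULL);
		return -1;
	}
	pthread_join(balance, NULL);
	pthread_join(cleaner, NULL);
	return c.removals;
}
```

Calling demo_run() from a main() function should report exactly one removal regardless of which thread wins the mutex. The existence check inside the critical section is what lets the later thread back off cleanly.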
2015-06-11 07:58:53 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2008-04-29 03:29:52 +08:00
|
|
|
goto error;
|
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this shouldn't happen, it means the last relocate
|
|
|
|
* failed
|
|
|
|
*/
|
|
|
|
if (ret == 0)
|
2012-01-17 04:04:47 +08:00
|
|
|
BUG(); /* FIXME break ? */
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
ret = btrfs_previous_item(chunk_root, path, 0,
|
|
|
|
BTRFS_CHUNK_ITEM_KEY);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (ret) {
|
2015-06-11 07:58:53 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2012-01-17 04:04:47 +08:00
|
|
|
ret = 0;
|
2008-04-29 03:29:52 +08:00
|
|
|
break;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, slot);
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2015-06-11 07:58:53 +08:00
|
|
|
if (found_key.objectid != key.objectid) {
|
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2008-04-29 03:29:52 +08:00
|
|
|
break;
|
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
|
2015-10-10 23:16:50 +08:00
|
|
|
chunk_type = btrfs_chunk_type(leaf, chunk);
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
if (!counting) {
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
bctl->stat.considered++;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = should_balance_chunk(fs_info, leaf, chunk,
|
2012-01-17 04:04:47 +08:00
|
|
|
found_key.offset);
|
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in the first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in the second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
Balance deletes the last data chunk when the filesystem holds no
data, which is why "fi usage" then shows no data chunk.
On the next write, the only available data profile is 0x0, so all
new chunks are allocated with the single profile.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Tested with the script above, and confirmed the logic through debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2015-06-11 07:58:53 +08:00
|
|
|
if (!ret) {
|
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2012-01-17 04:04:47 +08:00
|
|
|
goto loop;
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:
           CPU 0                                       CPU 1
btrfs_balance()
  btrfs_relocate_chunk()
    btrfs_relocate_block_group(bg X)
      btrfs_lookup_block_group(bg X)
                                           cleaner_kthread
                                             locks fs_info->cleaner_mutex
                                             btrfs_delete_unused_bgs()
                                               finds bg X, which became
                                               unused in the previous
                                               transaction
                                               checks bg X ->ro == 0,
                                               so it proceeds
      sets bg X ->ro to 1
      (btrfs_set_block_group_ro(bg X))
      blocks on fs_info->cleaner_mutex
                                               btrfs_remove_chunk(bg X)
                                             unlocks fs_info->cleaner_mutex
      acquires fs_info->cleaner_mutex
      relocate_block_group()
        --> does nothing, no extents found in
            the extent tree from bg X
      unlocks fs_info->cleaner_mutex
    btrfs_relocate_block_group(bg X) returns
    btrfs_remove_chunk(bg X)
      extent map not found
        --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
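The serialization described in this commit message can be modelled in user space. The following is only a minimal sketch, assuming a pthread mutex in place of fs_info->delete_unused_bgs_mutex; struct demo_bg and every demo_* function are hypothetical names invented for illustration and are not kernel APIs. The point it shows is that once both paths take the same mutex, whichever runs second sees that the chunk is already gone and returns an error instead of tripping an assertion.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block group; not the kernel structure. */
struct demo_bg {
	bool chunk_present;	/* true while the chunk's extent map still exists */
};

/* Plays the role of fs_info->delete_unused_bgs_mutex in this sketch. */
static pthread_mutex_t demo_delete_unused_bgs_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Remove the chunk; the caller must hold demo_delete_unused_bgs_mutex. */
static int demo_remove_chunk(struct demo_bg *bg)
{
	if (!bg->chunk_present)
		return -ENOENT;	/* already removed by the other path */
	bg->chunk_present = false;
	return 0;
}

/* Cleaner path: delete an unused block group. */
int demo_delete_unused_bg(struct demo_bg *bg)
{
	int ret;

	pthread_mutex_lock(&demo_delete_unused_bgs_mutex);
	ret = demo_remove_chunk(bg);
	pthread_mutex_unlock(&demo_delete_unused_bgs_mutex);
	return ret;
}

/* Balance path: relocate the block group, then remove its chunk. */
int demo_balance_one(struct demo_bg *bg)
{
	int ret;

	pthread_mutex_lock(&demo_delete_unused_bgs_mutex);
	/* Relocation of extents would happen here, still under the mutex. */
	ret = demo_remove_chunk(bg);
	pthread_mutex_unlock(&demo_delete_unused_bgs_mutex);
	return ret;
}
```

In this model, calling demo_delete_unused_bg() and then demo_balance_one() on the same block group makes the second call return -ENOENT rather than operate on a chunk that no longer exists.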
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
if (counting) {
|
2015-06-11 07:58:53 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2012-01-17 04:04:49 +08:00
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
bctl->stat.expected++;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
2015-10-10 23:16:50 +08:00
|
|
|
|
|
|
|
if (chunk_type & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
count_data++;
|
|
|
|
else if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
count_sys++;
|
|
|
|
else if (chunk_type & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
count_meta++;
|
|
|
|
|
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Apply limit_min filter, no need to check if the LIMITS
|
|
|
|
* filter is used, limit_min is 0 by default
|
|
|
|
*/
|
|
|
|
if (((chunk_type & BTRFS_BLOCK_GROUP_DATA) &&
|
|
|
|
count_data < bctl->data.limit_min)
|
|
|
|
|| ((chunk_type & BTRFS_BLOCK_GROUP_METADATA) &&
|
|
|
|
count_meta < bctl->meta.limit_min)
|
|
|
|
|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
|
|
|
|
count_sys < bctl->sys.limit_min)) {
|
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2012-01-17 04:04:49 +08:00
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
|
2017-11-16 07:28:11 +08:00
|
|
|
if (!chunk_reserved) {
|
|
|
|
/*
|
|
|
|
* We may be relocating the only data chunk we have,
|
|
|
|
* which could potentially end up with losing data's
|
|
|
|
* raid profile, so let's allocate an empty one in
|
|
|
|
* advance.
|
|
|
|
*/
|
|
|
|
ret = btrfs_may_alloc_data_chunk(fs_info,
|
|
|
|
found_key.offset);
|
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in the first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in the second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
When the filesystem holds no data, btrfs balance deletes the last data
chunk, so the "fi usage" command reports no data chunk.
On the next write to the fs, the only available data profile is 0x0,
so all new chunks are allocated with the single profile.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Tested with the above script, and confirmed the logic through debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
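The fix described above can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: struct demo_fs, enum demo_profile and the demo_* functions are hypothetical, and the condition used in demo_may_alloc_data_chunk() (no data bytes in use) is an assumption made for the example. It only reuses the return convention visible in the balance loop, where a return value of 1 means a chunk was allocated.

```c
#include <stdint.h>

/* Hypothetical profile values for this model only. */
enum demo_profile { DEMO_SINGLE = 0, DEMO_RAID1 = 1 };

/* Hypothetical, heavily simplified filesystem state. */
struct demo_fs {
	unsigned int data_chunks;	/* number of allocated data chunks */
	uint64_t data_bytes_used;	/* bytes of file data stored */
	enum demo_profile data_profile;	/* known only while data_chunks > 0 */
};

/* Returns 1 if an empty data chunk was allocated, 0 otherwise. */
int demo_may_alloc_data_chunk(struct demo_fs *fs)
{
	if (fs->data_bytes_used == 0) {
		fs->data_chunks++;	/* new chunk inherits fs->data_profile */
		return 1;
	}
	return 0;
}

/* Relocating an empty chunk simply removes it. */
void demo_relocate_empty_chunk(struct demo_fs *fs)
{
	fs->data_chunks--;
	if (fs->data_chunks == 0)
		fs->data_profile = DEMO_SINGLE;	/* profile information is lost */
}

/* Balance every data chunk that existed when the call started. */
void demo_balance(struct demo_fs *fs, int reserve_first)
{
	int chunk_reserved = 0;
	unsigned int todo = fs->data_chunks;

	while (todo--) {
		if (reserve_first && !chunk_reserved &&
		    demo_may_alloc_data_chunk(fs) == 1)
			chunk_reserved = 1;
		demo_relocate_empty_chunk(fs);
	}
}
```

With a single empty RAID1 data chunk, demo_balance(&fs, 0) leaves the model with no data chunk and the single profile, while demo_balance(&fs, 1) leaves one empty chunk and keeps RAID1.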
2015-11-09 11:51:32 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
|
|
|
goto error;
|
2017-11-16 07:28:11 +08:00
|
|
|
} else if (ret == 1) {
|
|
|
|
chunk_reserved = 1;
|
2015-11-09 11:51:32 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-21 22:40:19 +08:00
|
|
|
ret = btrfs_relocate_chunk(fs_info, found_key.offset);
|
2015-06-11 07:58:53 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2012-01-17 04:04:49 +08:00
|
|
|
if (ret == -ENOSPC) {
|
2012-01-17 04:04:47 +08:00
|
|
|
enospc_errors++;
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
} else if (ret == -ETXTBSY) {
|
|
|
|
btrfs_info(fs_info,
|
|
|
|
"skipping relocation of block group %llu due to active swapfile",
|
|
|
|
found_key.offset);
|
|
|
|
ret = 0;
|
|
|
|
} else if (ret) {
|
|
|
|
goto error;
|
2012-01-17 04:04:49 +08:00
|
|
|
} else {
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
bctl->stat.completed++;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
loop:
|
2013-08-27 18:50:44 +08:00
|
|
|
if (found_key.offset == 0)
|
|
|
|
break;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are movable, for example when we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then, if we can't, return -ENOSPC so that a btrfs-vol -r
does not remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
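The "return -ENOSPC and move on to the next chunk" behaviour, together with the error handling visible in the surrounding balance loop, can be sketched as follows. This is an illustrative user-space model only: demo_balance_chunks() is a hypothetical function, and the results array stands in for the per-chunk return values of the relocation step.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Walk the per-chunk relocation results.
 *  -ENOSPC  : count it and keep going with the next chunk
 *  -ETXTBSY : chunk is skipped (active swapfile), not an error
 *  other <0 : stop and report that error
 *  0        : chunk relocated, bump the completed counter
 */
int demo_balance_chunks(const int *results, size_t n, size_t *completed)
{
	int enospc_errors = 0;
	int ret = 0;
	size_t i;

	*completed = 0;
	for (i = 0; i < n; i++) {
		ret = results[i];
		if (ret == -ENOSPC) {
			enospc_errors++;
			ret = 0;
		} else if (ret == -ETXTBSY) {
			ret = 0;
		} else if (ret) {
			break;
		} else {
			(*completed)++;
		}
	}

	/* Report -ENOSPC at the end unless a harder error already occurred. */
	if (enospc_errors && !ret)
		ret = -ENOSPC;
	return ret;
}
```

For the input {0, -ENOSPC, 0} the model relocates two chunks and still returns -ENOSPC, so the caller learns that some chunks could not be moved without the whole operation stopping at the first full one.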
2009-09-12 04:11:19 +08:00
|
|
|
key.offset = found_key.offset - 1;
|
2008-04-29 03:29:52 +08:00
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
if (counting) {
|
|
|
|
btrfs_release_path(path);
|
|
|
|
counting = false;
|
|
|
|
goto again;
|
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
error:
|
|
|
|
btrfs_free_path(path);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (enospc_errors) {
|
2013-12-21 00:37:06 +08:00
|
|
|
btrfs_info(fs_info, "%d enospc errors during balance",
|
2016-09-20 22:05:00 +08:00
|
|
|
enospc_errors);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (!ret)
|
|
|
|
ret = -ENOSPC;
|
|
|
|
}
|
|
|
|
|
2008-04-29 03:29:52 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
/**
|
|
|
|
* alloc_profile_is_valid - see if a given profile is valid and reduced
|
|
|
|
* @flags: profile to validate
|
|
|
|
* @extended: if true @flags is treated as an extended profile
|
|
|
|
*/
|
|
|
|
static int alloc_profile_is_valid(u64 flags, int extended)
|
|
|
|
{
|
|
|
|
u64 mask = (extended ? BTRFS_EXTENDED_PROFILE_MASK :
|
|
|
|
BTRFS_BLOCK_GROUP_PROFILE_MASK);
|
|
|
|
|
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_TYPE_MASK;
|
|
|
|
|
|
|
|
/* 1) check that all other bits are zeroed */
|
|
|
|
if (flags & ~mask)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* 2) see if profile is reduced */
|
|
|
|
if (flags == 0)
|
|
|
|
return !extended; /* "0" is valid for usual profiles */
|
|
|
|
|
|
|
|
/* true if exactly one bit set */
|
2018-09-21 20:26:34 +08:00
|
|
|
return is_power_of_2(flags);
|
2012-03-27 22:09:17 +08:00
|
|
|
}
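The "valid and reduced" check above can be tried outside the kernel with a small stand-alone sketch. The DEMO_* bit values below are illustrative assumptions chosen for this example and should not be read as the on-disk flag definitions; demo_is_power_of_2() uses the usual n & (n - 1) test for "exactly one bit set".

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout (assumed for this sketch). */
#define DEMO_TYPE_MASK		0x7ULL		/* data, system, metadata */
#define DEMO_RAID0		(1ULL << 3)
#define DEMO_RAID1		(1ULL << 4)
#define DEMO_PROFILE_MASK	0x1f8ULL	/* bits 3..8: profile bits */
#define DEMO_SINGLE_BIT		(1ULL << 48)	/* extended "single" marker */
#define DEMO_EXTENDED_MASK	(DEMO_PROFILE_MASK | DEMO_SINGLE_BIT)

/* True if exactly one bit is set. */
static bool demo_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

bool demo_profile_is_valid(uint64_t flags, bool extended)
{
	uint64_t mask = extended ? DEMO_EXTENDED_MASK : DEMO_PROFILE_MASK;

	flags &= ~DEMO_TYPE_MASK;	/* ignore the block group type bits */

	if (flags & ~mask)		/* 1) no stray bits allowed */
		return false;

	if (flags == 0)			/* 2) "0" means single, usual form only */
		return !extended;

	return demo_is_power_of_2(flags);
}
```

In this model a plain 0 is accepted only in the usual (non-extended) form, the single marker bit is accepted only in the extended form, and any combination of two profile bits is rejected as not reduced.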
|
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
static inline int balance_need_close(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
2012-01-17 04:04:49 +08:00
|
|
|
/* cancel requested || normal exit path */
|
|
|
|
return atomic_read(&fs_info->balance_cancel_req) ||
|
|
|
|
(atomic_read(&fs_info->balance_pause_req) == 0 &&
|
|
|
|
atomic_read(&fs_info->balance_cancel_req) == 0);
|
2012-01-17 04:04:49 +08:00
|
|
|
}
|
|
|
|
|
2015-09-23 04:02:25 +08:00
|
|
|
/* Non-zero return value signifies invalidity */
|
|
|
|
static inline int validate_convert_profile(struct btrfs_balance_args *bctl_arg,
|
|
|
|
u64 allowed)
|
|
|
|
{
|
|
|
|
return ((bctl_arg->flags & BTRFS_BALANCE_ARGS_CONVERT) &&
|
|
|
|
(!alloc_profile_is_valid(bctl_arg->target, 1) ||
|
|
|
|
(bctl_arg->target & ~allowed)));
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/*
|
2018-03-21 07:20:05 +08:00
|
|
|
* Should be called with balance mutex held
|
2012-01-17 04:04:47 +08:00
|
|
|
*/
|
2018-05-07 23:44:03 +08:00
|
|
|
int btrfs_balance(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_balance_control *bctl,
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_ioctl_balance_args *bargs)
|
|
|
|
{
|
2017-03-08 06:34:44 +08:00
|
|
|
u64 meta_target, data_target;
|
2012-01-17 04:04:47 +08:00
|
|
|
u64 allowed;
|
2012-03-27 22:09:17 +08:00
|
|
|
int mixed = 0;
|
2012-01-17 04:04:47 +08:00
|
|
|
int ret;
|
2012-11-06 20:15:27 +08:00
|
|
|
u64 num_devices;
|
2013-01-29 18:13:12 +08:00
|
|
|
unsigned seq;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
if (btrfs_fs_closing(fs_info) ||
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_read(&fs_info->balance_pause_req) ||
|
|
|
|
atomic_read(&fs_info->balance_cancel_req)) {
|
2012-01-17 04:04:47 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
allowed = btrfs_super_incompat_flags(fs_info->super_copy);
|
|
|
|
if (allowed & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS)
|
|
|
|
mixed = 1;
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/*
|
|
|
|
* In case of mixed groups both data and meta should be picked,
|
|
|
|
* and identical options should be given for both of them.
|
|
|
|
*/
|
2012-03-27 22:09:17 +08:00
|
|
|
allowed = BTRFS_BALANCE_DATA | BTRFS_BALANCE_METADATA;
|
|
|
|
if (mixed && (bctl->flags & allowed)) {
|
2012-01-17 04:04:47 +08:00
|
|
|
if (!(bctl->flags & BTRFS_BALANCE_DATA) ||
|
|
|
|
!(bctl->flags & BTRFS_BALANCE_METADATA) ||
|
|
|
|
memcmp(&bctl->data, &bctl->meta, sizeof(bctl->data))) {
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_err(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: mixed groups data and metadata options must be the same");
|
2012-01-17 04:04:47 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-08-10 13:53:21 +08:00
|
|
|
num_devices = btrfs_num_devices(fs_info);
|
|
|
|
|
2016-03-24 02:22:59 +08:00
|
|
|
allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
|
|
|
|
if (num_devices > 1)
|
2012-01-17 04:04:48 +08:00
|
|
|
allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
|
2013-05-11 19:13:03 +08:00
|
|
|
if (num_devices > 2)
|
|
|
|
allowed |= BTRFS_BLOCK_GROUP_RAID5;
|
|
|
|
if (num_devices > 3)
|
|
|
|
allowed |= (BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID6);
|
2015-09-23 04:02:25 +08:00
|
|
|
if (validate_convert_profile(&bctl->data, allowed)) {
|
2018-05-16 10:51:26 +08:00
|
|
|
int index = btrfs_bg_flags_to_raid_index(bctl->data.target);
|
|
|
|
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_err(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: invalid convert data profile %s",
|
|
|
|
get_raid_name(index));
|
2012-01-17 04:04:48 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
2015-09-23 04:02:25 +08:00
|
|
|
if (validate_convert_profile(&bctl->meta, allowed)) {
|
2018-05-16 10:51:26 +08:00
|
|
|
int index = btrfs_bg_flags_to_raid_index(bctl->meta.target);
|
|
|
|
|
2013-12-21 00:37:06 +08:00
|
|
|
btrfs_err(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: invalid convert metadata profile %s",
|
|
|
|
get_raid_name(index));
|
2012-01-17 04:04:48 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
2015-09-23 04:02:25 +08:00
|
|
|
if (validate_convert_profile(&bctl->sys, allowed)) {
|
2018-05-16 10:51:26 +08:00
|
|
|
int index = btrfs_bg_flags_to_raid_index(bctl->sys.target);
|
|
|
|
|
2013-12-21 00:37:06 +08:00
|
|
|
btrfs_err(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: invalid convert system profile %s",
|
|
|
|
get_raid_name(index));
|
2012-01-17 04:04:48 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* allow reducing meta or sys integrity only if force is set */
|
|
|
|
allowed = BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
|
2013-01-30 07:40:14 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID5 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID6;
|
2013-01-29 18:13:12 +08:00
|
|
|
do {
|
|
|
|
seq = read_seqbegin(&fs_info->profiles_lock);
|
|
|
|
|
|
|
|
if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
|
|
|
|
(fs_info->avail_system_alloc_bits & allowed) &&
|
|
|
|
!(bctl->sys.target & allowed)) ||
|
|
|
|
((bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
|
|
|
|
(fs_info->avail_metadata_alloc_bits & allowed) &&
|
|
|
|
!(bctl->meta.target & allowed))) {
|
|
|
|
if (bctl->flags & BTRFS_BALANCE_FORCE) {
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_info(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: force reducing metadata integrity");
|
2013-01-29 18:13:12 +08:00
|
|
|
} else {
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_err(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: reduces metadata integrity, use --force if you want this");
|
2013-01-29 18:13:12 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-01-17 04:04:48 +08:00
|
|
|
}
|
2013-01-29 18:13:12 +08:00
|
|
|
} while (read_seqretry(&fs_info->profiles_lock, seq));
|
2012-01-17 04:04:48 +08:00
|
|
|
|
2017-03-08 06:34:44 +08:00
|
|
|
/* if we're not converting, the target field is uninitialized */
|
|
|
|
meta_target = (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) ?
|
|
|
|
bctl->meta.target : fs_info->avail_metadata_alloc_bits;
|
|
|
|
data_target = (bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT) ?
|
|
|
|
bctl->data.target : fs_info->avail_data_alloc_bits;
|
|
|
|
if (btrfs_get_num_tolerated_disk_barrier_failures(meta_target) <
|
|
|
|
btrfs_get_num_tolerated_disk_barrier_failures(data_target)) {
|
2018-05-16 10:51:26 +08:00
|
|
|
int meta_index = btrfs_bg_flags_to_raid_index(meta_target);
|
|
|
|
int data_index = btrfs_bg_flags_to_raid_index(data_target);
|
|
|
|
|
2016-01-06 16:46:12 +08:00
|
|
|
btrfs_warn(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: metadata profile %s has lower redundancy than data profile %s",
|
|
|
|
get_raid_name(meta_index), get_raid_name(data_index));
|
2016-01-06 16:46:12 +08:00
|
|
|
}
|
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
ret = insert_balance_item(fs_info, bctl);
|
2012-01-17 04:04:48 +08:00
|
|
|
if (ret && ret != -EEXIST)
|
2012-01-17 04:04:48 +08:00
|
|
|
goto out;
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
if (!(bctl->flags & BTRFS_BALANCE_RESUME)) {
|
|
|
|
BUG_ON(ret == -EEXIST);
|
2018-03-21 09:41:30 +08:00
|
|
|
BUG_ON(fs_info->balance_ctl);
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
fs_info->balance_ctl = bctl;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
2012-01-17 04:04:48 +08:00
|
|
|
} else {
|
|
|
|
BUG_ON(ret != -EEXIST);
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
update_balance_args(bctl);
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2018-03-21 08:31:04 +08:00
|
|
|
ASSERT(!test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
|
|
|
|
set_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
|
2012-01-17 04:04:47 +08:00
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
|
|
|
|
ret = __btrfs_balance(fs_info);
|
|
|
|
|
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
2018-03-21 08:31:04 +08:00
|
|
|
clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
|
2012-01-17 04:04:47 +08:00
|
|
|
|
|
|
|
if (bargs) {
|
|
|
|
memset(bargs, 0, sizeof(*bargs));
|
2018-03-21 09:05:27 +08:00
|
|
|
btrfs_update_ioctl_balance_args(fs_info, bargs);
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
2013-03-06 16:57:55 +08:00
|
|
|
if ((ret && ret != -ECANCELED && ret != -ENOSPC) ||
|
|
|
|
balance_need_close(fs_info)) {
|
2018-03-21 03:23:09 +08:00
|
|
|
reset_balance_state(fs_info);
|
2018-03-21 00:28:05 +08:00
|
|
|
clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
|
2013-03-06 16:57:55 +08:00
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
wake_up(&fs_info->balance_wait_q);
|
2012-01-17 04:04:47 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
out:
|
2012-01-17 04:04:48 +08:00
|
|
|
if (bctl->flags & BTRFS_BALANCE_RESUME)
|
2018-03-21 03:23:09 +08:00
|
|
|
reset_balance_state(fs_info);
|
2018-03-21 00:28:05 +08:00
|
|
|
else
|
2012-01-17 04:04:48 +08:00
|
|
|
kfree(bctl);
|
2018-03-21 00:28:05 +08:00
|
|
|
clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int balance_kthread(void *data)
|
|
|
|
{
|
2012-06-23 02:24:13 +08:00
|
|
|
struct btrfs_fs_info *fs_info = data;
|
2012-01-17 04:04:48 +08:00
|
|
|
int ret = 0;
|
2012-01-17 04:04:48 +08:00
|
|
|
|
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
2012-06-23 02:24:13 +08:00
|
|
|
if (fs_info->balance_ctl) {
|
2018-05-16 10:51:26 +08:00
|
|
|
btrfs_info(fs_info, "balance: resuming");
|
2018-05-07 23:44:03 +08:00
|
|
|
ret = btrfs_balance(fs_info, fs_info->balance_ctl, NULL);
|
2012-01-17 04:04:48 +08:00
|
|
|
}
|
2012-01-17 04:04:48 +08:00
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
2012-06-23 02:24:13 +08:00
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
return ret;
|
|
|
|
}

int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info)
{
	struct task_struct *tsk;

	mutex_lock(&fs_info->balance_mutex);
	if (!fs_info->balance_ctl) {
		mutex_unlock(&fs_info->balance_mutex);
		return 0;
	}
	mutex_unlock(&fs_info->balance_mutex);

	if (btrfs_test_opt(fs_info, SKIP_BALANCE)) {
		btrfs_info(fs_info, "balance: resume skipped");
		return 0;
	}

	/*
	 * A ro->rw remount sequence should continue with the paused balance
	 * regardless of who pauses it, system or the user as of now, so set
	 * the resume flag.
	 */
	spin_lock(&fs_info->balance_lock);
	fs_info->balance_ctl->flags |= BTRFS_BALANCE_RESUME;
	spin_unlock(&fs_info->balance_lock);

	tsk = kthread_run(balance_kthread, fs_info, "btrfs-balance");
	return PTR_ERR_OR_ZERO(tsk);
}
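
The function above returns `PTR_ERR_OR_ZERO(tsk)`, which relies on the kernel convention of encoding a negative errno in the pointer returned by `kthread_run()`. The following is a minimal userspace sketch of that convention, not the kernel's own header: the lower-case names `err_ptr`, `is_err`, `ptr_err` and `ptr_err_or_zero` are hypothetical stand-ins, and the 4095 limit is assumed to match the kernel's `MAX_ERRNO`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed to mirror the kernel's MAX_ERRNO: the top 4095 addresses are errors. */
#define MAX_ERRNO 4095

/* Hypothetical stand-in for ERR_PTR(): store a negative errno in a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Hypothetical stand-in for IS_ERR(): true if ptr lies in the error range. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for PTR_ERR(): recover the errno from the pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Hypothetical stand-in for PTR_ERR_OR_ZERO(): errno on failure, else 0. */
static inline long ptr_err_or_zero(const void *ptr)
{
	return is_err(ptr) ? ptr_err(ptr) : 0;
}
```

With this encoding a caller that only needs success or failure can collapse the usual `IS_ERR()`/`PTR_ERR()` pair into one return statement, which is what the last line of the function does.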

int btrfs_recover_balance(struct btrfs_fs_info *fs_info)
{
	struct btrfs_balance_control *bctl;
	struct btrfs_balance_item *item;
	struct btrfs_disk_balance_args disk_bargs;
	struct btrfs_path *path;
	struct extent_buffer *leaf;
	struct btrfs_key key;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	key.objectid = BTRFS_BALANCE_OBJECTID;
	key.type = BTRFS_TEMPORARY_ITEM_KEY;
	key.offset = 0;

	ret = btrfs_search_slot(NULL, fs_info->tree_root, &key, path, 0, 0);
	if (ret < 0)
		goto out;
	if (ret > 0) { /* ret = -ENOENT; */
		ret = 0;
		goto out;
	}

	bctl = kzalloc(sizeof(*bctl), GFP_NOFS);
	if (!bctl) {
		ret = -ENOMEM;
		goto out;
	}

	leaf = path->nodes[0];
	item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_balance_item);

	bctl->flags = btrfs_balance_flags(leaf, item);
	bctl->flags |= BTRFS_BALANCE_RESUME;

	btrfs_balance_data(leaf, item, &disk_bargs);
	btrfs_disk_balance_args_to_cpu(&bctl->data, &disk_bargs);
	btrfs_balance_meta(leaf, item, &disk_bargs);
	btrfs_disk_balance_args_to_cpu(&bctl->meta, &disk_bargs);
	btrfs_balance_sys(leaf, item, &disk_bargs);
	btrfs_disk_balance_args_to_cpu(&bctl->sys, &disk_bargs);

	/*
	 * This should never happen, as the paused balance state is recovered
	 * during mount without any chance of other exclusive ops to collide.
	 *
	 * This gives the exclusive op status to balance and keeps in paused
	 * state until user intervention (cancel or umount). If the ownership
	 * cannot be assigned, show a message but do not fail. The balance
	 * is in a paused state and must have fs_info::balance_ctl properly
	 * set up.
	 */
	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
		btrfs_warn(fs_info,
	"balance: cannot set exclusive op status, resume manually");

	mutex_lock(&fs_info->balance_mutex);
	BUG_ON(fs_info->balance_ctl);
	spin_lock(&fs_info->balance_lock);
	fs_info->balance_ctl = bctl;
	spin_unlock(&fs_info->balance_lock);
	mutex_unlock(&fs_info->balance_mutex);
out:
	btrfs_free_path(path);
	return ret;
}

int btrfs_pause_balance(struct btrfs_fs_info *fs_info)
{
	int ret = 0;

	mutex_lock(&fs_info->balance_mutex);
	if (!fs_info->balance_ctl) {
		mutex_unlock(&fs_info->balance_mutex);
		return -ENOTCONN;
	}

	if (test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags)) {
		atomic_inc(&fs_info->balance_pause_req);
		mutex_unlock(&fs_info->balance_mutex);

		wait_event(fs_info->balance_wait_q,
			   !test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));

		mutex_lock(&fs_info->balance_mutex);
		/* we are good with balance_ctl ripped off from under us */
		BUG_ON(test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
		atomic_dec(&fs_info->balance_pause_req);
	} else {
		ret = -ENOTCONN;
	}

	mutex_unlock(&fs_info->balance_mutex);
	return ret;
}

int btrfs_cancel_balance(struct btrfs_fs_info *fs_info)
{
	mutex_lock(&fs_info->balance_mutex);
	if (!fs_info->balance_ctl) {
		mutex_unlock(&fs_info->balance_mutex);
		return -ENOTCONN;
	}

	/*
	 * A paused balance with the item stored on disk can be resumed at
	 * mount time if the mount is read-write. Otherwise it's still paused
	 * and we must not allow cancelling as it deletes the item.
	 */
	if (sb_rdonly(fs_info->sb)) {
		mutex_unlock(&fs_info->balance_mutex);
		return -EROFS;
	}

	atomic_inc(&fs_info->balance_cancel_req);
	/*
	 * if we are running just wait and return, balance item is
	 * deleted in btrfs_balance in this case
	 */
	if (test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags)) {
		mutex_unlock(&fs_info->balance_mutex);
		wait_event(fs_info->balance_wait_q,
			   !test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
		mutex_lock(&fs_info->balance_mutex);
	} else {
		mutex_unlock(&fs_info->balance_mutex);
		/*
		 * Lock released to allow other waiters to continue, we'll
		 * reexamine the status again.
		 */
		mutex_lock(&fs_info->balance_mutex);

		if (fs_info->balance_ctl) {
			reset_balance_state(fs_info);
			clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
			btrfs_info(fs_info, "balance: canceled");
		}
	}

	BUG_ON(fs_info->balance_ctl ||
		test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
	atomic_dec(&fs_info->balance_cancel_req);
	mutex_unlock(&fs_info->balance_mutex);
	return 0;
}

static int btrfs_uuid_scan_kthread(void *data)
{
	struct btrfs_fs_info *fs_info = data;
	struct btrfs_root *root = fs_info->tree_root;
	struct btrfs_key key;
	struct btrfs_path *path = NULL;
	int ret = 0;
	struct extent_buffer *eb;
	int slot;
	struct btrfs_root_item root_item;
	u32 item_size;
	struct btrfs_trans_handle *trans = NULL;

	path = btrfs_alloc_path();
	if (!path) {
		ret = -ENOMEM;
		goto out;
	}

	key.objectid = 0;
	key.type = BTRFS_ROOT_ITEM_KEY;
	key.offset = 0;

	while (1) {
		ret = btrfs_search_forward(root, &key, path,
				BTRFS_OLDEST_GENERATION);
		if (ret) {
			if (ret > 0)
				ret = 0;
			break;
		}

		if (key.type != BTRFS_ROOT_ITEM_KEY ||
		    (key.objectid < BTRFS_FIRST_FREE_OBJECTID &&
		     key.objectid != BTRFS_FS_TREE_OBJECTID) ||
		    key.objectid > BTRFS_LAST_FREE_OBJECTID)
			goto skip;

		eb = path->nodes[0];
		slot = path->slots[0];
		item_size = btrfs_item_size_nr(eb, slot);
		if (item_size < sizeof(root_item))
			goto skip;

		read_extent_buffer(eb, &root_item,
				   btrfs_item_ptr_offset(eb, slot),
				   (int)sizeof(root_item));
		if (btrfs_root_refs(&root_item) == 0)
			goto skip;

		if (!btrfs_is_empty_uuid(root_item.uuid) ||
		    !btrfs_is_empty_uuid(root_item.received_uuid)) {
			if (trans)
				goto update_tree;

			btrfs_release_path(path);
			/*
			 * 1 - subvol uuid item
			 * 1 - received_subvol uuid item
			 */
			trans = btrfs_start_transaction(fs_info->uuid_root, 2);
			if (IS_ERR(trans)) {
				ret = PTR_ERR(trans);
				break;
			}
			continue;
		} else {
			goto skip;
		}
update_tree:
		if (!btrfs_is_empty_uuid(root_item.uuid)) {
			ret = btrfs_uuid_tree_add(trans, root_item.uuid,
						  BTRFS_UUID_KEY_SUBVOL,
						  key.objectid);
			if (ret < 0) {
				btrfs_warn(fs_info, "uuid_tree_add failed %d",
					ret);
				break;
			}
		}

		if (!btrfs_is_empty_uuid(root_item.received_uuid)) {
			ret = btrfs_uuid_tree_add(trans,
						  root_item.received_uuid,
						  BTRFS_UUID_KEY_RECEIVED_SUBVOL,
						  key.objectid);
			if (ret < 0) {
				btrfs_warn(fs_info, "uuid_tree_add failed %d",
					ret);
				break;
			}
		}

skip:
		if (trans) {
			ret = btrfs_end_transaction(trans);
			trans = NULL;
			if (ret)
				break;
		}

		btrfs_release_path(path);
		if (key.offset < (u64)-1) {
			key.offset++;
		} else if (key.type < BTRFS_ROOT_ITEM_KEY) {
			key.offset = 0;
			key.type = BTRFS_ROOT_ITEM_KEY;
		} else if (key.objectid < (u64)-1) {
			key.offset = 0;
			key.type = BTRFS_ROOT_ITEM_KEY;
			key.objectid++;
		} else {
			break;
		}
		cond_resched();
	}

out:
	btrfs_free_path(path);
	if (trans && !IS_ERR(trans))
		btrfs_end_transaction(trans);
	if (ret)
		btrfs_warn(fs_info, "btrfs_uuid_scan_kthread failed %d", ret);
	else
		set_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags);
	up(&fs_info->uuid_tree_rescan_sem);
	return 0;
}
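
The tail of the scan loop above steps the search key forward so that the next `btrfs_search_forward()` call resumes after the item just visited. As a rough userspace illustration of that stepping only (no tree access), the sketch below uses a hypothetical `struct scan_key` and `scan_key_advance()`; the value 132 for `ROOT_ITEM_KEY` is an assumed stand-in for `BTRFS_ROOT_ITEM_KEY`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ROOT_ITEM_KEY 132	/* assumed stand-in for BTRFS_ROOT_ITEM_KEY */

/* Hypothetical stand-in for struct btrfs_key (CPU byte order only). */
struct scan_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/*
 * Step the key forward the same way the loop does: bump the offset first,
 * then raise the type to ROOT_ITEM_KEY, then move to the next objectid.
 * Returns false when the key is already at the end of the key space.
 */
static bool scan_key_advance(struct scan_key *key)
{
	if (key->offset < UINT64_MAX) {
		key->offset++;
	} else if (key->type < ROOT_ITEM_KEY) {
		key->offset = 0;
		key->type = ROOT_ITEM_KEY;
	} else if (key->objectid < UINT64_MAX) {
		key->offset = 0;
		key->type = ROOT_ITEM_KEY;
		key->objectid++;
	} else {
		return false;
	}
	return true;
}
```

Carrying into the next field only when the current one is saturated keeps the key strictly increasing, so the loop cannot revisit an item.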

/*
 * Callback for btrfs_uuid_tree_iterate().
 * returns:
 * 0	check succeeded, the entry is not outdated.
 * < 0	if an error occurred.
 * > 0	if the check failed, which means the caller shall remove the entry.
 */
static int btrfs_check_uuid_tree_entry(struct btrfs_fs_info *fs_info,
				       u8 *uuid, u8 type, u64 subid)
{
	struct btrfs_key key;
	int ret = 0;
	struct btrfs_root *subvol_root;

	if (type != BTRFS_UUID_KEY_SUBVOL &&
	    type != BTRFS_UUID_KEY_RECEIVED_SUBVOL)
		goto out;

	key.objectid = subid;
	key.type = BTRFS_ROOT_ITEM_KEY;
	key.offset = (u64)-1;
	subvol_root = btrfs_read_fs_root_no_name(fs_info, &key);
	if (IS_ERR(subvol_root)) {
		ret = PTR_ERR(subvol_root);
		if (ret == -ENOENT)
			ret = 1;
		goto out;
	}

	switch (type) {
	case BTRFS_UUID_KEY_SUBVOL:
		if (memcmp(uuid, subvol_root->root_item.uuid, BTRFS_UUID_SIZE))
			ret = 1;
		break;
	case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
		if (memcmp(uuid, subvol_root->root_item.received_uuid,
			   BTRFS_UUID_SIZE))
			ret = 1;
		break;
	}

out:
	return ret;
}
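
The comment on the callback above describes a three-way return convention: 0 keeps the entry, a positive value asks the caller to remove it, and a negative value reports an error. How `btrfs_uuid_tree_iterate()` consumes it is not shown in this section, so the following is only a sketch of one plausible consumer, written over a plain array; `check_fn`, `prune_entries()` and `check_even()` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical check callback: 0 keeps the entry, > 0 drops it, < 0 aborts. */
typedef int (*check_fn)(int entry);

/*
 * Compact entries[] in place, keeping only entries the callback accepts.
 * Returns the new count, or the callback's negative error code.
 */
static int prune_entries(int *entries, int count, check_fn check)
{
	int kept = 0;

	for (int i = 0; i < count; i++) {
		int ret = check(entries[i]);

		if (ret < 0)
			return ret;	/* hard error: stop iterating */
		if (ret > 0)
			continue;	/* outdated entry: remove it */
		entries[kept++] = entries[i];
	}
	return kept;
}

/* Example callback: odd values count as outdated, negative values as errors. */
static int check_even(int entry)
{
	if (entry < 0)
		return -1;
	return entry % 2 != 0;
}
```

Keeping "remove" and "error" on opposite sides of zero lets the iterator tell a stale entry from a failed lookup with a single integer, which is why the callback maps `-ENOENT` to 1.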

static int btrfs_uuid_rescan_kthread(void *data)
{
	struct btrfs_fs_info *fs_info = (struct btrfs_fs_info *)data;
	int ret;

	/*
	 * 1st step is to iterate through the existing UUID tree and
	 * to delete all entries that contain outdated data.
	 * 2nd step is to add all missing entries to the UUID tree.
	 */
	ret = btrfs_uuid_tree_iterate(fs_info, btrfs_check_uuid_tree_entry);
	if (ret < 0) {
		btrfs_warn(fs_info, "iterating uuid_tree failed %d", ret);
		up(&fs_info->uuid_tree_rescan_sem);
		return ret;
	}
	return btrfs_uuid_scan_kthread(data);
}

int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info)
{
	struct btrfs_trans_handle *trans;
	struct btrfs_root *tree_root = fs_info->tree_root;
	struct btrfs_root *uuid_root;
	struct task_struct *task;
	int ret;

	/*
	 * 1 - root node
	 * 1 - root item
	 */
	trans = btrfs_start_transaction(tree_root, 2);
	if (IS_ERR(trans))
		return PTR_ERR(trans);

	uuid_root = btrfs_create_tree(trans, fs_info,
				      BTRFS_UUID_TREE_OBJECTID);
	if (IS_ERR(uuid_root)) {
		ret = PTR_ERR(uuid_root);
		btrfs_abort_transaction(trans, ret);
		btrfs_end_transaction(trans);
		return ret;
	}

	fs_info->uuid_root = uuid_root;

	ret = btrfs_commit_transaction(trans);
	if (ret)
		return ret;

	down(&fs_info->uuid_tree_rescan_sem);
	task = kthread_run(btrfs_uuid_scan_kthread, fs_info, "btrfs-uuid");
	if (IS_ERR(task)) {
		/* fs_info->update_uuid_tree_gen remains 0 in all error cases */
		btrfs_warn(fs_info, "failed to start uuid_scan task");
		up(&fs_info->uuid_tree_rescan_sem);
		return PTR_ERR(task);
	}

	return 0;
}

int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info)
{
	struct task_struct *task;

	down(&fs_info->uuid_tree_rescan_sem);
	task = kthread_run(btrfs_uuid_rescan_kthread, fs_info, "btrfs-uuid");
	if (IS_ERR(task)) {
		/* fs_info->update_uuid_tree_gen remains 0 in all error cases */
		btrfs_warn(fs_info, "failed to start uuid_rescan task");
		up(&fs_info->uuid_tree_rescan_sem);
		return PTR_ERR(task);
	}

	return 0;
}

/*
 * shrinking a device means finding all of the device extents past
 * the new size, and then following the back refs to the chunks.
 * The chunk relocation code actually frees the device extent
 */
int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
{
	struct btrfs_fs_info *fs_info = device->fs_info;
	struct btrfs_root *root = fs_info->dev_root;
	struct btrfs_trans_handle *trans;
	struct btrfs_dev_extent *dev_extent = NULL;
	struct btrfs_path *path;
	u64 length;
	u64 chunk_offset;
	int ret;
	int slot;
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
	int failed = 0;
	bool retried = false;
	bool checked_pending_chunks = false;
	struct extent_buffer *l;
	struct btrfs_key key;
	struct btrfs_super_block *super_copy = fs_info->super_copy;
	u64 old_total = btrfs_super_total_bytes(super_copy);
	u64 old_size = btrfs_device_get_total_bytes(device);
	u64 diff;

	new_size = round_down(new_size, fs_info->sectorsize);
	diff = round_down(old_size - new_size, fs_info->sectorsize);

	if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
		return -EINVAL;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	path->reada = READA_BACK;

	mutex_lock(&fs_info->chunk_mutex);

	btrfs_device_set_total_bytes(device, new_size);
	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
		device->fs_devices->total_rw_bytes -= diff;
		atomic64_sub(diff, &fs_info->free_chunk_space);
	}
	mutex_unlock(&fs_info->chunk_mutex);

again:
	key.objectid = device->devid;
	key.offset = (u64)-1;
	key.type = BTRFS_DEV_EXTENT_KEY;

	do {
		mutex_lock(&fs_info->delete_unused_bgs_mutex);
		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:

           CPU 0                                    CPU 1

 btrfs_balance()
   btrfs_relocate_chunk()
     btrfs_relocate_block_group(bg X)
       btrfs_lookup_block_group(bg X)

                                            cleaner_kthread
                                               locks fs_info->cleaner_mutex

                                               btrfs_delete_unused_bgs()
                                                 finds bg X, which became
                                                 unused in the previous
                                                 transaction

                                                 checks bg X ->ro == 0,
                                                 so it proceeds
       sets bg X ->ro to 1
       (btrfs_set_block_group_ro(bg X))

       blocks on fs_info->cleaner_mutex
                                                 btrfs_remove_chunk(bg X)

                                               unlocks fs_info->cleaner_mutex

       acquires fs_info->cleaner_mutex
       relocate_block_group()
         --> does nothing, no extents found in
             the extent tree from bg X
       unlocks fs_info->cleaner_mutex

   btrfs_relocate_block_group(bg X) returns

   btrfs_remove_chunk(bg X)
     extent map not found
       --> ASSERT(0)

Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
		if (ret < 0) {
			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
			goto done;
		}

		ret = btrfs_previous_item(root, path, 0, key.type);
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this is:
        CPU 0                                        CPU 1
   btrfs_balance()
     btrfs_relocate_chunk()
       btrfs_relocate_block_group(bg X)
         btrfs_lookup_block_group(bg X)
                                             cleaner_kthread
                                               locks fs_info->cleaner_mutex
                                               btrfs_delete_unused_bgs()
                                                 finds bg X, which became
                                                 unused in the previous
                                                 transaction
                                                 checks bg X ->ro == 0,
                                                 so it proceeds
         sets bg X ->ro to 1
         (btrfs_set_block_group_ro(bg X))
         blocks on fs_info->cleaner_mutex
                                                 btrfs_remove_chunk(bg X)
                                               unlocks fs_info->cleaner_mutex
         acquires fs_info->cleaner_mutex
         relocate_block_group()
           --> does nothing, no extents found in
               the extent tree from bg X
         unlocks fs_info->cleaner_mutex
       btrfs_relocate_block_group(bg X) returns
     btrfs_remove_chunk(bg X)
       extent map not found
          --> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
if (ret)
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2008-04-26 04:53:30 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto done;
|
|
|
|
if (ret) {
|
|
|
|
ret = 0;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-07-22 21:59:00 +08:00
|
|
|
break;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
l = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(l, &key, path->slots[0]);
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
if (key.objectid != device->devid) {
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-07-22 21:59:00 +08:00
|
|
|
break;
|
2009-09-12 04:11:19 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
|
|
|
|
length = btrfs_dev_extent_length(l, dev_extent);
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (key.offset + length <= new_size) {
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-04-27 19:29:03 +08:00
|
|
|
break;
|
2009-09-12 04:11:19 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2017-11-16 07:28:11 +08:00
|
|
|
/*
|
|
|
|
* We may be relocating the only data chunk we have,
|
|
|
|
* which could potentially end up with losing data's
|
|
|
|
* raid profile, so let's allocate an empty one in
|
|
|
|
* advance.
|
|
|
|
*/
|
|
|
|
ret = btrfs_may_alloc_data_chunk(fs_info, chunk_offset);
|
|
|
|
if (ret < 0) {
|
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_relocate_chunk(fs_info, chunk_offset);
|
|
|
|
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
if (ret == -ENOSPC) {
|
2009-09-12 04:11:19 +08:00
|
|
|
failed++;
|
2016-11-04 01:28:12 +08:00
|
|
|
} else if (ret) {
|
|
|
|
if (ret == -ETXTBSY) {
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"could not shrink block group %llu due to active swapfile",
|
|
|
|
chunk_offset);
|
|
|
|
}
|
|
|
|
goto done;
|
|
|
|
}
|
2012-03-27 22:09:18 +08:00
|
|
|
} while (key.offset-- > 0);
|
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
if (failed && !retried) {
|
|
|
|
failed = 0;
|
|
|
|
retried = true;
|
|
|
|
goto again;
|
|
|
|
} else if (failed && retried) {
|
|
|
|
ret = -ENOSPC;
|
|
|
|
goto done;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
2009-04-27 19:29:03 +08:00
|
|
|
/* Shrinking succeeded, else we would be at "done". */
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
2011-01-20 14:19:37 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2015-06-02 21:43:21 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We checked in the above loop all device extents that were already in
|
|
|
|
* the device tree. However before we have updated the device's
|
|
|
|
* total_bytes to the new size, we might have had chunk allocations that
|
|
|
|
* have not completed yet (new block groups attached to transaction
|
|
|
|
* handles), and therefore their device extents were not yet in the
|
|
|
|
* device tree and we missed them in the loop above. So if we have any
|
|
|
|
* pending chunk using a device extent that overlaps the device range
|
|
|
|
* that we can not use anymore, commit the current transaction and
|
|
|
|
* repeat the search on the device tree - this way we guarantee we will
|
|
|
|
* not have chunks using device extents that end beyond 'new_size'.
|
|
|
|
*/
|
|
|
|
if (!checked_pending_chunks) {
|
|
|
|
u64 start = new_size;
|
|
|
|
u64 len = old_size - new_size;
|
|
|
|
|
2015-06-15 21:41:17 +08:00
|
|
|
if (contains_pending_extent(trans->transaction, device,
|
|
|
|
&start, len)) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2015-06-02 21:43:21 +08:00
|
|
|
checked_pending_chunks = true;
|
|
|
|
failed = 0;
|
|
|
|
retried = false;
|
2016-09-10 09:39:03 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
2015-06-02 21:43:21 +08:00
|
|
|
if (ret)
|
|
|
|
goto done;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_device_set_disk_total_bytes(device, new_size);
|
2014-09-03 21:35:33 +08:00
|
|
|
if (list_empty(&device->resized_list))
|
|
|
|
list_add_tail(&device->resized_list,
|
2016-06-23 06:54:23 +08:00
|
|
|
&fs_info->fs_devices->resized_devices);
|
2009-04-27 19:29:03 +08:00
|
|
|
|
|
|
|
WARN_ON(diff > old_total);
|
2017-06-16 19:39:20 +08:00
|
|
|
btrfs_set_super_total_bytes(super_copy,
|
|
|
|
round_down(old_total - diff, fs_info->sectorsize));
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:41 +08:00
|
|
|
|
|
|
|
/* Now btrfs_update_device() will change the on-disk size. */
|
|
|
|
ret = btrfs_update_device(trans, device);
|
2018-08-06 18:12:37 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
} else {
|
|
|
|
ret = btrfs_commit_transaction(trans);
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
done:
|
|
|
|
btrfs_free_path(path);
|
2015-06-02 21:43:21 +08:00
|
|
|
if (ret) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2015-06-02 21:43:21 +08:00
|
|
|
btrfs_device_set_total_bytes(device, old_size);
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
|
2015-06-02 21:43:21 +08:00
|
|
|
device->fs_devices->total_rw_bytes += diff;
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_add(diff, &fs_info->free_chunk_space);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2015-06-02 21:43:21 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
return ret;
|
|
|
|
}
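The retry logic in the shrink path above (count `-ENOSPC` failures, make one more pass, then give up) can be sketched in userspace as follows. This is only an illustration under stated assumptions: `relocate_fn`, `relocate_all_with_retry`, `demo_ctx` and `demo_relocate` are hypothetical names that do not exist in the kernel, and the `done` array stands in for the fact that an already relocated chunk no longer shows up in the device tree on the second pass.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: 0 on success, -ENOSPC if the item cannot be moved yet. */
typedef int (*relocate_fn)(int item, void *ctx);

/*
 * Walk the items from last to first. -ENOSPC failures are counted and
 * skipped, any other error aborts immediately. If anything failed, make
 * exactly one more pass over the items that are still pending, and return
 * -ENOSPC if that second pass also leaves failures behind.
 */
static int relocate_all_with_retry(const int *items, bool *done, size_t n,
				   relocate_fn relocate, void *ctx)
{
	bool retried = false;
	int failed;

again:
	failed = 0;
	for (size_t i = n; i-- > 0; ) {
		int ret;

		if (done[i])
			continue;
		ret = relocate(items[i], ctx);
		if (ret == -ENOSPC) {
			failed++;
		} else if (ret) {
			return ret;
		} else {
			done[i] = true;
		}
	}
	if (failed && !retried) {
		retried = true;
		goto again;
	}
	return failed ? -ENOSPC : 0;
}

/* Demo context: item 7 only fits once some other item has been moved. */
struct demo_ctx {
	int moved;	/* successful relocations so far */
	int calls;	/* total callback invocations */
};

static int demo_relocate(int item, void *ctx)
{
	struct demo_ctx *c = ctx;

	c->calls++;
	if (item == 7 && c->moved == 0)
		return -ENOSPC;
	c->moved++;
	return 0;
}
```

With items `{3, 7}`, the first pass fails on 7 and moves 3, and the retry pass then moves 7. A single retry is enough here because moving other items is what frees the space, which appears to be the intent described in the commit message.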
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_key *key,
|
|
|
|
struct btrfs_chunk *chunk, int item_size)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_disk_key disk_key;
|
|
|
|
u32 array_size;
|
|
|
|
u8 *ptr;
|
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2008-03-25 03:01:56 +08:00
|
|
|
array_size = btrfs_super_sys_array_size(super_copy);
|
2014-04-21 20:13:11 +08:00
|
|
|
if (array_size + item_size + sizeof(disk_key)
|
2014-09-03 21:35:39 +08:00
|
|
|
> BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2008-03-25 03:01:56 +08:00
|
|
|
return -EFBIG;
|
2014-09-03 21:35:39 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
ptr = super_copy->sys_chunk_array + array_size;
|
|
|
|
btrfs_cpu_key_to_disk(&disk_key, key);
|
|
|
|
memcpy(ptr, &disk_key, sizeof(disk_key));
|
|
|
|
ptr += sizeof(disk_key);
|
|
|
|
memcpy(ptr, chunk, item_size);
|
|
|
|
item_size += sizeof(disk_key);
|
|
|
|
btrfs_set_super_sys_array_size(super_copy, array_size + item_size);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:39 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
return 0;
|
|
|
|
}
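The append logic in btrfs_add_system_chunk() above (bounds check, copy the key, copy the item, bump the recorded size) can be sketched in userspace as follows. This is a simplified illustration, not the kernel code: `demo_array`, `demo_array_append` and the 64-byte `DEMO_ARRAY_SIZE` are hypothetical, and the chunk_mutex locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical capacity; the kernel uses BTRFS_SYSTEM_CHUNK_ARRAY_SIZE. */
#define DEMO_ARRAY_SIZE 64

struct demo_array {
	uint32_t used;			/* bytes already occupied */
	uint8_t buf[DEMO_ARRAY_SIZE];	/* packed (key, item) pairs */
};

/*
 * Append a key followed by its item. Nothing is written unless both fit,
 * so a failed call leaves the array unchanged.
 */
static int demo_array_append(struct demo_array *arr,
			     const void *key, uint32_t key_size,
			     const void *item, uint32_t item_size)
{
	uint8_t *ptr;

	if (arr->used + key_size + item_size > DEMO_ARRAY_SIZE)
		return -EFBIG;

	ptr = arr->buf + arr->used;
	memcpy(ptr, key, key_size);
	ptr += key_size;
	memcpy(ptr, item, item_size);
	arr->used += key_size + item_size;
	return 0;
}
```

Checking the size before the first memcpy() keeps the array consistent on failure, which mirrors the early `-EFBIG` return in the function above.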
|
|
|
|
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
/*
|
|
|
|
* sort the devices in descending order by max_avail, total_avail
|
|
|
|
*/
|
|
|
|
static int btrfs_cmp_device_info(const void *a, const void *b)
|
2008-04-18 22:29:51 +08:00
|
|
|
{
|
2011-04-12 18:07:57 +08:00
|
|
|
const struct btrfs_device_info *di_a = a;
|
|
|
|
const struct btrfs_device_info *di_b = b;
|
2008-04-18 22:29:51 +08:00
|
|
|
|
2011-04-12 18:07:57 +08:00
|
|
|
if (di_a->max_avail > di_b->max_avail)
|
2011-01-05 18:07:28 +08:00
|
|
|
return -1;
|
2011-04-12 18:07:57 +08:00
|
|
|
if (di_a->max_avail < di_b->max_avail)
|
2011-01-05 18:07:28 +08:00
|
|
|
return 1;
|
2011-04-12 18:07:57 +08:00
|
|
|
if (di_a->total_avail > di_b->total_avail)
|
|
|
|
return -1;
|
|
|
|
if (di_a->total_avail < di_b->total_avail)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
2011-01-05 18:07:28 +08:00
|
|
|
}
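The ordering produced by the comparator above can be tried out in userspace with qsort(). The sketch below is an illustration only: `demo_device_info`, `demo_cmp_device_info` and `demo_sort_devices` are hypothetical stand-ins for the kernel structure and function, and the `id` field exists only so the result is easy to inspect.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_device_info. */
struct demo_device_info {
	int id;
	uint64_t max_avail;	/* largest contiguous free extent */
	uint64_t total_avail;	/* total free space on the device */
};

/*
 * Descending order by max_avail, with total_avail as the tie breaker.
 * Explicit comparisons are used instead of subtraction because the
 * difference of two 64-bit values does not fit in an int.
 */
static int demo_cmp_device_info(const void *a, const void *b)
{
	const struct demo_device_info *di_a = a;
	const struct demo_device_info *di_b = b;

	if (di_a->max_avail > di_b->max_avail)
		return -1;
	if (di_a->max_avail < di_b->max_avail)
		return 1;
	if (di_a->total_avail > di_b->total_avail)
		return -1;
	if (di_a->total_avail < di_b->total_avail)
		return 1;
	return 0;
}

static void demo_sort_devices(struct demo_device_info *devs, size_t n)
{
	qsort(devs, n, sizeof(*devs), demo_cmp_device_info);
}
```

For devices with (max_avail, total_avail) of (100, 500), (300, 300) and (100, 900), the device with max_avail 300 sorts first, and the two devices tied at 100 are ordered by total_avail, so the one with 900 comes before the one with 500.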
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
|
|
|
|
{
|
2015-01-20 15:11:44 +08:00
|
|
|
if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
|
2013-01-30 07:40:14 +08:00
|
|
|
return;
|
|
|
|
|
2013-04-11 18:30:16 +08:00
|
|
|
btrfs_set_fs_incompat(info, RAID56);
|
2013-01-30 07:40:14 +08:00
|
|
|
}
|
|
|
|
|
2018-01-30 18:20:46 +08:00
|
|
|
#define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info) \
|
2014-04-21 20:13:12 +08:00
|
|
|
- sizeof(struct btrfs_chunk)) \
|
|
|
|
/ sizeof(struct btrfs_stripe) + 1)
|
|
|
|
|
|
|
|
#define BTRFS_MAX_DEVS_SYS_CHUNK ((BTRFS_SYSTEM_CHUNK_ARRAY_SIZE \
|
|
|
|
- 2 * sizeof(struct btrfs_disk_key) \
|
|
|
|
- 2 * sizeof(struct btrfs_chunk)) \
|
|
|
|
/ sizeof(struct btrfs_stripe) + 1)
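The arithmetic behind BTRFS_MAX_DEVS_SYS_CHUNK above can be worked through with concrete numbers. The sizes used below are assumptions about the packed on-disk layout (a 2048-byte system chunk array, a 17-byte disk key, an 80-byte chunk item that already embeds one stripe, and a 32-byte stripe); they are not taken from this file, and the `DEMO_*` names and `demo_max_devs_sys_chunk` are hypothetical.

```c
#include <assert.h>

/* Assumed sizes, in bytes; check them against the on-disk format headers. */
enum {
	DEMO_SYS_ARRAY_SIZE = 2048,	/* BTRFS_SYSTEM_CHUNK_ARRAY_SIZE */
	DEMO_DISK_KEY_SIZE  = 17,	/* sizeof(struct btrfs_disk_key) */
	DEMO_CHUNK_SIZE     = 80,	/* sizeof(struct btrfs_chunk), one stripe embedded */
	DEMO_STRIPE_SIZE    = 32,	/* sizeof(struct btrfs_stripe) */
};

/*
 * Same shape as the macro: subtract two keys and two chunk headers from
 * the array size, divide the remainder by the stripe size, then add one
 * for the stripe that is already part of the chunk structure.
 */
static int demo_max_devs_sys_chunk(void)
{
	return (DEMO_SYS_ARRAY_SIZE
		- 2 * DEMO_DISK_KEY_SIZE
		- 2 * DEMO_CHUNK_SIZE)
		/ DEMO_STRIPE_SIZE + 1;
}
```

Under these assumed sizes the computation is (2048 - 34 - 160) / 32 + 1 = 1854 / 32 + 1 = 57 + 1 = 58, using integer division.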
|
|
|
|
|
2011-04-12 18:07:57 +08:00
|
|
|
static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
|
2017-02-11 02:46:27 +08:00
|
|
|
u64 start, u64 type)
|
2011-01-05 18:07:28 +08:00
|
|
|
{
|
2016-06-23 06:54:24 +08:00
|
|
|
struct btrfs_fs_info *info = trans->fs_info;
|
2011-04-12 18:07:57 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = info->fs_devices;
|
2017-06-27 15:02:24 +08:00
|
|
|
struct btrfs_device *device;
|
2011-04-12 18:07:57 +08:00
|
|
|
struct map_lookup *map = NULL;
|
|
|
|
struct extent_map_tree *em_tree;
|
|
|
|
struct extent_map *em;
|
|
|
|
struct btrfs_device_info *devices_info = NULL;
|
|
|
|
u64 total_avail;
|
|
|
|
int num_stripes; /* total number of stripes to allocate */
|
2013-01-30 07:40:14 +08:00
|
|
|
int data_stripes; /* number of stripes that count for
|
|
|
|
block group size */
|
2011-04-12 18:07:57 +08:00
|
|
|
int sub_stripes; /* sub_stripes info for map */
|
|
|
|
int dev_stripes; /* stripes per dev */
|
|
|
|
int devs_max; /* max devs to use */
|
|
|
|
int devs_min; /* min devs needed */
|
|
|
|
int devs_increment; /* ndevs has to be a multiple of this */
|
|
|
|
int ncopies; /* how many copies the data has */
|
|
|
|
int ret;
|
|
|
|
u64 max_stripe_size;
|
|
|
|
u64 max_chunk_size;
|
|
|
|
u64 stripe_size;
|
2018-10-05 05:24:39 +08:00
|
|
|
u64 chunk_size;
|
2011-04-12 18:07:57 +08:00
|
|
|
int ndevs;
|
|
|
|
int i;
|
|
|
|
int j;
|
2012-11-21 22:18:10 +08:00
|
|
|
int index;
|
2008-03-26 04:50:33 +08:00
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
BUG_ON(!alloc_profile_is_valid(type, 0));
|
2008-04-18 22:29:51 +08:00
|
|
|
|
2018-01-22 13:50:54 +08:00
|
|
|
if (list_empty(&fs_devices->alloc_list)) {
|
|
|
|
if (btrfs_test_opt(info, ENOSPC_DEBUG))
|
|
|
|
btrfs_debug(info, "%s: no writable device", __func__);
|
2011-04-12 18:07:57 +08:00
|
|
|
return -ENOSPC;
|
2018-01-22 13:50:54 +08:00
|
|
|
}
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2018-01-30 18:20:45 +08:00
|
|
|
index = btrfs_bg_flags_to_raid_index(type);
|
2011-04-12 18:07:57 +08:00
|
|
|
|
2012-11-21 22:18:10 +08:00
|
|
|
sub_stripes = btrfs_raid_array[index].sub_stripes;
|
|
|
|
dev_stripes = btrfs_raid_array[index].dev_stripes;
|
|
|
|
devs_max = btrfs_raid_array[index].devs_max;
|
|
|
|
devs_min = btrfs_raid_array[index].devs_min;
|
|
|
|
devs_increment = btrfs_raid_array[index].devs_increment;
|
|
|
|
ncopies = btrfs_raid_array[index].ncopies;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2008-04-18 22:29:51 +08:00
|
|
|
if (type & BTRFS_BLOCK_GROUP_DATA) {
|
2015-12-15 00:42:10 +08:00
|
|
|
max_stripe_size = SZ_1G;
|
2018-07-03 17:10:05 +08:00
|
|
|
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE;
|
2014-04-21 20:13:12 +08:00
|
|
|
if (!devs_max)
|
2018-01-30 18:20:46 +08:00
|
|
|
devs_max = BTRFS_MAX_DEVS(info);
|
2008-04-18 22:29:51 +08:00
|
|
|
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
|
2012-01-07 04:47:38 +08:00
|
|
|
/* for larger filesystems, use larger metadata chunks */
|
2015-12-15 00:42:10 +08:00
|
|
|
if (fs_devices->total_rw_bytes > 50ULL * SZ_1G)
|
|
|
|
max_stripe_size = SZ_1G;
|
2012-01-07 04:47:38 +08:00
|
|
|
else
|
2015-12-15 00:42:10 +08:00
|
|
|
max_stripe_size = SZ_256M;
|
2011-04-12 18:07:57 +08:00
|
|
|
max_chunk_size = max_stripe_size;
|
2014-04-21 20:13:12 +08:00
|
|
|
if (!devs_max)
|
2018-01-30 18:20:46 +08:00
|
|
|
devs_max = BTRFS_MAX_DEVS(info);
|
2008-04-18 23:55:51 +08:00
|
|
|
} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
|
2015-12-15 00:42:10 +08:00
|
|
|
max_stripe_size = SZ_32M;
|
2011-04-12 18:07:57 +08:00
|
|
|
max_chunk_size = 2 * max_stripe_size;
|
2014-04-21 20:13:12 +08:00
|
|
|
if (!devs_max)
|
|
|
|
devs_max = BTRFS_MAX_DEVS_SYS_CHUNK;
|
2011-04-12 18:07:57 +08:00
|
|
|
} else {
|
2014-05-15 22:48:20 +08:00
|
|
|
btrfs_err(info, "invalid chunk type 0x%llx requested",
|
2011-04-12 18:07:57 +08:00
|
|
|
type);
|
|
|
|
BUG_ON(1);
|
2008-04-18 22:29:51 +08:00
|
|
|
}
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
/* we don't want a chunk larger than 10% of writeable space */
|
|
|
|
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
|
|
|
|
max_chunk_size);
|
2008-04-18 22:29:51 +08:00
|
|
|
|
2015-02-21 01:00:26 +08:00
|
|
|
devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
|
2011-04-12 18:07:57 +08:00
|
|
|
GFP_NOFS);
|
|
|
|
if (!devices_info)
|
|
|
|
return -ENOMEM;
|
2010-03-18 04:45:56 +08:00
|
|
|
|
2010-04-06 21:37:47 +08:00
|
|
|
/*
|
2011-04-12 18:07:57 +08:00
|
|
|
* in the first pass through the devices list, we gather information
|
|
|
|
* about the available holes on each device.
|
2010-04-06 21:37:47 +08:00
|
|
|
*/
|
2011-04-12 18:07:57 +08:00
|
|
|
ndevs = 0;
|
2017-06-27 15:02:24 +08:00
|
|
|
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
|
2011-04-12 18:07:57 +08:00
|
|
|
u64 max_avail;
|
|
|
|
u64 dev_offset;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2012-11-03 18:58:34 +08:00
|
|
|
WARN(1, KERN_ERR
|
2013-12-21 00:37:06 +08:00
|
|
|
"BTRFS: read-only device in alloc_list\n");
|
2011-04-12 18:07:57 +08:00
|
|
|
continue;
|
|
|
|
}
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2017-12-04 12:54:53 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
|
|
|
|
&device->dev_state) ||
|
2017-12-04 12:54:55 +08:00
|
|
|
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
|
2011-04-12 18:07:57 +08:00
|
|
|
continue;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2011-04-12 18:07:57 +08:00
|
|
|
if (device->total_bytes > device->bytes_used)
|
|
|
|
total_avail = device->total_bytes - device->bytes_used;
|
|
|
|
else
|
|
|
|
total_avail = 0;
|
Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data stored at the end
of the device, and get the following:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-02 10:39:03 +08:00
|
|
|
|
|
|
|
/* If there is no space on this device, skip it. */
|
|
|
|
if (total_avail == 0)
|
|
|
|
continue;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
ret = find_free_dev_extent(trans, device,
|
2011-04-12 18:07:57 +08:00
|
|
|
max_stripe_size * dev_stripes,
|
|
|
|
&dev_offset, &max_avail);
|
|
|
|
if (ret && ret != -ENOSPC)
|
|
|
|
goto error;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2011-04-12 18:07:57 +08:00
|
|
|
if (ret == 0)
|
|
|
|
max_avail = max_stripe_size * dev_stripes;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2018-01-22 13:50:54 +08:00
|
|
|
if (max_avail < BTRFS_STRIPE_LEN * dev_stripes) {
|
|
|
|
if (btrfs_test_opt(info, ENOSPC_DEBUG))
|
|
|
|
btrfs_debug(info,
|
|
|
|
"%s: devid %llu has no free space, have=%llu want=%u",
|
|
|
|
__func__, device->devid, max_avail,
|
|
|
|
BTRFS_STRIPE_LEN * dev_stripes);
|
2011-04-12 18:07:57 +08:00
|
|
|
continue;
|
2018-01-22 13:50:54 +08:00
|
|
|
}
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2013-01-31 08:55:01 +08:00
|
|
|
if (ndevs == fs_devices->rw_devices) {
|
|
|
|
WARN(1, "%s: found more than %llu devices\n",
|
|
|
|
__func__, fs_devices->rw_devices);
|
|
|
|
break;
|
|
|
|
}
|
2011-04-12 18:07:57 +08:00
|
|
|
devices_info[ndevs].dev_offset = dev_offset;
|
|
|
|
devices_info[ndevs].max_avail = max_avail;
|
|
|
|
devices_info[ndevs].total_avail = total_avail;
|
|
|
|
devices_info[ndevs].dev = device;
|
|
|
|
++ndevs;
|
|
|
|
}
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2011-04-12 18:07:57 +08:00
|
|
|
/*
|
|
|
|
* now sort the devices by hole size / available space
|
|
|
|
*/
|
|
|
|
sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
|
|
|
|
btrfs_cmp_device_info, NULL);
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2011-04-12 18:07:57 +08:00
|
|
|
/* round down to number of usable stripes */
|
2017-06-27 15:02:25 +08:00
|
|
|
ndevs = round_down(ndevs, devs_increment);
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2018-01-31 13:56:15 +08:00
|
|
|
if (ndevs < devs_min) {
|
2011-04-12 18:07:57 +08:00
|
|
|
ret = -ENOSPC;
|
2018-01-22 13:50:54 +08:00
|
|
|
if (btrfs_test_opt(info, ENOSPC_DEBUG)) {
|
|
|
|
btrfs_debug(info,
|
|
|
|
"%s: not enough devices with free space: have=%d minimum required=%d",
|
2018-01-31 13:56:15 +08:00
|
|
|
__func__, ndevs, devs_min);
|
2018-01-22 13:50:54 +08:00
|
|
|
}
|
2011-04-12 18:07:57 +08:00
|
|
|
goto error;
|
2011-01-05 18:07:28 +08:00
|
|
|
}
|
2010-04-06 21:37:47 +08:00
|
|
|
|
2017-06-27 15:02:26 +08:00
|
|
|
ndevs = min(ndevs, devs_max);
|
|
|
|
|
2011-04-12 18:07:57 +08:00
|
|
|
/*
|
btrfs: alloc_chunk: fix DUP stripe size handling
In case of using DUP, we search for enough unallocated disk space on a
device to hold two stripes.
The devices_info[ndevs-1].max_avail that holds the amount of unallocated
space found is directly assigned to stripe_size, while it's actually
twice the stripe size.
Later on in the code, an unconditional division of stripe_size by
dev_stripes corrects the value, but in the meantime there's a check to
see if the stripe_size does not exceed max_chunk_size. Since during this
check stripe_size is twice the intended amount, the check will reduce
the stripe_size to max_chunk_size whenever the correct stripe_size
is more than half of max_chunk_size.
The unconditional division later tries to correct stripe_size, but will
actually make sure we can't allocate more than half the max_chunk_size.
Fix this by moving the division by dev_stripes before the max chunk size
check, so it always contains the right value, instead of putting a duct
tape division in further on to get it fixed again.
Since in all other cases than DUP, dev_stripes is 1, this change only
affects DUP.
Other attempts in the past were made to fix this:
* 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
to fix the same problem, but still resulted in part of the code acting
on a wrongly doubled stripe_size value.
* 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
broke this fix again.
The real problem was already introduced with the rest of the code in
73c5de0051.
The user visible result however will be that the max chunk size for DUP
will suddenly double, while it's actually acting according to the limits
in the code again like it was 5 years ago.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-02-06 00:45:11 +08:00
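The effect of moving the division can be illustrated with a small model. This is a sketch under assumptions: the two function names are hypothetical, the 16MB rounding and BTRFS_STRIPE_LEN alignment are left out, and the "old" variant is reconstructed from the description in the commit message rather than copied from the earlier source.

```c
#include <assert.h>
#include <stdint.h>

/* Old order, per the commit message: clamp first, divide afterwards. */
static uint64_t clamp_then_divide(uint64_t max_avail, uint64_t dev_stripes,
				  uint64_t data_stripes,
				  uint64_t max_chunk_size)
{
	uint64_t stripe_size = max_avail;	/* really dev_stripes stripes */

	assert(dev_stripes > 0 && data_stripes > 0);
	if (stripe_size * data_stripes > max_chunk_size)
		stripe_size = max_chunk_size / data_stripes;
	return stripe_size / dev_stripes;	/* late correction */
}

/* Fixed order: divide first so the clamp sees the real stripe size. */
static uint64_t divide_then_clamp(uint64_t max_avail, uint64_t dev_stripes,
				  uint64_t data_stripes,
				  uint64_t max_chunk_size)
{
	uint64_t stripe_size;

	assert(dev_stripes > 0 && data_stripes > 0);
	stripe_size = max_avail / dev_stripes;
	if (stripe_size * data_stripes > max_chunk_size)
		stripe_size = max_chunk_size / data_stripes;
	return stripe_size;
}
```

For DUP (two stripes per device, one data stripe) with a 4096 unit hole and a 1024 unit chunk limit, the old order yields 512 while the fixed order yields 1024, which is the doubling the commit message warns users about. With one stripe per device both orders agree.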
|
|
|
* The primary goal is to maximize the number of stripes, so use as
|
|
|
|
* many devices as possible, even if the stripes are not maximum sized.
|
|
|
|
*
|
|
|
|
* The DUP profile stores more than one stripe per device, the
|
|
|
|
* max_avail is the total size so we have to adjust.
|
2011-04-12 18:07:57 +08:00
|
|
|
*/
|
2018-02-06 00:45:11 +08:00
|
|
|
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
|
2011-04-12 18:07:57 +08:00
|
|
|
num_stripes = ndevs * dev_stripes;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
/*
|
|
|
|
* this will have to be fixed for RAID1 and RAID10 over
|
|
|
|
* more drives
|
|
|
|
*/
|
|
|
|
data_stripes = num_stripes / ncopies;
|
|
|
|
|
2017-07-14 14:55:41 +08:00
|
|
|
if (type & BTRFS_BLOCK_GROUP_RAID5)
|
2013-01-30 07:40:14 +08:00
|
|
|
data_stripes = num_stripes - 1;
|
2017-07-14 14:55:41 +08:00
|
|
|
|
|
|
|
if (type & BTRFS_BLOCK_GROUP_RAID6)
|
2013-01-30 07:40:14 +08:00
|
|
|
data_stripes = num_stripes - 2;
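The data stripe calculation above can be written as a standalone helper. This is a minimal sketch: the `PROFILE_RAID5` and `PROFILE_RAID6` flag values and the function name `count_data_stripes` are hypothetical placeholders, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical profile flags, for illustration only. */
#define PROFILE_RAID5	(1ULL << 0)
#define PROFILE_RAID6	(1ULL << 1)

/*
 * Number of stripes that carry distinct data: copies are divided out,
 * and parity profiles give up one (RAID5) or two (RAID6) stripes.
 */
static int count_data_stripes(uint64_t type, int num_stripes, int ncopies)
{
	int data_stripes;

	assert(ncopies > 0);
	data_stripes = num_stripes / ncopies;

	if (type & PROFILE_RAID5)
		data_stripes = num_stripes - 1;
	if (type & PROFILE_RAID6)
		data_stripes = num_stripes - 2;

	return data_stripes;
}
```

A DUP chunk with two stripes and two copies has one data stripe, and a four-stripe mirrored layout with two copies has two.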
|
2013-02-21 05:23:40 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Use the number of data stripes to figure out how big this chunk
|
|
|
|
* is really going to be in terms of logical address space,
|
btrfs: alloc_chunk: fix more DUP stripe size handling
Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling"
fixed calculating the stripe_size for a new DUP chunk.
However, the same calculation reappears a bit later, and that one was
not changed yet. The resulting bug that is exposed is that the newly
allocated device extents ('stripes') can have a few MiB overlap with the
next thing stored after them, which is another device extent or the end
of the disk.
The scenario in which this can happen is:
* The block device for the filesystem is less than 10GiB in size.
* The amount of contiguous free unallocated disk space chosen to use for
chunk allocation is 20% of the total device size, or a few MiB more or
less.
An example:
- The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
- There's 1578MiB unallocated raw disk space left in one contiguous
piece.
In this case stripe_size is first calculated as 789MiB, (half of
1578MiB).
Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
enter the if block. Now stripe_size value is immediately overwritten
while calculating an adjusted value based on max_chunk_size, which ends
up as 788MiB.
Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
actually more than the value we had before. However, the last comparison
fails to detect this, because it's comparing the value with the total
amount of free space, which is about twice the size of stripe_size.
In the example above, this means that the resulting raw disk space being
allocated is 1600MiB, while only a gap of 1578MiB has been found. The
second device extent object for this DUP chunk will overlap for 22MiB
with whatever comes next.
The underlying problem here is that the stripe_size is reused all the
time for different things. So, when entering the code in the if block,
stripe_size is immediately overwritten with something else. If later we
decide we want to have the previous value back, then the logic to
compute it was copy pasted in again.
With this change, the value in stripe_size is not unnecessarily
destroyed, so the duplicated calculation is not needed any more.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-05 05:24:40 +08:00
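The worked example in the commit message above (a 1578MiB gap and a 788MiB max_chunk_size for DUP) can be replayed in userspace. This is a sketch under assumptions: the function names are hypothetical, the "overwriting" variant is reconstructed from the prose description of the old behaviour, and the 64KiB value used for the stripe length alignment is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

#define MIB		(1024ULL * 1024ULL)
#define ALIGN_16M	(16ULL * MIB)
#define STRIPE_LEN	(64ULL * 1024ULL)	/* assumed BTRFS_STRIPE_LEN */

static uint64_t round_up_u64(uint64_t x, uint64_t a)
{
	return (x + a - 1) / a * a;
}

static uint64_t round_down_u64(uint64_t x, uint64_t a)
{
	return x / a * a;
}

/* Old behaviour as described: overwrite, round up, compare with the total. */
static uint64_t stripe_size_overwriting(uint64_t max_avail,
					uint64_t dev_stripes,
					uint64_t data_stripes,
					uint64_t max_chunk_size)
{
	uint64_t stripe_size = max_avail / dev_stripes;

	if (stripe_size * data_stripes > max_chunk_size) {
		stripe_size = round_up_u64(max_chunk_size / data_stripes,
					   ALIGN_16M);
		if (stripe_size > max_avail)	/* wrong bound for DUP */
			stripe_size = max_avail;
	}
	return round_down_u64(stripe_size, STRIPE_LEN);
}

/* Fixed behaviour: never exceed the stripe size computed before the clamp. */
static uint64_t stripe_size_preserving(uint64_t max_avail,
				       uint64_t dev_stripes,
				       uint64_t data_stripes,
				       uint64_t max_chunk_size)
{
	uint64_t stripe_size = max_avail / dev_stripes;

	if (stripe_size * data_stripes > max_chunk_size) {
		uint64_t reduced = round_up_u64(max_chunk_size / data_stripes,
						ALIGN_16M);
		if (reduced < stripe_size)
			stripe_size = reduced;
	}
	return round_down_u64(stripe_size, STRIPE_LEN);
}
```

With the numbers from the message, the overwriting variant returns 800MiB, so two stripes need 1600MiB and overrun the 1578MiB gap by 22MiB. The preserving variant returns 789MiB, so both stripes fit exactly.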
|
|
|
* and compare that answer with the max chunk size. If it's higher,
|
|
|
|
* we try to reduce stripe_size.
|
2013-02-21 05:23:40 +08:00
|
|
|
*/
|
|
|
|
if (stripe_size * data_stripes > max_chunk_size) {
|
2018-01-31 14:16:34 +08:00
|
|
|
/*
|
2018-10-05 05:24:40 +08:00
|
|
|
* Reduce stripe_size, round it up to a 16MB boundary again and
|
|
|
|
* then use it, unless it ends up being even bigger than the
|
|
|
|
* previous value we had already.
|
2013-02-21 05:23:40 +08:00
|
|
|
*/
|
2018-10-05 05:24:40 +08:00
|
|
|
stripe_size = min(round_up(div_u64(max_chunk_size,
|
|
|
|
data_stripes), SZ_16M),
|
2018-01-31 14:16:34 +08:00
|
|
|
stripe_size);
|
2013-02-21 05:23:40 +08:00
|
|
|
}
|
|
|
|
|
2012-04-13 22:05:08 +08:00
|
|
|
/* align to BTRFS_STRIPE_LEN */
|
2017-07-14 14:55:41 +08:00
|
|
|
stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN);
|
2011-01-05 18:07:28 +08:00
|
|
|
|
|
|
|
map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
|
|
|
|
if (!map) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
map->num_stripes = num_stripes;
|
2008-04-18 22:29:51 +08:00
|
|
|
|
2011-04-12 18:07:57 +08:00
|
|
|
for (i = 0; i < ndevs; ++i) {
|
|
|
|
for (j = 0; j < dev_stripes; ++j) {
|
|
|
|
int s = i * dev_stripes + j;
|
|
|
|
map->stripes[s].dev = devices_info[i].dev;
|
|
|
|
map->stripes[s].physical = devices_info[i].dev_offset +
|
|
|
|
j * stripe_size;
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
|
|
|
}
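The nested loop above places `dev_stripes` consecutive stripes on each chosen device. A minimal userspace sketch of the same indexing follows; `struct stripe_slot` and `fill_stripes` are hypothetical names, and device identity is reduced to an integer id for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced form of one entry in the stripe array. */
struct stripe_slot {
	int devid;
	uint64_t physical;	/* byte offset on the device */
};

/*
 * Stripe s = i * dev_stripes + j is the j-th stripe on the i-th device and
 * starts j * stripe_size bytes past that device's free hole offset.
 * The caller provides room for ndevs * dev_stripes slots.
 */
static void fill_stripes(struct stripe_slot *out, const int *devids,
			 const uint64_t *dev_offsets, int ndevs,
			 int dev_stripes, uint64_t stripe_size)
{
	int i, j;

	assert(ndevs >= 0 && dev_stripes > 0);
	for (i = 0; i < ndevs; ++i) {
		for (j = 0; j < dev_stripes; ++j) {
			int s = i * dev_stripes + j;

			out[s].devid = devids[i];
			out[s].physical = dev_offsets[i] +
					  (uint64_t)j * stripe_size;
		}
	}
}
```

For a DUP layout on one device with a hole at offset 1000 and a stripe size of 100, the two stripes start at 1000 and 1100, which is why the hole must hold `dev_stripes` times the stripe size.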
|
2017-07-14 14:55:41 +08:00
|
|
|
map->stripe_len = BTRFS_STRIPE_LEN;
|
|
|
|
map->io_align = BTRFS_STRIPE_LEN;
|
|
|
|
map->io_width = BTRFS_STRIPE_LEN;
|
2008-11-18 10:11:30 +08:00
|
|
|
map->type = type;
|
|
|
|
map->sub_stripes = sub_stripes;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2018-10-05 05:24:39 +08:00
|
|
|
chunk_size = stripe_size * data_stripes;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2018-10-05 05:24:39 +08:00
|
|
|
trace_btrfs_chunk_alloc(info, map, start, chunk_size);
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and can be greatly
helpful for debugging, e.g.:
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints are helpful for
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between physical offset and logical offset, and stands for space
information in btrfs; these are helpful for tracing space usage.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
|
2011-04-21 06:48:27 +08:00
|
|
|
em = alloc_extent_map();
|
2008-11-18 10:11:30 +08:00
|
|
|
if (!em) {
|
2014-06-19 10:42:52 +08:00
|
|
|
kfree(map);
|
2011-01-05 18:07:28 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto error;
|
2008-03-26 04:50:33 +08:00
|
|
|
}
|
2014-06-19 10:42:52 +08:00
|
|
|
set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
|
2015-06-03 22:55:48 +08:00
|
|
|
em->map_lookup = map;
|
2008-11-18 10:11:30 +08:00
|
|
|
em->start = start;
|
2018-10-05 05:24:39 +08:00
|
|
|
em->len = chunk_size;
|
2008-11-18 10:11:30 +08:00
|
|
|
em->block_start = 0;
|
|
|
|
em->block_len = em->len;
|
2013-06-28 01:22:46 +08:00
|
|
|
em->orig_block_len = stripe_size;
|
2008-03-26 04:50:33 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
em_tree = &info->mapping_tree.map_tree;
|
2009-09-03 04:24:52 +08:00
|
|
|
write_lock(&em_tree->lock);
|
2013-04-06 04:51:15 +08:00
|
|
|
ret = add_extent_mapping(em_tree, em, 0);
|
2013-01-31 23:23:04 +08:00
|
|
|
if (ret) {
|
2017-08-21 17:43:49 +08:00
|
|
|
write_unlock(&em_tree->lock);
|
2013-01-31 23:23:04 +08:00
|
|
|
free_extent_map(em);
|
2011-09-09 08:29:00 +08:00
|
|
|
goto error;
|
2013-01-31 23:23:04 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2017-08-21 17:43:49 +08:00
|
|
|
list_add_tail(&em->list, &trans->transaction->pending_chunks);
|
|
|
|
refcount_inc(&em->refs);
|
|
|
|
write_unlock(&em_tree->lock);
|
|
|
|
|
2018-10-05 05:24:39 +08:00
|
|
|
ret = btrfs_make_block_group(trans, 0, type, start, chunk_size);
|
2013-06-28 01:22:46 +08:00
|
|
|
if (ret)
|
|
|
|
goto error_del_extent;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2018-10-05 05:24:38 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++)
|
|
|
|
btrfs_device_set_bytes_used(map->stripes[i].dev,
|
|
|
|
map->stripes[i].dev->bytes_used + stripe_size);
|
2014-09-03 21:35:36 +08:00
|
|
|
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_sub(stripe_size * map->num_stripes, &info->free_chunk_space);
|
2014-09-03 21:35:37 +08:00
|
|
|
|
2013-01-31 23:23:04 +08:00
|
|
|
free_extent_map(em);
|
2016-06-23 06:54:23 +08:00
|
|
|
check_raid56_incompat_flag(info, type);
|
2013-01-30 07:40:14 +08:00
|
|
|
|
2011-01-05 18:07:28 +08:00
|
|
|
kfree(devices_info);
|
2008-11-18 10:11:30 +08:00
|
|
|
return 0;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
error_del_extent:
|
2013-01-31 23:23:04 +08:00
|
|
|
write_lock(&em_tree->lock);
|
|
|
|
remove_extent_mapping(em_tree, em);
|
|
|
|
write_unlock(&em_tree->lock);
|
|
|
|
|
|
|
|
/* One for our allocation */
|
|
|
|
free_extent_map(em);
|
|
|
|
/* One for the tree reference */
|
|
|
|
free_extent_map(em);
|
2014-12-03 02:07:30 +08:00
|
|
|
/* One for the pending_chunks list reference */
|
|
|
|
free_extent_map(em);
|
2011-01-05 18:07:28 +08:00
|
|
|
error:
|
|
|
|
kfree(devices_info);
|
|
|
|
return ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
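The error_del_extent path above drops three references on the extent map: one for the allocation, one for the mapping tree, and one for the pending_chunks list. The counting can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct toy_em`, `toy_alloc`, `toy_get`, `toy_put` and `toy_error_path` are hypothetical names, not kernel APIs, and a plain `int` stands in for the kernel's refcount type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct extent_map; only the refcount matters. */
struct toy_em {
	int refs;
	int *freed;	/* set to 1 when the last reference is dropped */
};

/* Allocate with one reference held by the caller. */
static struct toy_em *toy_alloc(int *freed)
{
	struct toy_em *em = calloc(1, sizeof(*em));

	if (!em)
		return NULL;
	em->refs = 1;
	em->freed = freed;
	*freed = 0;
	return em;
}

/* Take an extra reference, e.g. for a tree or a list that points at em. */
static void toy_get(struct toy_em *em)
{
	assert(em->refs > 0);
	em->refs++;
}

/* Drop one reference; release the object when the count reaches zero. */
static void toy_put(struct toy_em *em)
{
	assert(em->refs > 0);
	if (--em->refs == 0) {
		*em->freed = 1;
		free(em);
	}
}

/*
 * Mirror the error path: three holders (allocation, tree, pending list),
 * so three puts are needed before the object is released.
 * Returns 1 if the object was freed exactly on the third put.
 */
static int toy_error_path(void)
{
	int freed;
	struct toy_em *em = toy_alloc(&freed);

	if (!em)
		return 0;
	toy_get(em);		/* reference held by the mapping tree */
	toy_get(em);		/* reference held by the pending list */

	toy_put(em);		/* one for our allocation */
	toy_put(em);		/* one for the tree reference */
	if (freed)
		return 0;	/* would mean the object was freed too early */
	toy_put(em);		/* one for the pending list reference */
	return freed;
}
```

On the success path only the allocation reference is dropped, so in this model the object would stay alive with two holders.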
|
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
|
2018-07-21 00:37:53 +08:00
|
|
|
u64 chunk_offset, u64 chunk_size)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2018-07-21 00:37:53 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2016-06-22 09:16:51 +08:00
|
|
|
struct btrfs_root *extent_root = fs_info->extent_root;
|
|
|
|
struct btrfs_root *chunk_root = fs_info->chunk_root;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_chunk *chunk;
|
|
|
|
struct btrfs_stripe *stripe;
|
2013-06-28 01:22:46 +08:00
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
|
|
|
size_t item_size;
|
|
|
|
u64 dev_offset;
|
|
|
|
u64 stripe_size;
|
|
|
|
int i = 0;
|
2015-12-24 05:30:51 +08:00
|
|
|
int ret = 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size);
|
2017-03-15 04:33:55 +08:00
|
|
|
if (IS_ERR(em))
|
|
|
|
return PTR_ERR(em);
|
2013-06-28 01:22:46 +08:00
|
|
|
|
2015-06-03 22:55:48 +08:00
|
|
|
map = em->map_lookup;
|
2013-06-28 01:22:46 +08:00
|
|
|
item_size = btrfs_chunk_item_size(map->num_stripes);
|
|
|
|
stripe_size = em->orig_block_len;
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
chunk = kzalloc(item_size, GFP_NOFS);
|
2013-06-28 01:22:46 +08:00
|
|
|
if (!chunk) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
Btrfs: fix race when finishing dev replace leading to transaction abort
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:
[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
[23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
[23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---
This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:
        CPU 1                                   CPU 2
btrfs_dev_replace_finishing()
  at this point
  dev_replace->tgtdev->devid ==
  BTRFS_DEV_REPLACE_DEVID (0ULL)
  ...
  btrfs_start_transaction()
  btrfs_commit_transaction()
                                        btrfs_fallocate()
                                          btrfs_alloc_data_chunk_ondemand()
                                            btrfs_join_transaction()
                                              --> starts a new transaction
                                            do_chunk_alloc()
                                              lock fs_info->chunk_mutex
                                                btrfs_alloc_chunk()
                                                  --> creates extent map for
                                                      the new chunk with
                                                      em->bdev->map->stripes[i]->dev->devid
                                                      == X (X > 0)
                                                  --> extent map is added to
                                                      fs_info->mapping_tree
                                                  --> initial phase of bg A
                                                      allocation completes
                                              unlock fs_info->chunk_mutex
  lock fs_info->chunk_mutex
  btrfs_dev_replace_update_device_in_mapping_tree()
    --> iterates fs_info->mapping_tree and
        replaces the device in every extent
        map's map->stripes[] with
        dev_replace->tgtdev, which still has
        an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
                                            btrfs_end_transaction()
                                              btrfs_create_pending_block_groups()
                                                --> starts final phase of
                                                    bg A creation (update device,
                                                    extent, and chunk trees, etc)
                                                btrfs_finish_chunk_alloc()
                                                  btrfs_update_device()
                                                    --> attempts to update a device
                                                        item with ID == 0ULL
                                                        (BTRFS_DEV_REPLACE_DEVID)
                                                        which is the current ID of
                                                        bg A's
                                                        em->bdev->map->stripes[i]->dev->devid
                                                    --> doesn't find such item
                                                        returns -ENOENT
                                                    --> the device id should have been X
                                                        and not 0ULL
                                              got -ENOENT from
                                              btrfs_finish_chunk_alloc()
                                              and aborts current transaction
  finishes setting up the target device,
  namely it sets tgtdev->devid to the value
  of srcdev->devid, which is X (and X > 0)
  frees the srcdev
  unlock fs_info->chunk_mutex
So fix this by taking the device list mutex when processing the chunk's
extent map stripes to update the device items. This avoids getting the
wrong device id and use-after-free problems if the task finishing a
chunk allocation grabs the replaced device, which is freed while the
dev replace task is holding the device list mutex.
This happened while running fstest btrfs/071.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
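The locking idea in the commit message above can be sketched in user space. This is a rough model, not the kernel code: `toy_device`, `toy_map`, `toy_replace_finish` and `toy_count_placeholder_ids` are hypothetical names, a pthread mutex stands in for the device list mutex, and the sketch assumes the replace side swaps the stripe devices and assigns the final device id inside one critical section on that same mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define TOY_REPLACE_DEVID 0ULL	/* placeholder id of the target during replace */
#define TOY_MAX_STRIPES 4

struct toy_device {
	uint64_t devid;
};

struct toy_map {
	int num_stripes;
	struct toy_device *stripe_dev[TOY_MAX_STRIPES];
};

/* Stand-in for the device list mutex. */
static pthread_mutex_t toy_device_list_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Replace side: swap the device in every stripe and give the target its
 * final id within one critical section, so a reader holding the same
 * mutex cannot observe the placeholder id through the map.
 */
static void toy_replace_finish(struct toy_map *map, struct toy_device *src,
			       struct toy_device *tgt)
{
	int i;

	assert(map->num_stripes <= TOY_MAX_STRIPES);
	pthread_mutex_lock(&toy_device_list_mutex);
	for (i = 0; i < map->num_stripes; i++)
		if (map->stripe_dev[i] == src)
			map->stripe_dev[i] = tgt;
	tgt->devid = src->devid;
	pthread_mutex_unlock(&toy_device_list_mutex);
}

/*
 * Chunk-finish side: read every stripe's device id under the mutex.
 * Returns the number of stripes whose id is still the placeholder.
 */
static int toy_count_placeholder_ids(struct toy_map *map)
{
	int i, bad = 0;

	assert(map->num_stripes <= TOY_MAX_STRIPES);
	pthread_mutex_lock(&toy_device_list_mutex);
	for (i = 0; i < map->num_stripes; i++)
		if (map->stripe_dev[i]->devid == TOY_REPLACE_DEVID)
			bad++;
	pthread_mutex_unlock(&toy_device_list_mutex);
	return bad;
}
```

Without the lock on the reader side, a reader could run between the pointer swap and the id assignment and see the placeholder id, which is the -ENOENT case in the trace.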
2015-11-20 18:42:47 +08:00
|
|
|
/*
|
|
|
|
* Take the device list mutex to prevent races with the final phase of
|
|
|
|
* a device replace operation that replaces the device object associated
|
|
|
|
* with the map's stripes, because the device object's id can change
|
|
|
|
* at any time during that final phase of the device replace operation
|
|
|
|
* (dev-replace.c:btrfs_dev_replace_finishing()).
|
|
|
|
*/
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_lock(&fs_info->fs_devices->device_list_mutex);
|
2013-06-28 01:22:46 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
device = map->stripes[i].dev;
|
|
|
|
dev_offset = map->stripes[i].physical;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = btrfs_update_device(trans, device);
|
2011-09-09 08:40:01 +08:00
|
|
|
if (ret)
|
2015-11-20 18:42:47 +08:00
|
|
|
break;
|
2017-08-18 22:58:23 +08:00
|
|
|
ret = btrfs_alloc_dev_extent(trans, device, chunk_offset,
|
|
|
|
dev_offset, stripe_size);
|
2013-06-28 01:22:46 +08:00
|
|
|
if (ret)
|
2015-11-20 18:42:47 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (ret) {
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2015-11-20 18:42:47 +08:00
|
|
|
goto out;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
stripe = &chunk->stripe;
|
2013-06-28 01:22:46 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
device = map->stripes[i].dev;
|
|
|
|
dev_offset = map->stripes[i].physical;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-04-16 03:41:47 +08:00
|
|
|
btrfs_set_stack_stripe_devid(stripe, device->devid);
|
|
|
|
btrfs_set_stack_stripe_offset(stripe, dev_offset);
|
|
|
|
memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
|
2008-11-18 10:11:30 +08:00
|
|
|
stripe++;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_set_stack_chunk_length(chunk, chunk_size);
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_set_stack_chunk_stripe_len(chunk, map->stripe_len);
|
|
|
|
btrfs_set_stack_chunk_type(chunk, map->type);
|
|
|
|
btrfs_set_stack_chunk_num_stripes(chunk, map->num_stripes);
|
|
|
|
btrfs_set_stack_chunk_io_align(chunk, map->stripe_len);
|
|
|
|
btrfs_set_stack_chunk_io_width(chunk, map->stripe_len);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_set_stack_chunk_sector_size(chunk, fs_info->sectorsize);
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_set_stack_chunk_sub_stripes(chunk, map->sub_stripes);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
|
|
|
key.type = BTRFS_CHUNK_ITEM_KEY;
|
|
|
|
key.offset = chunk_offset;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = btrfs_insert_item(trans, chunk_root, &key, chunk, item_size);
|
2011-08-11 03:32:10 +08:00
|
|
|
if (ret == 0 && map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
|
|
|
|
/*
|
|
|
|
* TODO: Cleanup of inserted chunk root in case of
|
|
|
|
* failure.
|
|
|
|
*/
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_add_system_chunk(fs_info, &key, chunk, item_size);
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
2011-03-24 19:18:59 +08:00
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
out:
|
2008-03-25 03:01:56 +08:00
|
|
|
kfree(chunk);
|
2013-06-28 01:22:46 +08:00
|
|
|
free_extent_map(em);
|
2011-08-11 03:32:10 +08:00
|
|
|
return ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
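btrfs_finish_chunk_alloc() above sizes the chunk item from the stripe count and then fills one stripe record per map stripe. The layout idea can be sketched as below, assuming the item embeds its first stripe and the remaining stripes follow it in the same allocation. The `toy_*` structs are simplified, hypothetical stand-ins and omit most fields of the real on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_UUID_SIZE 16

/* Simplified stand-ins; the real structs carry more fields. */
struct toy_stripe {
	uint64_t devid;
	uint64_t offset;
	uint8_t dev_uuid[TOY_UUID_SIZE];
};

struct toy_chunk {
	uint64_t length;
	uint16_t num_stripes;
	struct toy_stripe stripe;	/* first stripe; the rest follow it */
};

/* One stripe is embedded in the header, so only n - 1 are added. */
static size_t toy_chunk_item_size(int num_stripes)
{
	assert(num_stripes >= 1);
	return sizeof(struct toy_chunk) +
	       sizeof(struct toy_stripe) * (size_t)(num_stripes - 1);
}

/* Address of stripe i inside the variable-length item. */
static struct toy_stripe *toy_stripe_at(struct toy_chunk *chunk, int i)
{
	char *base = (char *)chunk + offsetof(struct toy_chunk, stripe);

	return (struct toy_stripe *)(base + sizeof(struct toy_stripe) * (size_t)i);
}

/* Allocate a zeroed item and fill one stripe per (devid, offset) pair. */
static struct toy_chunk *toy_build_chunk(uint64_t length, int num_stripes,
					 const uint64_t *devids,
					 const uint64_t *offsets)
{
	struct toy_chunk *chunk = calloc(1, toy_chunk_item_size(num_stripes));
	int i;

	if (!chunk)
		return NULL;
	chunk->length = length;
	chunk->num_stripes = (uint16_t)num_stripes;
	for (i = 0; i < num_stripes; i++) {
		toy_stripe_at(chunk, i)->devid = devids[i];
		toy_stripe_at(chunk, i)->offset = offsets[i];
	}
	return chunk;
}
```

The caller owns the returned buffer and releases it with free(), in the same way the function above releases its item with kfree() at the out label.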
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
/*
|
|
|
|
* Chunk allocation falls into two parts. The first part does the work
|
|
|
|
* that makes the newly allocated chunk usable, but does not perform any
|
|
|
|
* operation that modifies the chunk tree. The second part does the work
|
|
|
|
* that requires modifying the chunk tree. This division is important for
|
|
|
|
* the bootstrap process of adding storage to a seed btrfs.
|
|
|
|
*/
|
2018-06-20 20:49:06 +08:00
|
|
|
int btrfs_alloc_chunk(struct btrfs_trans_handle *trans, u64 type)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
|
|
|
u64 chunk_offset;
|
|
|
|
|
2018-06-20 20:49:06 +08:00
|
|
|
lockdep_assert_held(&trans->fs_info->chunk_mutex);
|
|
|
|
chunk_offset = find_next_chunk(trans->fs_info);
|
2017-02-11 02:46:27 +08:00
|
|
|
return __btrfs_alloc_chunk(trans, chunk_offset, type);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
static noinline int init_first_rw_device(struct btrfs_trans_handle *trans,
|
2017-02-11 02:49:01 +08:00
|
|
|
struct btrfs_fs_info *fs_info)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
|
|
|
u64 chunk_offset;
|
|
|
|
u64 sys_chunk_offset;
|
|
|
|
u64 alloc_profile;
|
|
|
|
int ret;
|
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
chunk_offset = find_next_chunk(fs_info);
|
2017-05-17 23:38:35 +08:00
|
|
|
alloc_profile = btrfs_metadata_alloc_profile(fs_info);
|
2017-02-11 02:46:27 +08:00
|
|
|
ret = __btrfs_alloc_chunk(trans, chunk_offset, alloc_profile);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
sys_chunk_offset = find_next_chunk(fs_info);
|
2017-05-17 23:38:35 +08:00
|
|
|
alloc_profile = btrfs_system_alloc_profile(fs_info);
|
2017-02-11 02:46:27 +08:00
|
|
|
ret = __btrfs_alloc_chunk(trans, sys_chunk_offset, alloc_profile);
|
2012-03-12 23:03:00 +08:00
|
|
|
return ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2014-07-03 18:22:13 +08:00
|
|
|
static inline int btrfs_chunk_max_errors(struct map_lookup *map)
|
|
|
|
{
|
|
|
|
int max_errors;
|
|
|
|
|
|
|
|
if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID5 |
|
|
|
|
BTRFS_BLOCK_GROUP_DUP)) {
|
|
|
|
max_errors = 1;
|
|
|
|
} else if (map->type & BTRFS_BLOCK_GROUP_RAID6) {
|
|
|
|
max_errors = 2;
|
|
|
|
} else {
|
|
|
|
max_errors = 0;
|
2012-09-18 21:52:32 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2014-07-03 18:22:13 +08:00
|
|
|
return max_errors;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, u64 chunk_offset)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
|
|
|
int readonly = 0;
|
2014-07-03 18:22:13 +08:00
|
|
|
int miss_ndevs = 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
int i;
|
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
|
2017-03-15 04:33:55 +08:00
|
|
|
if (IS_ERR(em))
|
2008-11-18 10:11:30 +08:00
|
|
|
return 1;
|
|
|
|
|
2015-06-03 22:55:48 +08:00
|
|
|
map = em->map_lookup;
|
2008-11-18 10:11:30 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING,
|
|
|
|
&map->stripes[i].dev->dev_state)) {
|
2014-07-03 18:22:13 +08:00
|
|
|
miss_ndevs++;
|
|
|
|
continue;
|
|
|
|
}
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE,
|
|
|
|
&map->stripes[i].dev->dev_state)) {
|
2008-11-18 10:11:30 +08:00
|
|
|
readonly = 1;
|
2014-07-03 18:22:13 +08:00
|
|
|
goto end;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
}
|
2014-07-03 18:22:13 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the number of missing devices is larger than max errors,
|
|
|
|
* we can not write the data into that chunk successfully, so
|
|
|
|
* set it readonly.
|
|
|
|
*/
|
|
|
|
if (miss_ndevs > btrfs_chunk_max_errors(map))
|
|
|
|
readonly = 1;
|
|
|
|
end:
|
2008-03-25 03:01:56 +08:00
|
|
|
free_extent_map(em);
|
2008-11-18 10:11:30 +08:00
|
|
|
return readonly;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
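
The read-only decision made by btrfs_chunk_max_errors() and btrfs_chunk_readonly() can be sketched outside the kernel as follows. This is a minimal illustration under stated assumptions, not the kernel code: the `PROFILE_*` flags, `struct dev_state`, `chunk_max_errors()` and `chunk_readonly()` are hypothetical stand-ins for the block group flags, the device state bits and the two functions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical profile flags; the values are illustrative only. */
enum {
	PROFILE_RAID1  = 1 << 0,
	PROFILE_RAID10 = 1 << 1,
	PROFILE_RAID5  = 1 << 2,
	PROFILE_RAID6  = 1 << 3,
	PROFILE_DUP    = 1 << 4,
};

/* Hypothetical per-stripe device state. */
struct dev_state {
	bool missing;
	bool writeable;
};

/* Number of missing devices a profile tolerates for writes. */
static int chunk_max_errors(unsigned int type)
{
	if (type & (PROFILE_RAID1 | PROFILE_RAID10 |
		    PROFILE_RAID5 | PROFILE_DUP))
		return 1;
	if (type & PROFILE_RAID6)
		return 2;
	return 0;
}

/* Returns 1 when the chunk has to be treated as read-only. */
static int chunk_readonly(unsigned int type, const struct dev_state *devs,
			  size_t n)
{
	int missing = 0;
	size_t i;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (devs[i].missing) {
			/* Missing devices are only counted here. */
			missing++;
			continue;
		}
		/* A present but non-writeable device decides immediately. */
		if (!devs[i].writeable)
			return 1;
	}
	/* Too many missing devices: writes cannot succeed. */
	return missing > chunk_max_errors(type);
}
```

With two RAID1 stripes, one missing device still leaves the chunk writable in this sketch, while two missing devices exceed the tolerance of 1 and make it read-only.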
|
|
|
|
|
|
|
|
void btrfs_mapping_init(struct btrfs_mapping_tree *tree)
|
|
|
|
{
|
2011-04-21 06:34:43 +08:00
|
|
|
extent_map_tree_init(&tree->map_tree);
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_mapping_tree_free(struct btrfs_mapping_tree *tree)
|
|
|
|
{
|
|
|
|
struct extent_map *em;
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2009-09-03 04:24:52 +08:00
|
|
|
write_lock(&tree->map_tree.lock);
|
2008-03-25 03:01:56 +08:00
|
|
|
em = lookup_extent_mapping(&tree->map_tree, 0, (u64)-1);
|
|
|
|
if (em)
|
|
|
|
remove_extent_mapping(&tree->map_tree, em);
|
2009-09-03 04:24:52 +08:00
|
|
|
write_unlock(&tree->map_tree.lock);
|
2008-03-25 03:01:56 +08:00
|
|
|
if (!em)
|
|
|
|
break;
|
|
|
|
/* once for us */
|
|
|
|
free_extent_map(em);
|
|
|
|
/* once for the tree */
|
|
|
|
free_extent_map(em);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-11-05 21:59:07 +08:00
|
|
|
int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
|
2008-04-10 04:28:12 +08:00
|
|
|
{
|
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
|
|
|
int ret;
|
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, logical, len);
|
2017-03-15 04:33:55 +08:00
|
|
|
if (IS_ERR(em))
|
|
|
|
/*
|
|
|
|
* We could return errors for these cases, but that could get
|
|
|
|
* ugly and we'd probably do the same thing which is just not do
|
|
|
|
* anything else and exit, so return 1 so the callers don't try
|
|
|
|
* to use other copies.
|
|
|
|
*/
|
2013-04-23 22:53:18 +08:00
|
|
|
return 1;
|
|
|
|
|
2015-06-03 22:55:48 +08:00
|
|
|
map = em->map_lookup;
|
2008-04-10 04:28:12 +08:00
|
|
|
if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
|
|
|
|
ret = map->num_stripes;
|
2008-04-16 22:49:51 +08:00
|
|
|
else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
|
|
|
|
ret = map->sub_stripes;
|
2013-01-30 07:40:14 +08:00
|
|
|
else if (map->type & BTRFS_BLOCK_GROUP_RAID5)
|
|
|
|
ret = 2;
|
|
|
|
else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
|
Btrfs: make raid6 rebuild retry more
There is a scenario in which the rebuild process fails to return good
content. Suppose that all disks can be read without problems but the
content that was read does not match its checksum. Currently, for raid6,
btrfs retries at most twice:
- the 1st retry rebuilds with all other stripes, which eventually
becomes a raid5 xor rebuild,
- if the 1st fails, the 2nd retry deliberately fails parity p so
that a raid6-style rebuild is done.
However, chances are that the content of another non-parity stripe is
also corrupted, so the above retries are not able to
return correct content, and users will see this as data loss.
More seriously, if the loss happens on some important internal btree
roots, the filesystem could refuse to mount.
This extends btrfs to do more retries, with each retry failing only one
stripe. Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.
The worst case is to retry as many times as the number of raid6 disks,
but given that such a scenario is really rare in practice,
it's still acceptable.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-03 04:36:41 +08:00
|
|
|
/*
|
|
|
|
* There could be two corrupted data stripes, so we need
|
|
|
|
* to retry in a loop in order to rebuild the correct data.
|
2018-06-20 20:48:55 +08:00
|
|
|
*
|
2018-01-03 04:36:41 +08:00
|
|
|
* Fail a stripe at a time on every retry except the
|
|
|
|
* stripe under reconstruction.
|
|
|
|
*/
|
|
|
|
ret = map->num_stripes;
|
2008-04-10 04:28:12 +08:00
|
|
|
else
|
|
|
|
ret = 1;
|
|
|
|
free_extent_map(em);
|
2012-11-06 22:06:47 +08:00
|
|
|
|
2018-03-24 09:11:38 +08:00
|
|
|
btrfs_dev_replace_read_lock(&fs_info->dev_replace);
|
2017-03-15 04:33:59 +08:00
|
|
|
if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace) &&
|
|
|
|
fs_info->dev_replace.tgtdev)
|
2012-11-06 22:06:47 +08:00
|
|
|
ret++;
|
2018-03-24 09:11:38 +08:00
|
|
|
btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
|
2012-11-06 22:06:47 +08:00
|
|
|
|
2008-04-10 04:28:12 +08:00
|
|
|
return ret;
|
|
|
|
}
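
How btrfs_num_copies() picks the number of read attempts per profile can be summarised in a small standalone sketch. The `PROFILE_*` flags and `num_copies()` are hypothetical names introduced for illustration, and the `replace_has_target` parameter stands in for the dev-replace check done under the read lock above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical profile flags; the values are illustrative only. */
enum {
	PROFILE_RAID1  = 1 << 0,
	PROFILE_RAID10 = 1 << 1,
	PROFILE_RAID5  = 1 << 2,
	PROFILE_RAID6  = 1 << 3,
	PROFILE_DUP    = 1 << 4,
};

/* Number of mirrors (read retries) a caller may try for one block. */
static int num_copies(unsigned int type, int num_stripes, int sub_stripes,
		      bool replace_has_target)
{
	int ret;

	assert(num_stripes > 0);
	if (type & (PROFILE_DUP | PROFILE_RAID1))
		ret = num_stripes;	/* every stripe is a full copy */
	else if (type & PROFILE_RAID10)
		ret = sub_stripes;	/* copies within one mirror group */
	else if (type & PROFILE_RAID5)
		ret = 2;		/* direct read plus one rebuild */
	else if (type & PROFILE_RAID6)
		ret = num_stripes;	/* fail one stripe per retry */
	else
		ret = 1;

	/* The dev-replace target counts as one extra mirror. */
	if (replace_has_target)
		ret++;
	return ret;
}
```

For a four-device RAID6 chunk this sketch yields four attempts, matching the "retry as many times as the number of raid6 disks" worst case described in the commit message.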
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
|
2013-01-30 07:40:14 +08:00
|
|
|
u64 logical)
|
|
|
|
{
|
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
2016-06-23 06:54:23 +08:00
|
|
|
unsigned long len = fs_info->sectorsize;
|
2013-01-30 07:40:14 +08:00
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, logical, len);
|
2013-01-30 07:40:14 +08:00
|
|
|
|
2017-07-11 21:55:51 +08:00
|
|
|
if (!WARN_ON(IS_ERR(em))) {
|
|
|
|
map = em->map_lookup;
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
|
|
|
|
len = map->stripe_len * nr_data_stripes(map);
|
|
|
|
free_extent_map(em);
|
|
|
|
}
|
2013-01-30 07:40:14 +08:00
|
|
|
return len;
|
|
|
|
}
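
The full stripe length computed above can be worked through in a standalone sketch. nr_data_stripes() is not shown in this part of the file; the version below assumes it subtracts one parity stripe for RAID5 and two for RAID6. The `PROFILE_*` flags and `full_stripe_len()` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical profile flags; the values are illustrative only. */
enum {
	PROFILE_RAID5 = 1 << 0,
	PROFILE_RAID6 = 1 << 1,
};

/* Assumed behaviour: stripes that carry data, i.e. total minus parity. */
static int nr_data_stripes(unsigned int type, int num_stripes)
{
	if (type & PROFILE_RAID5)
		return num_stripes - 1;
	if (type & PROFILE_RAID6)
		return num_stripes - 2;
	return num_stripes;
}

/* Full stripe length for RAID5/6, the sector size for everything else. */
static uint64_t full_stripe_len(unsigned int type, int num_stripes,
				uint64_t stripe_len, uint64_t sectorsize)
{
	assert(num_stripes > 0);
	if (type & (PROFILE_RAID5 | PROFILE_RAID6))
		return stripe_len * (uint64_t)nr_data_stripes(type, num_stripes);
	return sectorsize;
}
```

With a 64 KiB stripe length, a three-device RAID5 chunk and a four-device RAID6 chunk both give a 128 KiB full stripe under these assumptions.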
|
|
|
|
|
2017-07-19 15:48:42 +08:00
|
|
|
int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
|
2013-01-30 07:40:14 +08:00
|
|
|
{
|
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
|
|
|
int ret = 0;
|
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, logical, len);
|
2013-01-30 07:40:14 +08:00
|
|
|
|
2017-07-11 21:55:51 +08:00
|
|
|
if (!WARN_ON(IS_ERR(em))) {
|
|
|
|
map = em->map_lookup;
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
|
|
|
|
ret = 1;
|
|
|
|
free_extent_map(em);
|
|
|
|
}
|
2013-01-30 07:40:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-11-06 21:52:18 +08:00
|
|
|
static int find_live_mirror(struct btrfs_fs_info *fs_info,
|
2018-03-14 16:29:12 +08:00
|
|
|
struct map_lookup *map, int first,
|
2018-03-14 16:29:13 +08:00
|
|
|
int dev_replace_is_ongoing)
|
2008-05-14 01:46:40 +08:00
|
|
|
{
|
|
|
|
int i;
|
2018-03-14 16:29:12 +08:00
|
|
|
int num_stripes;
|
2018-03-14 16:29:13 +08:00
|
|
|
int preferred_mirror;
|
2012-11-06 21:52:18 +08:00
|
|
|
int tolerance;
|
|
|
|
struct btrfs_device *srcdev;
|
|
|
|
|
2018-03-14 16:29:12 +08:00
|
|
|
ASSERT((map->type &
|
|
|
|
(BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
|
|
|
|
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID10)
|
|
|
|
num_stripes = map->sub_stripes;
|
|
|
|
else
|
|
|
|
num_stripes = map->num_stripes;
|
|
|
|
|
2018-03-14 16:29:13 +08:00
|
|
|
preferred_mirror = first + current->pid % num_stripes;
|
|
|
|
|
2012-11-06 21:52:18 +08:00
|
|
|
if (dev_replace_is_ongoing &&
|
|
|
|
fs_info->dev_replace.cont_reading_from_srcdev_mode ==
|
|
|
|
BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
|
|
|
|
srcdev = fs_info->dev_replace.srcdev;
|
|
|
|
else
|
|
|
|
srcdev = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* try to avoid the drive that is the source drive for a
|
|
|
|
* dev-replace procedure, only choose it if no other non-missing
|
|
|
|
* mirror is available
|
|
|
|
*/
|
|
|
|
for (tolerance = 0; tolerance < 2; tolerance++) {
|
2018-03-14 16:29:13 +08:00
|
|
|
if (map->stripes[preferred_mirror].dev->bdev &&
|
|
|
|
(tolerance || map->stripes[preferred_mirror].dev != srcdev))
|
|
|
|
return preferred_mirror;
|
2018-03-14 16:29:12 +08:00
|
|
|
for (i = first; i < first + num_stripes; i++) {
|
2012-11-06 21:52:18 +08:00
|
|
|
if (map->stripes[i].dev->bdev &&
|
|
|
|
(tolerance || map->stripes[i].dev != srcdev))
|
|
|
|
return i;
|
|
|
|
}
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2012-11-06 21:52:18 +08:00
|
|
|
|
2008-05-14 01:46:40 +08:00
|
|
|
/* we couldn't find one that doesn't fail. Just return something
|
|
|
|
* and the io error handling code will clean up eventually
|
|
|
|
*/
|
2018-03-14 16:29:13 +08:00
|
|
|
return preferred_mirror;
|
2008-05-14 01:46:40 +08:00
|
|
|
}
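
The two-pass selection in find_live_mirror() can be illustrated with a simplified standalone model. `struct mirror` and `pick_mirror()` are hypothetical; `present` stands in for a non-NULL `dev->bdev`, and `avoid_source` for the case where `srcdev` is set because the replace mode asks to avoid reading from the source drive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one mirror's device. */
struct mirror {
	bool present;		/* device is attached (bdev != NULL) */
	bool is_replace_source;	/* device is the dev-replace source */
};

/*
 * Pass 0 skips the replace source when avoid_source is set, pass 1
 * accepts any present device. Falls back to the preferred index.
 */
static int pick_mirror(const struct mirror *m, int first, int num,
		       int preferred, bool avoid_source)
{
	int tolerance;
	int i;

	assert(m != NULL && num > 0);
	assert(preferred >= first && preferred < first + num);

	for (tolerance = 0; tolerance < 2; tolerance++) {
		if (m[preferred].present &&
		    (tolerance ||
		     !(avoid_source && m[preferred].is_replace_source)))
			return preferred;
		for (i = first; i < first + num; i++) {
			if (m[i].present &&
			    (tolerance ||
			     !(avoid_source && m[i].is_replace_source)))
				return i;
		}
	}
	/* Nothing usable; the I/O error path has to deal with it. */
	return preferred;
}
```

In the kernel function the preferred index comes from `first + current->pid % num_stripes`; here it is passed in so the example stays deterministic.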
|
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
static inline int parity_smaller(u64 a, u64 b)
|
|
|
|
{
|
|
|
|
return a > b;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Bubble-sort the stripe set to put the parity/syndrome stripes last */
|
2015-01-20 15:11:33 +08:00
|
|
|
static void sort_parity_stripes(struct btrfs_bio *bbio, int num_stripes)
|
2013-01-30 07:40:14 +08:00
|
|
|
{
|
|
|
|
struct btrfs_bio_stripe s;
|
|
|
|
int i;
|
|
|
|
u64 l;
|
|
|
|
int again = 1;
|
|
|
|
|
|
|
|
while (again) {
|
|
|
|
again = 0;
|
2015-01-20 15:11:32 +08:00
|
|
|
for (i = 0; i < num_stripes - 1; i++) {
|
2015-01-20 15:11:33 +08:00
|
|
|
if (parity_smaller(bbio->raid_map[i],
|
|
|
|
bbio->raid_map[i+1])) {
|
2013-01-30 07:40:14 +08:00
|
|
|
s = bbio->stripes[i];
|
2015-01-20 15:11:33 +08:00
|
|
|
l = bbio->raid_map[i];
|
2013-01-30 07:40:14 +08:00
|
|
|
bbio->stripes[i] = bbio->stripes[i+1];
|
2015-01-20 15:11:33 +08:00
|
|
|
bbio->raid_map[i] = bbio->raid_map[i+1];
|
2013-01-30 07:40:14 +08:00
|
|
|
bbio->stripes[i+1] = s;
|
2015-01-20 15:11:33 +08:00
|
|
|
bbio->raid_map[i+1] = l;
|
2014-11-14 16:06:25 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
again = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
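
The bubble sort above relies on the parity markers in raid_map being larger than any data offset. A standalone sketch of the same idea follows; it assumes the P and Q markers are the two largest 64-bit values, and `P_STRIPE`, `Q_STRIPE` and `sort_parity_last()` are hypothetical names. The `stripes` array of ints stands in for the array of `struct btrfs_bio_stripe` that is swapped in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed markers: the two largest u64 values, P below Q. */
#define P_STRIPE ((uint64_t)-2)
#define Q_STRIPE ((uint64_t)-1)

/* Sort raid_map ascending, keeping stripes[] in step with it. */
static void sort_parity_last(uint64_t *raid_map, int *stripes, int n)
{
	int again = 1;
	int i;

	assert(raid_map != NULL && stripes != NULL);
	while (again) {
		again = 0;
		for (i = 0; i < n - 1; i++) {
			if (raid_map[i] > raid_map[i + 1]) {
				uint64_t l = raid_map[i];
				int s = stripes[i];

				raid_map[i] = raid_map[i + 1];
				stripes[i] = stripes[i + 1];
				raid_map[i + 1] = l;
				stripes[i + 1] = s;
				again = 1;
			}
		}
	}
}
```

Because the markers compare greater than every data offset, an ascending sort leaves the data stripes first, then P, then Q.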
|
|
|
|
|
2015-01-20 15:11:34 +08:00
|
|
|
static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
|
|
|
|
{
|
|
|
|
struct btrfs_bio *bbio = kzalloc(
|
2015-02-20 09:51:39 +08:00
|
|
|
/* the size of the btrfs_bio */
|
2015-01-20 15:11:34 +08:00
|
|
|
sizeof(struct btrfs_bio) +
|
2015-02-20 09:51:39 +08:00
|
|
|
/* plus the variable array for the stripes */
|
2015-01-20 15:11:34 +08:00
|
|
|
sizeof(struct btrfs_bio_stripe) * (total_stripes) +
|
2015-02-20 09:51:39 +08:00
|
|
|
/* plus the variable array for the tgt dev */
|
2015-01-20 15:11:34 +08:00
|
|
|
sizeof(int) * (real_stripes) +
|
2015-02-20 09:51:39 +08:00
|
|
|
/*
|
|
|
|
* plus the raid_map, which includes both the tgt dev
|
|
|
|
* and the stripes
|
|
|
|
*/
|
|
|
|
sizeof(u64) * (total_stripes),
|
2015-08-19 20:17:41 +08:00
|
|
|
GFP_NOFS|__GFP_NOFAIL);
|
2015-01-20 15:11:34 +08:00
|
|
|
|
|
|
|
atomic_set(&bbio->error, 0);
|
2017-03-03 16:55:10 +08:00
|
|
|
refcount_set(&bbio->refs, 1);
|
2015-01-20 15:11:34 +08:00
|
|
|
|
|
|
|
return bbio;
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_get_bbio(struct btrfs_bio *bbio)
|
|
|
|
{
|
2017-03-03 16:55:10 +08:00
|
|
|
WARN_ON(!refcount_read(&bbio->refs));
|
|
|
|
refcount_inc(&bbio->refs);
|
2015-01-20 15:11:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_put_bbio(struct btrfs_bio *bbio)
|
|
|
|
{
|
|
|
|
if (!bbio)
|
|
|
|
return;
|
2017-03-03 16:55:10 +08:00
|
|
|
if (refcount_dec_and_test(&bbio->refs))
|
2015-01-20 15:11:34 +08:00
|
|
|
kfree(bbio);
|
|
|
|
}
|
|
|
|
|
2017-03-15 04:33:56 +08:00
|
|
|
/* can REQ_OP_DISCARD be sent with other REQ like REQ_OP_WRITE? */
|
|
|
|
/*
|
|
|
|
* Please note that, discard won't be sent to target device of device
|
|
|
|
* replace.
|
|
|
|
*/
|
|
|
|
static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 length,
|
|
|
|
struct btrfs_bio **bbio_ret)
|
|
|
|
{
|
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
|
|
|
struct btrfs_bio *bbio;
|
|
|
|
u64 offset;
|
|
|
|
u64 stripe_nr;
|
|
|
|
u64 stripe_nr_end;
|
|
|
|
u64 stripe_end_offset;
|
|
|
|
u64 stripe_cnt;
|
|
|
|
u64 stripe_len;
|
|
|
|
u64 stripe_offset;
|
|
|
|
u64 num_stripes;
|
|
|
|
u32 stripe_index;
|
|
|
|
u32 factor = 0;
|
|
|
|
u32 sub_stripes = 0;
|
|
|
|
u64 stripes_per_dev = 0;
|
|
|
|
u32 remaining_stripes = 0;
|
|
|
|
u32 last_stripe = 0;
|
|
|
|
int ret = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* discard always returns a bbio */
|
|
|
|
ASSERT(bbio_ret);
|
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, logical, length);
|
2017-03-15 04:33:56 +08:00
|
|
|
if (IS_ERR(em))
|
|
|
|
return PTR_ERR(em);
|
|
|
|
|
|
|
|
map = em->map_lookup;
|
|
|
|
/* we don't discard raid56 yet */
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
|
|
|
|
ret = -EOPNOTSUPP;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
offset = logical - em->start;
|
|
|
|
length = min_t(u64, em->len - offset, length);
|
|
|
|
|
|
|
|
stripe_len = map->stripe_len;
|
|
|
|
/*
|
|
|
|
* stripe_nr counts the total number of stripes we have to stride
|
|
|
|
* to get to this block
|
|
|
|
*/
|
|
|
|
stripe_nr = div64_u64(offset, stripe_len);
|
|
|
|
|
|
|
|
/* stripe_offset is the offset of this block in its stripe */
|
|
|
|
stripe_offset = offset - stripe_nr * stripe_len;
|
|
|
|
|
|
|
|
stripe_nr_end = round_up(offset + length, map->stripe_len);
|
2017-04-04 04:45:24 +08:00
|
|
|
stripe_nr_end = div64_u64(stripe_nr_end, map->stripe_len);
|
2017-03-15 04:33:56 +08:00
|
|
|
stripe_cnt = stripe_nr_end - stripe_nr;
|
|
|
|
stripe_end_offset = stripe_nr_end * map->stripe_len -
|
|
|
|
(offset + length);
|
|
|
|
/*
|
|
|
|
* after this, stripe_nr is the number of stripes on this
|
|
|
|
* device we have to walk to find the data, and stripe_index is
|
|
|
|
* the number of our device in the stripe array
|
|
|
|
*/
|
|
|
|
num_stripes = 1;
|
|
|
|
stripe_index = 0;
|
|
|
|
if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10)) {
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID0)
|
|
|
|
sub_stripes = 1;
|
|
|
|
else
|
|
|
|
sub_stripes = map->sub_stripes;
|
|
|
|
|
|
|
|
factor = map->num_stripes / sub_stripes;
|
|
|
|
num_stripes = min_t(u64, map->num_stripes,
|
|
|
|
sub_stripes * stripe_cnt);
|
|
|
|
stripe_nr = div_u64_rem(stripe_nr, factor, &stripe_index);
|
|
|
|
stripe_index *= sub_stripes;
|
|
|
|
stripes_per_dev = div_u64_rem(stripe_cnt, factor,
|
|
|
|
&remaining_stripes);
|
|
|
|
div_u64_rem(stripe_nr_end - 1, factor, &last_stripe);
|
|
|
|
last_stripe *= sub_stripes;
|
|
|
|
} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_DUP)) {
|
|
|
|
num_stripes = map->num_stripes;
|
|
|
|
} else {
|
|
|
|
stripe_nr = div_u64_rem(stripe_nr, map->num_stripes,
|
|
|
|
&stripe_index);
|
|
|
|
}
|
|
|
|
|
|
|
|
bbio = alloc_btrfs_bio(num_stripes, 0);
|
|
|
|
if (!bbio) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
bbio->stripes[i].physical =
|
|
|
|
map->stripes[stripe_index].physical +
|
|
|
|
stripe_offset + stripe_nr * map->stripe_len;
|
|
|
|
bbio->stripes[i].dev = map->stripes[stripe_index].dev;
|
|
|
|
|
|
|
|
if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10)) {
|
|
|
|
bbio->stripes[i].length = stripes_per_dev *
|
|
|
|
map->stripe_len;
|
|
|
|
|
|
|
|
if (i / sub_stripes < remaining_stripes)
|
|
|
|
bbio->stripes[i].length +=
|
|
|
|
map->stripe_len;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Special for the first stripe and
|
|
|
|
* the last stripe:
|
|
|
|
*
|
|
|
|
* |-------|...|-------|
|
|
|
|
* |----------|
|
|
|
|
* off end_off
|
|
|
|
*/
|
|
|
|
if (i < sub_stripes)
|
|
|
|
bbio->stripes[i].length -=
|
|
|
|
stripe_offset;
|
|
|
|
|
|
|
|
if (stripe_index >= last_stripe &&
|
|
|
|
stripe_index <= (last_stripe +
|
|
|
|
sub_stripes - 1))
|
|
|
|
bbio->stripes[i].length -=
|
|
|
|
stripe_end_offset;
|
|
|
|
|
|
|
|
if (i == sub_stripes - 1)
|
|
|
|
stripe_offset = 0;
|
|
|
|
} else {
|
|
|
|
bbio->stripes[i].length = length;
|
|
|
|
}
|
|
|
|
|
|
|
|
stripe_index++;
|
|
|
|
if (stripe_index == map->num_stripes) {
|
|
|
|
stripe_index = 0;
|
|
|
|
stripe_nr++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
*bbio_ret = bbio;
|
|
|
|
bbio->map_type = map->type;
|
|
|
|
bbio->num_stripes = num_stripes;
|
|
|
|
out:
|
|
|
|
free_extent_map(em);
|
|
|
|
return ret;
|
|
|
|
}
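
The stripe arithmetic at the start of __btrfs_map_block_for_discard() is easier to follow with concrete numbers. The sketch below reproduces only that arithmetic; `struct stripe_span` and `stripe_span_of()` are hypothetical names, and a plain ceiling division replaces round_up(), which in the kernel expects a power-of-two alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the values derived from a discard range. */
struct stripe_span {
	uint64_t stripe_nr;		/* first stripe touched */
	uint64_t stripe_offset;		/* offset inside the first stripe */
	uint64_t stripe_nr_end;		/* one past the last stripe touched */
	uint64_t stripe_cnt;		/* number of stripes touched */
	uint64_t stripe_end_offset;	/* unused tail of the last stripe */
};

static struct stripe_span stripe_span_of(uint64_t offset, uint64_t length,
					 uint64_t stripe_len)
{
	struct stripe_span s;

	assert(stripe_len > 0);
	s.stripe_nr = offset / stripe_len;
	s.stripe_offset = offset - s.stripe_nr * stripe_len;
	/* Same result as round_up(offset + length, stripe_len) / stripe_len. */
	s.stripe_nr_end = (offset + length + stripe_len - 1) / stripe_len;
	s.stripe_cnt = s.stripe_nr_end - s.stripe_nr;
	s.stripe_end_offset = s.stripe_nr_end * stripe_len - (offset + length);
	return s;
}
```

For a 64 KiB stripe length, a discard of 100000 bytes at chunk offset 100000 starts 34464 bytes into stripe 1, ends inside stripe 3, covers 3 stripes, and leaves 62144 bytes of the last stripe untouched.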
|
|
|
|
|
2017-03-15 04:33:57 +08:00
|
|
|
/*
|
|
|
|
* In dev-replace case, for repair case (that's the only case where the mirror
|
|
|
|
* is selected explicitly when calling btrfs_map_block), blocks left of the
|
|
|
|
* left cursor can also be read from the target drive.
|
|
|
|
*
|
|
|
|
* For REQ_GET_READ_MIRRORS, the target drive is added as the last one to the
|
|
|
|
* array of stripes.
|
|
|
|
* For READ, it also needs to be supported using the same mirror number.
|
|
|
|
*
|
|
|
|
* If the requested block is not left of the left cursor, EIO is returned. This
|
|
|
|
* can happen because btrfs_num_copies() returns one more in the dev-replace
|
|
|
|
* case.
|
|
|
|
*/
|
|
|
|
static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 length,
|
|
|
|
u64 srcdev_devid, int *mirror_num,
|
|
|
|
u64 *physical)
|
|
|
|
{
|
|
|
|
struct btrfs_bio *bbio = NULL;
|
|
|
|
int num_stripes;
|
|
|
|
int index_srcdev = 0;
|
|
|
|
int found = 0;
|
|
|
|
u64 physical_of_found = 0;
|
|
|
|
int i;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
ret = __btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
|
|
|
|
logical, &length, &bbio, 0, 0);
|
|
|
|
if (ret) {
|
|
|
|
ASSERT(bbio == NULL);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
num_stripes = bbio->num_stripes;
|
|
|
|
if (*mirror_num > num_stripes) {
|
|
|
|
/*
|
|
|
|
* BTRFS_MAP_GET_READ_MIRRORS does not contain this mirror,
|
|
|
|
* that means that the requested area is not left of the left
|
|
|
|
* cursor
|
|
|
|
*/
|
|
|
|
btrfs_put_bbio(bbio);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* process the rest of the function using the mirror_num of the source
|
|
|
|
* drive. Therefore look it up first. At the end, patch the device
|
|
|
|
* pointer to the one of the target drive.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
if (bbio->stripes[i].dev->devid != srcdev_devid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In case of DUP, in order to keep it simple, only add the
|
|
|
|
* mirror with the lowest physical address
|
|
|
|
*/
|
|
|
|
if (found &&
|
|
|
|
physical_of_found <= bbio->stripes[i].physical)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
index_srcdev = i;
|
|
|
|
found = 1;
|
|
|
|
physical_of_found = bbio->stripes[i].physical;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_put_bbio(bbio);
|
|
|
|
|
|
|
|
ASSERT(found);
|
|
|
|
if (!found)
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
*mirror_num = index_srcdev + 1;
|
|
|
|
*physical = physical_of_found;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-03-15 04:33:58 +08:00
|
|
|
static void handle_ops_on_dev_replace(enum btrfs_map_op op,
|
|
|
|
struct btrfs_bio **bbio_ret,
|
|
|
|
struct btrfs_dev_replace *dev_replace,
|
|
|
|
int *num_stripes_ret, int *max_errors_ret)
|
|
|
|
{
|
|
|
|
struct btrfs_bio *bbio = *bbio_ret;
|
|
|
|
u64 srcdev_devid = dev_replace->srcdev->devid;
|
|
|
|
int tgtdev_indexes = 0;
|
|
|
|
int num_stripes = *num_stripes_ret;
|
|
|
|
int max_errors = *max_errors_ret;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (op == BTRFS_MAP_WRITE) {
|
|
|
|
int index_where_to_add;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* duplicate the write operations while the dev replace
|
|
|
|
* procedure is running. Since the copying of the old disk to
|
|
|
|
* the new disk takes place at run time while the filesystem is
|
|
|
|
* mounted writable, the regular write operations to the old
|
|
|
|
* disk have to be duplicated to go to the new disk as well.
|
|
|
|
*
|
|
|
|
* Note that device->missing is handled by the caller, and that
|
|
|
|
* the write to the old disk is already set up in the stripes
|
|
|
|
* array.
|
|
|
|
*/
|
|
|
|
index_where_to_add = num_stripes;
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
if (bbio->stripes[i].dev->devid == srcdev_devid) {
|
|
|
|
/* write to new disk, too */
|
|
|
|
struct btrfs_bio_stripe *new =
|
|
|
|
bbio->stripes + index_where_to_add;
|
|
|
|
struct btrfs_bio_stripe *old =
|
|
|
|
bbio->stripes + i;
|
|
|
|
|
|
|
|
new->physical = old->physical;
|
|
|
|
new->length = old->length;
|
|
|
|
new->dev = dev_replace->tgtdev;
|
|
|
|
bbio->tgtdev_map[i] = index_where_to_add;
|
|
|
|
index_where_to_add++;
|
|
|
|
max_errors++;
|
|
|
|
tgtdev_indexes++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
num_stripes = index_where_to_add;
|
|
|
|
} else if (op == BTRFS_MAP_GET_READ_MIRRORS) {
|
|
|
|
int index_srcdev = 0;
|
|
|
|
int found = 0;
|
|
|
|
u64 physical_of_found = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* During the dev-replace procedure, the target drive can also
|
|
|
|
* be used to read data in case it is needed to repair a corrupt
|
|
|
|
* block elsewhere. This is possible if the requested area is
|
|
|
|
* left of the left cursor. In this area, the target drive is a
|
|
|
|
* full copy of the source drive.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
if (bbio->stripes[i].dev->devid == srcdev_devid) {
|
|
|
|
/*
|
|
|
|
* In case of DUP, in order to keep it simple,
|
|
|
|
* only add the mirror with the lowest physical
|
|
|
|
* address
|
|
|
|
*/
|
|
|
|
if (found &&
|
|
|
|
physical_of_found <=
|
|
|
|
bbio->stripes[i].physical)
|
|
|
|
continue;
|
|
|
|
index_srcdev = i;
|
|
|
|
found = 1;
|
|
|
|
physical_of_found = bbio->stripes[i].physical;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (found) {
|
|
|
|
struct btrfs_bio_stripe *tgtdev_stripe =
|
|
|
|
bbio->stripes + num_stripes;
|
|
|
|
|
|
|
|
tgtdev_stripe->physical = physical_of_found;
|
|
|
|
tgtdev_stripe->length =
|
|
|
|
bbio->stripes[index_srcdev].length;
|
|
|
|
tgtdev_stripe->dev = dev_replace->tgtdev;
|
|
|
|
bbio->tgtdev_map[index_srcdev] = num_stripes;
|
|
|
|
|
|
|
|
tgtdev_indexes++;
|
|
|
|
num_stripes++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
*num_stripes_ret = num_stripes;
|
|
|
|
*max_errors_ret = max_errors;
|
|
|
|
bbio->num_tgtdevs = tgtdev_indexes;
|
|
|
|
*bbio_ret = bbio;
|
|
|
|
}
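
The write branch of handle_ops_on_dev_replace() appends one extra stripe for each stripe that targets the source device. A reduced model of that step is sketched below; `struct stripe`, `duplicate_writes()` and the `capacity` parameter are hypothetical, and the tgtdev_map bookkeeping of the real function is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced stripe: device id plus physical offset. */
struct stripe {
	uint64_t devid;
	uint64_t physical;
};

/*
 * For every stripe on the source device, append a copy that points at
 * the target device. Returns the new stripe count and raises
 * *max_errors by one for each duplicated write.
 */
static int duplicate_writes(struct stripe *stripes, int n, int capacity,
			    uint64_t src_devid, uint64_t tgt_devid,
			    int *max_errors)
{
	int index_where_to_add = n;
	int i;

	assert(stripes != NULL && max_errors != NULL);
	for (i = 0; i < n; i++) {
		if (stripes[i].devid != src_devid)
			continue;
		assert(index_where_to_add < capacity);
		stripes[index_where_to_add].devid = tgt_devid;
		stripes[index_where_to_add].physical = stripes[i].physical;
		index_where_to_add++;
		(*max_errors)++;
	}
	return index_where_to_add;
}
```

The caller has to size the array for the worst case up front, which mirrors how the kernel allocates extra stripe slots when a replace is running.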
|
|
|
|
|
2017-03-15 04:34:00 +08:00
|
|
|
static bool need_full_stripe(enum btrfs_map_op op)
|
|
|
|
{
|
|
|
|
return (op == BTRFS_MAP_WRITE || op == BTRFS_MAP_GET_READ_MIRRORS);
|
|
|
|
}
|
|
|
|
|
2016-10-27 15:27:36 +08:00
|
|
|
static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
|
|
|
|
enum btrfs_map_op op,
|
2008-04-21 22:03:05 +08:00
|
|
|
u64 logical, u64 *length,
|
2011-08-04 23:15:33 +08:00
|
|
|
struct btrfs_bio **bbio_ret,
|
2015-01-20 15:11:33 +08:00
|
|
|
int mirror_num, int need_raid_map)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
|
|
|
u64 offset;
|
2008-03-26 04:50:33 +08:00
|
|
|
u64 stripe_offset;
|
|
|
|
u64 stripe_nr;
|
2013-01-30 07:40:14 +08:00
|
|
|
u64 stripe_len;
|
2015-02-21 01:42:11 +08:00
|
|
|
u32 stripe_index;
|
2008-04-10 04:28:12 +08:00
|
|
|
int i;
|
2011-12-01 12:55:47 +08:00
|
|
|
int ret = 0;
|
2008-04-21 22:03:05 +08:00
|
|
|
int num_stripes;
|
2008-04-29 21:38:00 +08:00
|
|
|
int max_errors = 0;
|
2014-11-14 16:06:25 +08:00
|
|
|
int tgtdev_indexes = 0;
|
2011-08-04 23:15:33 +08:00
|
|
|
struct btrfs_bio *bbio = NULL;
|
2012-11-06 21:43:46 +08:00
|
|
|
struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
|
|
|
|
int dev_replace_is_ongoing = 0;
|
|
|
|
int num_alloc_stripes;
|
2012-11-06 22:06:47 +08:00
|
|
|
int patch_the_first_stripe_for_dev_replace = 0;
|
|
|
|
u64 physical_to_patch_in_first_stripe = 0;
|
2013-01-30 07:40:14 +08:00
|
|
|
u64 raid56_full_stripe_start = (u64)-1;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2017-03-15 04:33:56 +08:00
|
|
|
if (op == BTRFS_MAP_DISCARD)
|
|
|
|
return __btrfs_map_block_for_discard(fs_info, logical,
|
|
|
|
*length, bbio_ret);
|
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, logical, *length);
|
2017-03-15 04:33:55 +08:00
|
|
|
if (IS_ERR(em))
|
|
|
|
return PTR_ERR(em);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2015-06-03 22:55:48 +08:00
|
|
|
map = em->map_lookup;
|
2008-03-25 03:01:56 +08:00
|
|
|
offset = logical - em->start;
|
2008-03-26 04:50:33 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
stripe_len = map->stripe_len;
|
2008-03-26 04:50:33 +08:00
|
|
|
stripe_nr = offset;
|
|
|
|
/*
|
|
|
|
* stripe_nr counts the total number of stripes we have to stride
|
|
|
|
* to get to this block
|
|
|
|
*/
|
2015-02-21 01:43:47 +08:00
|
|
|
stripe_nr = div64_u64(stripe_nr, stripe_len);
|
2008-03-26 04:50:33 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
stripe_offset = stripe_nr * stripe_len;
|
2016-04-13 00:54:40 +08:00
|
|
|
if (offset < stripe_offset) {
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_crit(fs_info,
|
|
|
|
"stripe math has gone wrong, stripe_offset=%llu, offset=%llu, start=%llu, logical=%llu, stripe_len=%llu",
|
2016-04-13 00:54:40 +08:00
|
|
|
stripe_offset, offset, em->start, logical,
|
|
|
|
stripe_len);
|
|
|
|
free_extent_map(em);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2008-03-26 04:50:33 +08:00
|
|
|
|
|
|
|
/* stripe_offset is the offset of this block in its stripe */
|
|
|
|
stripe_offset = offset - stripe_offset;
|
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
/* if we're here for raid56, we need to know the stripe aligned start */
|
2015-01-20 15:11:44 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
|
2013-01-30 07:40:14 +08:00
|
|
|
unsigned long full_stripe_len = stripe_len * nr_data_stripes(map);
|
|
|
|
raid56_full_stripe_start = offset;
|
|
|
|
|
|
|
|
/* allow a write of a full stripe, but make sure we don't
|
|
|
|
* allow straddling of stripes
|
|
|
|
*/
|
2015-02-21 01:43:47 +08:00
|
|
|
raid56_full_stripe_start = div64_u64(raid56_full_stripe_start,
|
|
|
|
full_stripe_len);
|
2013-01-30 07:40:14 +08:00
|
|
|
raid56_full_stripe_start *= full_stripe_len;
|
|
|
|
}
|
|
|
|
|
2017-03-15 04:33:56 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
|
2013-01-30 07:40:14 +08:00
|
|
|
u64 max_len;
|
|
|
|
/* For writes to RAID[56], allow a full stripeset across all disks.
|
|
|
|
For other RAID types and for RAID[56] reads, just allow a single
|
|
|
|
stripe (on a single disk). */
|
2015-01-20 15:11:44 +08:00
|
|
|
if ((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) &&
|
2016-10-27 15:27:36 +08:00
|
|
|
(op == BTRFS_MAP_WRITE)) {
|
2013-01-30 07:40:14 +08:00
|
|
|
max_len = stripe_len * nr_data_stripes(map) -
|
|
|
|
(offset - raid56_full_stripe_start);
|
|
|
|
} else {
|
|
|
|
/* we limit the length of each bio to what fits in a stripe */
|
|
|
|
max_len = stripe_len - stripe_offset;
|
|
|
|
}
|
|
|
|
*length = min_t(u64, em->len - offset, max_len);
|
2008-04-10 04:28:12 +08:00
|
|
|
} else {
|
|
|
|
*length = em->len - offset;
|
|
|
|
}
|
2008-04-21 22:03:05 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
/* This is for when we're called from btrfs_merge_bio_hook() and all
|
|
|
|
it cares about is the length */
|
2011-08-04 23:15:33 +08:00
|
|
|
if (!bbio_ret)
|
2008-04-10 04:28:12 +08:00
|
|
|
goto out;
|
|
|
|
|
2018-03-24 09:11:38 +08:00
|
|
|
btrfs_dev_replace_read_lock(dev_replace);
|
2012-11-06 21:43:46 +08:00
|
|
|
dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
|
|
|
|
if (!dev_replace_is_ongoing)
|
2018-03-24 09:11:38 +08:00
|
|
|
btrfs_dev_replace_read_unlock(dev_replace);
|
Btrfs: fix lockdep deadlock warning due to dev_replace
Xfstests btrfs/011 complains about a deadlock warning,
[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955]        CPU0                    CPU1
[ 1226.652955]        ----                    ----
[ 1226.652955]   lock(&fs_info->dev_replace.lock);
[ 1226.652955]                                local_irq_disable();
[ 1226.652955]                                lock(&delayed_node->mutex);
[ 1226.652955]                                lock(&found->groups_sem);
[ 1226.652955]   <Interrupt>
[ 1226.652955]     lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***
Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has exactly the same warning, but even with that fix we
still run into this.
The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock
However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to
[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to an ABBA deadlock.
To fix this, we can use the "blocking rwlock" approach used for extent_buffer, but
things are simpler here since we only need to turn the read-side spinlock into a blocking lock.
With this, btrfs/011 no longer produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-07-17 16:49:19 +08:00
|
|
|
else
|
|
|
|
btrfs_dev_replace_set_lock_blocking(dev_replace);
|
2012-11-06 21:43:46 +08:00
|
|
|
|
2012-11-06 22:06:47 +08:00
|
|
|
if (dev_replace_is_ongoing && mirror_num == map->num_stripes + 1 &&
|
2017-03-15 04:34:00 +08:00
|
|
|
!need_full_stripe(op) && dev_replace->tgtdev != NULL) {
|
2017-03-15 04:33:57 +08:00
|
|
|
ret = get_extra_mirror_from_replace(fs_info, logical, *length,
|
|
|
|
dev_replace->srcdev->devid,
|
|
|
|
&mirror_num,
|
|
|
|
&physical_to_patch_in_first_stripe);
|
|
|
|
if (ret)
|
2012-11-06 22:06:47 +08:00
|
|
|
goto out;
|
2017-03-15 04:33:57 +08:00
|
|
|
else
|
|
|
|
patch_the_first_stripe_for_dev_replace = 1;
|
2012-11-06 22:06:47 +08:00
|
|
|
} else if (mirror_num > map->num_stripes) {
|
|
|
|
mirror_num = 0;
|
|
|
|
}
|
|
|
|
|
2008-04-21 22:03:05 +08:00
|
|
|
num_stripes = 1;
|
2008-04-10 04:28:12 +08:00
|
|
|
stripe_index = 0;
|
2011-03-24 18:24:26 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
|
2015-02-21 01:43:47 +08:00
|
|
|
stripe_nr = div_u64_rem(stripe_nr, map->num_stripes,
|
|
|
|
&stripe_index);
|
2017-10-12 16:43:00 +08:00
|
|
|
if (!need_full_stripe(op))
|
2014-09-12 18:44:02 +08:00
|
|
|
mirror_num = 1;
|
2011-03-24 18:24:26 +08:00
|
|
|
} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
|
2017-10-12 16:43:00 +08:00
|
|
|
if (need_full_stripe(op))
|
2008-04-21 22:03:05 +08:00
|
|
|
num_stripes = map->num_stripes;
|
2008-04-30 02:12:09 +08:00
|
|
|
else if (mirror_num)
|
2008-04-10 04:28:12 +08:00
|
|
|
stripe_index = mirror_num - 1;
|
2008-05-14 01:46:40 +08:00
|
|
|
else {
|
2012-11-06 21:52:18 +08:00
|
|
|
stripe_index = find_live_mirror(fs_info, map, 0,
|
|
|
|
dev_replace_is_ongoing);
|
2011-08-04 23:15:33 +08:00
|
|
|
mirror_num = stripe_index + 1;
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2008-04-30 02:12:09 +08:00
|
|
|
|
2008-04-04 04:29:03 +08:00
|
|
|
} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
|
2017-10-12 16:43:00 +08:00
|
|
|
if (need_full_stripe(op)) {
|
2008-04-21 22:03:05 +08:00
|
|
|
num_stripes = map->num_stripes;
|
2011-08-04 23:15:33 +08:00
|
|
|
} else if (mirror_num) {
|
2008-04-10 04:28:12 +08:00
|
|
|
stripe_index = mirror_num - 1;
|
2011-08-04 23:15:33 +08:00
|
|
|
} else {
|
|
|
|
mirror_num = 1;
|
|
|
|
}
|
2008-04-30 02:12:09 +08:00
|
|
|
|
2008-04-16 22:49:51 +08:00
|
|
|
} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
|
2015-02-21 01:42:11 +08:00
|
|
|
u32 factor = map->num_stripes / map->sub_stripes;
|
2008-04-16 22:49:51 +08:00
|
|
|
|
2015-02-21 01:43:47 +08:00
|
|
|
stripe_nr = div_u64_rem(stripe_nr, factor, &stripe_index);
|
2008-04-16 22:49:51 +08:00
|
|
|
stripe_index *= map->sub_stripes;
|
|
|
|
|
2017-10-12 16:43:00 +08:00
|
|
|
if (need_full_stripe(op))
|
2008-04-21 22:03:05 +08:00
|
|
|
num_stripes = map->sub_stripes;
|
2008-04-16 22:49:51 +08:00
|
|
|
else if (mirror_num)
|
|
|
|
stripe_index += mirror_num - 1;
|
2008-05-14 01:46:40 +08:00
|
|
|
else {
|
2012-04-28 00:41:45 +08:00
|
|
|
int old_stripe_index = stripe_index;
|
2012-11-06 21:52:18 +08:00
|
|
|
stripe_index = find_live_mirror(fs_info, map,
|
|
|
|
stripe_index,
|
|
|
|
dev_replace_is_ongoing);
|
2012-04-28 00:41:45 +08:00
|
|
|
mirror_num = stripe_index - old_stripe_index + 1;
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2013-01-30 07:40:14 +08:00
|
|
|
|
2015-01-20 15:11:44 +08:00
|
|
|
} else if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
|
2017-10-12 16:43:00 +08:00
|
|
|
if (need_raid_map && (need_full_stripe(op) || mirror_num > 1)) {
|
2013-01-30 07:40:14 +08:00
|
|
|
/* push stripe_nr back to the start of the full stripe */
|
2017-04-04 04:45:24 +08:00
|
|
|
stripe_nr = div64_u64(raid56_full_stripe_start,
|
2015-01-17 00:26:13 +08:00
|
|
|
stripe_len * nr_data_stripes(map));
|
2013-01-30 07:40:14 +08:00
|
|
|
|
|
|
|
/* RAID[56] write or recovery. Return all stripes */
|
|
|
|
num_stripes = map->num_stripes;
|
|
|
|
max_errors = nr_parity_stripes(map);
|
|
|
|
|
|
|
|
*length = map->stripe_len;
|
|
|
|
stripe_index = 0;
|
|
|
|
stripe_offset = 0;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Mirror #0 or #1 means the original data block.
|
|
|
|
* Mirror #2 is RAID5 parity block.
|
|
|
|
* Mirror #3 is RAID6 Q block.
|
|
|
|
*/
|
2015-02-21 01:43:47 +08:00
|
|
|
stripe_nr = div_u64_rem(stripe_nr,
|
|
|
|
nr_data_stripes(map), &stripe_index);
|
2013-01-30 07:40:14 +08:00
|
|
|
if (mirror_num > 1)
|
|
|
|
stripe_index = nr_data_stripes(map) +
|
|
|
|
mirror_num - 2;
|
|
|
|
|
|
|
|
/* We distribute the parity blocks across stripes */
|
2015-02-21 01:43:47 +08:00
|
|
|
div_u64_rem(stripe_nr + stripe_index, map->num_stripes,
|
|
|
|
&stripe_index);
|
2017-10-12 16:43:00 +08:00
|
|
|
if (!need_full_stripe(op) && mirror_num <= 1)
|
2014-09-12 18:44:02 +08:00
|
|
|
mirror_num = 1;
|
2013-01-30 07:40:14 +08:00
|
|
|
}
|
2008-04-04 04:29:03 +08:00
|
|
|
} else {
|
|
|
|
/*
|
2015-02-21 01:43:47 +08:00
|
|
|
* after this, stripe_nr is the number of stripes on this
|
|
|
|
* device we have to walk to find the data, and stripe_index is
|
|
|
|
* the number of our device in the stripe array
|
2008-04-04 04:29:03 +08:00
|
|
|
*/
|
2015-02-21 01:43:47 +08:00
|
|
|
stripe_nr = div_u64_rem(stripe_nr, map->num_stripes,
|
|
|
|
&stripe_index);
|
2011-08-04 23:15:33 +08:00
|
|
|
mirror_num = stripe_index + 1;
|
2008-04-04 04:29:03 +08:00
|
|
|
}
|
2016-04-13 00:54:40 +08:00
|
|
|
if (stripe_index >= map->num_stripes) {
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_crit(fs_info,
|
|
|
|
"stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
|
2016-04-13 00:54:40 +08:00
|
|
|
stripe_index, map->num_stripes);
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
2008-04-10 04:28:12 +08:00
|
|
|
|
2012-11-06 21:43:46 +08:00
|
|
|
num_alloc_stripes = num_stripes;
|
2017-03-15 04:33:59 +08:00
|
|
|
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL) {
|
2017-03-15 04:33:56 +08:00
|
|
|
if (op == BTRFS_MAP_WRITE)
|
2012-11-06 22:06:47 +08:00
|
|
|
num_alloc_stripes <<= 1;
|
2016-10-27 15:27:36 +08:00
|
|
|
if (op == BTRFS_MAP_GET_READ_MIRRORS)
|
2012-11-06 22:06:47 +08:00
|
|
|
num_alloc_stripes++;
|
2014-11-14 16:06:25 +08:00
|
|
|
tgtdev_indexes = num_stripes;
|
2012-11-06 22:06:47 +08:00
|
|
|
}
|
2014-11-14 16:06:25 +08:00
|
|
|
|
2015-01-20 15:11:34 +08:00
|
|
|
bbio = alloc_btrfs_bio(num_alloc_stripes, tgtdev_indexes);
|
2011-12-01 12:55:47 +08:00
|
|
|
if (!bbio) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2017-03-15 04:33:59 +08:00
|
|
|
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
|
2014-11-14 16:06:25 +08:00
|
|
|
bbio->tgtdev_map = (int *)(bbio->stripes + num_alloc_stripes);
|
2011-12-01 12:55:47 +08:00
|
|
|
|
2015-01-20 15:11:33 +08:00
|
|
|
/* build raid_map */
|
2017-03-15 04:34:00 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
|
|
|
|
(need_full_stripe(op) || mirror_num > 1)) {
|
2015-01-20 15:11:33 +08:00
|
|
|
u64 tmp;
|
2015-02-21 01:42:11 +08:00
|
|
|
unsigned rot;
|
2015-01-20 15:11:33 +08:00
|
|
|
|
|
|
|
bbio->raid_map = (u64 *)((void *)bbio->stripes +
|
|
|
|
sizeof(struct btrfs_bio_stripe) *
|
|
|
|
num_alloc_stripes +
|
|
|
|
sizeof(int) * tgtdev_indexes);
|
|
|
|
|
|
|
|
/* Work out the disk rotation on this stripe-set */
|
2015-02-21 01:43:47 +08:00
|
|
|
div_u64_rem(stripe_nr, num_stripes, &rot);
|
2015-01-20 15:11:33 +08:00
|
|
|
|
|
|
|
/* Fill in the logical address of each stripe */
|
|
|
|
tmp = stripe_nr * nr_data_stripes(map);
|
|
|
|
for (i = 0; i < nr_data_stripes(map); i++)
|
|
|
|
bbio->raid_map[(i+rot) % num_stripes] =
|
|
|
|
em->start + (tmp + i) * map->stripe_len;
|
|
|
|
|
|
|
|
bbio->raid_map[(i+rot) % map->num_stripes] = RAID5_P_STRIPE;
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID6)
|
|
|
|
bbio->raid_map[(i+rot+1) % num_stripes] =
|
|
|
|
RAID6_Q_STRIPE;
|
|
|
|
}
|
|
|
|
|
2012-04-13 04:03:56 +08:00
|
|
|
|
2017-03-15 04:33:56 +08:00
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
bbio->stripes[i].physical =
|
|
|
|
map->stripes[stripe_index].physical +
|
|
|
|
stripe_offset +
|
|
|
|
stripe_nr * map->stripe_len;
|
|
|
|
bbio->stripes[i].dev =
|
|
|
|
map->stripes[stripe_index].dev;
|
|
|
|
stripe_index++;
|
2008-03-26 04:50:33 +08:00
|
|
|
}
|
2011-12-01 12:55:47 +08:00
|
|
|
|
2017-03-15 04:34:00 +08:00
|
|
|
if (need_full_stripe(op))
|
2014-07-03 18:22:13 +08:00
|
|
|
max_errors = btrfs_chunk_max_errors(map);
|
2011-12-01 12:55:47 +08:00
|
|
|
|
2015-01-20 15:11:33 +08:00
|
|
|
if (bbio->raid_map)
|
|
|
|
sort_parity_stripes(bbio, num_stripes);
|
2015-01-20 15:11:32 +08:00
|
|
|
|
2017-03-15 04:33:58 +08:00
|
|
|
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
|
2017-03-15 04:34:00 +08:00
|
|
|
need_full_stripe(op)) {
|
2017-03-15 04:33:58 +08:00
|
|
|
handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
|
|
|
|
&max_errors);
|
2012-11-06 21:43:46 +08:00
|
|
|
}
|
|
|
|
|
2011-12-01 12:55:47 +08:00
|
|
|
*bbio_ret = bbio;
|
2015-01-20 15:11:43 +08:00
|
|
|
bbio->map_type = map->type;
|
2011-12-01 12:55:47 +08:00
|
|
|
bbio->num_stripes = num_stripes;
|
|
|
|
bbio->max_errors = max_errors;
|
|
|
|
bbio->mirror_num = mirror_num;
|
2012-11-06 22:06:47 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this is the case that REQ_READ && dev_replace_is_ongoing &&
|
|
|
|
* mirror_num == num_stripes + 1 && dev_replace target drive is
|
|
|
|
* available as a mirror
|
|
|
|
*/
|
|
|
|
if (patch_the_first_stripe_for_dev_replace && num_stripes > 0) {
|
|
|
|
WARN_ON(num_stripes > 1);
|
|
|
|
bbio->stripes[0].dev = dev_replace->tgtdev;
|
|
|
|
bbio->stripes[0].physical = physical_to_patch_in_first_stripe;
|
|
|
|
bbio->mirror_num = map->num_stripes + 1;
|
|
|
|
}
|
2008-04-10 04:28:12 +08:00
|
|
|
out:
|
2015-07-17 16:49:19 +08:00
|
|
|
if (dev_replace_is_ongoing) {
|
2018-08-24 23:33:58 +08:00
|
|
|
ASSERT(atomic_read(&dev_replace->blocking_readers) > 0);
|
|
|
|
btrfs_dev_replace_read_lock(dev_replace);
|
|
|
|
/* Barrier implied by atomic_dec_and_test */
|
|
|
|
if (atomic_dec_and_test(&dev_replace->blocking_readers))
|
|
|
|
cond_wake_up_nomb(&dev_replace->read_lock_wq);
|
2018-03-24 09:11:38 +08:00
|
|
|
btrfs_dev_replace_read_unlock(dev_replace);
|
2015-07-17 16:49:19 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
free_extent_map(em);
|
2011-12-01 12:55:47 +08:00
|
|
|
return ret;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
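A user-space sketch of the RAID0 branch of the mapping above, under stated assumptions. `raid0_map`, `struct raid0_target` and `dev_start` are hypothetical names, not kernel symbols. The initial split of the chunk offset into `stripe_nr` and `stripe_offset` happens outside this excerpt, so its form here is an assumption. The remaining steps mirror the `div_u64_rem()` call and the `bbio->stripes[i].physical` computation shown above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for illustration; not a kernel structure. */
struct raid0_target {
	uint32_t stripe_index;	/* which device in the stripe array */
	uint64_t physical;	/* byte address on that device */
};

/*
 * offset:       byte offset of the logical address inside the chunk
 * dev_start[i]: where the chunk's data begins on device i
 */
static struct raid0_target raid0_map(uint64_t offset, uint64_t stripe_len,
				     uint32_t num_stripes,
				     const uint64_t *dev_start)
{
	struct raid0_target t;
	uint64_t stripe_nr, stripe_offset;

	assert(stripe_len > 0 && num_stripes > 0);

	/* Assumed first step: which stripe overall, and how far into it. */
	stripe_nr = offset / stripe_len;
	stripe_offset = offset - stripe_nr * stripe_len;

	/* Remainder picks the device, quotient counts stripes on it. */
	t.stripe_index = (uint32_t)(stripe_nr % num_stripes);
	stripe_nr /= num_stripes;

	t.physical = dev_start[t.stripe_index] + stripe_offset +
		     stripe_nr * stripe_len;
	return t;
}
```

With two devices and a 64 KiB stripe length, the fourth stripe of the chunk lands on device index 1, one stripe past that device's start.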
|
|
|
|
|
2016-10-27 15:27:36 +08:00
|
|
|
int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
|
2008-04-21 22:03:05 +08:00
|
|
|
u64 logical, u64 *length,
|
2011-08-04 23:15:33 +08:00
|
|
|
struct btrfs_bio **bbio_ret, int mirror_num)
|
2008-04-21 22:03:05 +08:00
|
|
|
{
|
2016-06-06 03:31:53 +08:00
|
|
|
return __btrfs_map_block(fs_info, op, logical, length, bbio_ret,
|
2015-01-20 15:11:33 +08:00
|
|
|
mirror_num, 0);
|
2008-04-21 22:03:05 +08:00
|
|
|
}
|
|
|
|
|
2014-10-23 14:42:50 +08:00
|
|
|
/* For Scrub/replace */
|
2016-10-27 15:27:36 +08:00
|
|
|
int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
|
2014-10-23 14:42:50 +08:00
|
|
|
u64 logical, u64 *length,
|
2017-03-28 20:45:22 +08:00
|
|
|
struct btrfs_bio **bbio_ret)
|
2014-10-23 14:42:50 +08:00
|
|
|
{
|
2017-03-28 20:45:22 +08:00
|
|
|
return __btrfs_map_block(fs_info, op, logical, length, bbio_ret, 0, 1);
|
2014-10-23 14:42:50 +08:00
|
|
|
}
|
|
|
|
|
2018-05-04 15:53:05 +08:00
|
|
|
int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
|
|
|
|
u64 physical, u64 **logical, int *naddrs, int *stripe_len)
|
2008-12-09 05:46:26 +08:00
|
|
|
{
|
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
|
|
|
u64 *buf;
|
|
|
|
u64 bytenr;
|
|
|
|
u64 length;
|
|
|
|
u64 stripe_nr;
|
2013-01-30 07:40:14 +08:00
|
|
|
u64 rmap_len;
|
2008-12-09 05:46:26 +08:00
|
|
|
int i, j, nr = 0;
|
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
em = btrfs_get_chunk_map(fs_info, chunk_start, 1);
|
2017-03-15 04:33:55 +08:00
|
|
|
if (IS_ERR(em))
|
2013-03-20 00:13:25 +08:00
|
|
|
return -EIO;
|
|
|
|
|
2015-06-03 22:55:48 +08:00
|
|
|
map = em->map_lookup;
|
2008-12-09 05:46:26 +08:00
|
|
|
length = em->len;
|
2013-01-30 07:40:14 +08:00
|
|
|
rmap_len = map->stripe_len;
|
|
|
|
|
2008-12-09 05:46:26 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID10)
|
2015-01-17 00:26:13 +08:00
|
|
|
length = div_u64(length, map->num_stripes / map->sub_stripes);
|
2008-12-09 05:46:26 +08:00
|
|
|
else if (map->type & BTRFS_BLOCK_GROUP_RAID0)
|
2015-01-17 00:26:13 +08:00
|
|
|
length = div_u64(length, map->num_stripes);
|
2015-01-20 15:11:44 +08:00
|
|
|
else if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
|
2015-01-17 00:26:13 +08:00
|
|
|
length = div_u64(length, nr_data_stripes(map));
|
2013-01-30 07:40:14 +08:00
|
|
|
rmap_len = map->stripe_len * nr_data_stripes(map);
|
|
|
|
}
|
2008-12-09 05:46:26 +08:00
|
|
|
|
2015-02-21 01:00:26 +08:00
|
|
|
buf = kcalloc(map->num_stripes, sizeof(u64), GFP_NOFS);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(!buf); /* -ENOMEM */
|
2008-12-09 05:46:26 +08:00
|
|
|
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
if (map->stripes[i].physical > physical ||
|
|
|
|
map->stripes[i].physical + length <= physical)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
stripe_nr = physical - map->stripes[i].physical;
|
2017-04-04 04:45:24 +08:00
|
|
|
stripe_nr = div64_u64(stripe_nr, map->stripe_len);
|
2008-12-09 05:46:26 +08:00
|
|
|
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
|
|
|
|
stripe_nr = stripe_nr * map->num_stripes + i;
|
2015-01-17 00:26:13 +08:00
|
|
|
stripe_nr = div_u64(stripe_nr, map->sub_stripes);
|
2008-12-09 05:46:26 +08:00
|
|
|
} else if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
|
|
|
|
stripe_nr = stripe_nr * map->num_stripes + i;
|
2013-01-30 07:40:14 +08:00
|
|
|
} /* else if RAID[56], multiply by nr_data_stripes().
|
|
|
|
* Alternatively, just use rmap_len below instead of
|
|
|
|
* map->stripe_len */
|
|
|
|
|
|
|
|
bytenr = chunk_start + stripe_nr * rmap_len;
|
2008-12-09 05:43:10 +08:00
|
|
|
WARN_ON(nr >= map->num_stripes);
|
2008-12-09 05:46:26 +08:00
|
|
|
for (j = 0; j < nr; j++) {
|
|
|
|
if (buf[j] == bytenr)
|
|
|
|
break;
|
|
|
|
}
|
2008-12-09 05:43:10 +08:00
|
|
|
if (j == nr) {
|
|
|
|
WARN_ON(nr >= map->num_stripes);
|
2008-12-09 05:46:26 +08:00
|
|
|
buf[nr++] = bytenr;
|
2008-12-09 05:43:10 +08:00
|
|
|
}
|
2008-12-09 05:46:26 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
*logical = buf;
|
|
|
|
*naddrs = nr;
|
2013-01-30 07:40:14 +08:00
|
|
|
*stripe_len = rmap_len;
|
2008-12-09 05:46:26 +08:00
|
|
|
|
|
|
|
free_extent_map(em);
|
|
|
|
return 0;
|
2008-04-21 22:03:05 +08:00
|
|
|
}
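The reverse direction can be sketched the same way. This is a user-space illustration of the RAID0 case of `btrfs_rmap_block()` for a single device index, not the kernel code. `raid0_rmap` and `dev_start` are hypothetical names, and like the kernel function it yields the logical start of the containing stripe rather than the exact byte.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 and stores the logical address of the containing stripe in
 * *logical when `physical` falls inside the chunk's extent on device `i`,
 * otherwise returns 0.
 */
static int raid0_rmap(uint64_t chunk_start, uint64_t chunk_len,
		      uint64_t stripe_len, uint32_t num_stripes,
		      const uint64_t *dev_start, uint32_t i,
		      uint64_t physical, uint64_t *logical)
{
	uint64_t per_dev_len, stripe_nr;

	assert(stripe_len > 0 && num_stripes > 0 && i < num_stripes);

	/* RAID0 spreads the chunk evenly over all devices. */
	per_dev_len = chunk_len / num_stripes;
	if (dev_start[i] > physical || dev_start[i] + per_dev_len <= physical)
		return 0;

	/* Stripe number on this device, then across the whole chunk. */
	stripe_nr = (physical - dev_start[i]) / stripe_len;
	stripe_nr = stripe_nr * num_stripes + i;

	*logical = chunk_start + stripe_nr * stripe_len;
	return 1;
}
```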
|
|
|
|
|
2015-07-20 21:29:37 +08:00
|
|
|
static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio)
|
2014-06-19 10:42:55 +08:00
|
|
|
{
|
2015-05-22 21:14:03 +08:00
|
|
|
bio->bi_private = bbio->private;
|
|
|
|
bio->bi_end_io = bbio->end_io;
|
2015-07-20 21:29:37 +08:00
|
|
|
bio_endio(bio);
|
2015-05-22 21:14:03 +08:00
|
|
|
|
2015-01-20 15:11:34 +08:00
|
|
|
btrfs_put_bbio(bbio);
|
2014-06-19 10:42:55 +08:00
|
|
|
}
|
|
|
|
|
2015-07-20 21:29:37 +08:00
|
|
|
static void btrfs_end_bio(struct bio *bio)
|
2008-04-04 04:29:03 +08:00
|
|
|
{
|
2013-05-18 06:30:14 +08:00
|
|
|
struct btrfs_bio *bbio = bio->bi_private;
|
2008-08-05 22:13:57 +08:00
|
|
|
int is_orig_bio = 0;
|
2008-04-04 04:29:03 +08:00
|
|
|
|
2017-06-03 15:38:06 +08:00
|
|
|
if (bio->bi_status) {
|
2011-08-04 23:15:33 +08:00
|
|
|
atomic_inc(&bbio->error);
|
2017-06-03 15:38:06 +08:00
|
|
|
if (bio->bi_status == BLK_STS_IOERR ||
|
|
|
|
bio->bi_status == BLK_STS_TARGET) {
|
2012-05-25 22:06:08 +08:00
|
|
|
unsigned int stripe_index =
|
2013-05-18 06:30:14 +08:00
|
|
|
btrfs_io_bio(bio)->stripe_index;
|
2015-06-08 20:05:50 +08:00
|
|
|
struct btrfs_device *dev;
|
2012-05-25 22:06:08 +08:00
|
|
|
|
|
|
|
BUG_ON(stripe_index >= bbio->num_stripes);
|
|
|
|
dev = bbio->stripes[stripe_index].dev;
|
2012-06-14 22:42:31 +08:00
|
|
|
if (dev->bdev) {
|
2016-06-06 03:31:52 +08:00
|
|
|
if (bio_op(bio) == REQ_OP_WRITE)
|
2017-10-21 01:45:33 +08:00
|
|
|
btrfs_dev_stat_inc_and_print(dev,
|
2012-06-14 22:42:31 +08:00
|
|
|
BTRFS_DEV_STAT_WRITE_ERRS);
|
|
|
|
else
|
2017-10-21 01:45:33 +08:00
|
|
|
btrfs_dev_stat_inc_and_print(dev,
|
2012-06-14 22:42:31 +08:00
|
|
|
BTRFS_DEV_STAT_READ_ERRS);
|
2016-11-01 21:40:10 +08:00
|
|
|
if (bio->bi_opf & REQ_PREFLUSH)
|
2017-10-21 01:45:33 +08:00
|
|
|
btrfs_dev_stat_inc_and_print(dev,
|
2012-06-14 22:42:31 +08:00
|
|
|
BTRFS_DEV_STAT_FLUSH_ERRS);
|
|
|
|
}
|
2012-05-25 22:06:08 +08:00
|
|
|
}
|
|
|
|
}
|
2008-04-04 04:29:03 +08:00
|
|
|
|
2011-08-04 23:15:33 +08:00
|
|
|
if (bio == bbio->orig_bio)
|
2008-08-05 22:13:57 +08:00
|
|
|
is_orig_bio = 1;
|
|
|
|
|
Btrfs: fix use-after-free in the finishing procedure of the device replace
During device replace test, we hit a null pointer dereference (it was very easy
to reproduce by running xfstests' btrfs/011 on devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree, and we forgot to replace the source device in the
mappings of those new chunks.
- We might get mapping information that includes the source device
before the mapping information update, and then submit a bio
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing the mapping tree update and the source
device removal under the same chunk mutex. The chunk mutex
protects the allocatable device list, so this prevents
new chunk allocation in between, and after we remove the source device, all
new chunks will be allocated on the new device. So it fixes
the first bug.
For the second bug, we need to make sure all in-flight bios are finished and
no new bios are produced while we are removing the source device. To fix
this problem, we introduced a global @bio_counter: we not only inc/dec
@bio_counter outside of map_blocks, but also inc it before submitting a bio
and dec it when ending bios.
Since RAID56 is a little different and device replace doesn't support RAID56
yet, it is not addressed in this patch, and I added comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 16:46:55 +08:00
|
|
|
btrfs_bio_counter_dec(bbio->fs_info);
|
|
|
|
|
2011-08-04 23:15:33 +08:00
|
|
|
if (atomic_dec_and_test(&bbio->stripes_pending)) {
|
2008-08-05 22:13:57 +08:00
|
|
|
if (!is_orig_bio) {
|
|
|
|
bio_put(bio);
|
2011-08-04 23:15:33 +08:00
|
|
|
bio = bbio->orig_bio;
|
2008-08-05 22:13:57 +08:00
|
|
|
}
|
2014-01-09 05:19:52 +08:00
|
|
|
|
2013-05-18 06:30:14 +08:00
|
|
|
btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
|
2008-04-29 21:38:00 +08:00
|
|
|
/* only send an error to the higher layers if it is
|
2013-01-30 07:40:14 +08:00
|
|
|
* beyond the tolerance of the btrfs bio
|
2008-04-29 21:38:00 +08:00
|
|
|
*/
|
2011-08-04 23:15:33 +08:00
|
|
|
if (atomic_read(&bbio->error) > bbio->max_errors) {
|
2017-06-03 15:38:06 +08:00
|
|
|
bio->bi_status = BLK_STS_IOERR;
|
2011-12-10 00:07:37 +08:00
|
|
|
} else {
|
2008-05-13 01:39:03 +08:00
|
|
|
/*
|
|
|
|
* this bio is actually up to date, we didn't
|
|
|
|
* go over the max number of errors
|
|
|
|
*/
|
2017-10-14 08:35:56 +08:00
|
|
|
bio->bi_status = BLK_STS_OK;
|
2008-05-13 01:39:03 +08:00
|
|
|
}
|
2014-06-19 10:42:54 +08:00
|
|
|
|
2015-07-20 21:29:37 +08:00
|
|
|
btrfs_end_bbio(bbio, bio);
|
2008-08-05 22:13:57 +08:00
|
|
|
} else if (!is_orig_bio) {
|
2008-04-04 04:29:03 +08:00
|
|
|
bio_put(bio);
|
|
|
|
}
|
|
|
|
}
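The completion logic above reduces to two counters: the last sub-request to finish decides the overall status by comparing the error count with the tolerance. The following is a user-space sketch using C11 atomics, not the kernel implementation. `struct multi_io`, `enum io_result` and `multi_io_complete` are hypothetical stand-ins for the `stripes_pending`, `error` and `max_errors` fields used by `btrfs_end_bio()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counters kept in struct btrfs_bio. */
struct multi_io {
	atomic_int pending;	/* sub-requests still in flight */
	atomic_int errors;	/* sub-requests that failed */
	int max_errors;		/* failures the RAID profile can absorb */
};

enum io_result { IO_IN_FLIGHT, IO_OK, IO_FAILED };

/* Called once per finished sub-request. */
static enum io_result multi_io_complete(struct multi_io *io, bool failed)
{
	assert(atomic_load(&io->pending) > 0);

	if (failed)
		atomic_fetch_add(&io->errors, 1);

	/* atomic_fetch_sub returns the old value; 1 means we were last. */
	if (atomic_fetch_sub(&io->pending, 1) != 1)
		return IO_IN_FLIGHT;

	/* Only report failure when the tolerance is exceeded. */
	return atomic_load(&io->errors) > io->max_errors ? IO_FAILED : IO_OK;
}
```

A two-copy write with `max_errors = 1` therefore still reports success when one copy fails, which matches the "this bio is actually up to date" branch above.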
|
|
|
|
|
2008-06-12 04:50:36 +08:00
|
|
|
/*
|
|
|
|
* see run_scheduled_bios for a description of why bios are collected for
|
|
|
|
* async submit.
|
|
|
|
*
|
|
|
|
* This will add one bio to the pending list for a device and make sure
|
|
|
|
* the work struct is scheduled.
|
|
|
|
*/
|
2016-06-23 06:54:24 +08:00
|
|
|
static noinline void btrfs_schedule_bio(struct btrfs_device *device,
|
2016-06-06 03:31:41 +08:00
|
|
|
struct bio *bio)
|
2008-06-12 04:50:36 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
2008-06-12 04:50:36 +08:00
|
|
|
int should_queue = 1;
|
2009-04-21 03:50:09 +08:00
|
|
|
struct btrfs_pending_bios *pending_bios;
|
2008-06-12 04:50:36 +08:00
|
|
|
|
|
|
|
/* don't bother with additional async steps for reads, right now */
|
2016-06-06 03:31:52 +08:00
|
|
|
if (bio_op(bio) == REQ_OP_READ) {
|
2016-06-06 03:31:41 +08:00
|
|
|
btrfsic_submit_bio(bio);
|
2012-03-01 21:56:26 +08:00
|
|
|
return;
|
2008-06-12 04:50:36 +08:00
|
|
|
}
|
|
|
|
|
2008-08-01 04:29:02 +08:00
|
|
|
WARN_ON(bio->bi_next);
|
2008-06-12 04:50:36 +08:00
|
|
|
bio->bi_next = NULL;
|
|
|
|
|
|
|
|
spin_lock(&device->io_lock);
|
2016-11-01 21:40:06 +08:00
|
|
|
if (op_is_sync(bio->bi_opf))
|
2009-04-21 03:50:09 +08:00
|
|
|
pending_bios = &device->pending_sync_bios;
|
|
|
|
else
|
|
|
|
pending_bios = &device->pending_bios;
|
2008-06-12 04:50:36 +08:00
|
|
|
|
2009-04-21 03:50:09 +08:00
|
|
|
if (pending_bios->tail)
|
|
|
|
pending_bios->tail->bi_next = bio;
|
2008-06-12 04:50:36 +08:00
|
|
|
|
2009-04-21 03:50:09 +08:00
|
|
|
pending_bios->tail = bio;
|
|
|
|
if (!pending_bios->head)
|
|
|
|
pending_bios->head = bio;
|
2008-06-12 04:50:36 +08:00
|
|
|
if (device->running_pending)
|
|
|
|
should_queue = 0;
|
|
|
|
|
|
|
|
spin_unlock(&device->io_lock);
|
|
|
|
|
|
|
|
if (should_queue)
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_queue_work(fs_info->submit_workers, &device->work);
|
2008-06-12 04:50:36 +08:00
|
|
|
}
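The pending list handling in `btrfs_schedule_bio()` is a plain head/tail append on a singly linked list. Here is a minimal user-space sketch without the locking or the work queue. `struct req`, `struct req_list` and `req_list_append` are hypothetical names standing in for `struct bio`, `struct btrfs_pending_bios` and the inline append above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request type; `next` plays the role of bio->bi_next. */
struct req {
	struct req *next;
	int id;
};

/* Hypothetical list type mirroring the head/tail pair used above. */
struct req_list {
	struct req *head;
	struct req *tail;
};

/* Append r in O(1); the caller is assumed to hold the list's lock. */
static void req_list_append(struct req_list *list, struct req *r)
{
	assert(r->next == NULL);

	if (list->tail)
		list->tail->next = r;
	list->tail = r;
	if (!list->head)
		list->head = r;
}
```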
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
|
|
|
|
u64 physical, int dev_nr, int async)
|
2012-10-20 04:50:56 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *dev = bbio->stripes[dev_nr].dev;
|
2016-06-23 06:54:24 +08:00
|
|
|
struct btrfs_fs_info *fs_info = bbio->fs_info;
|
2012-10-20 04:50:56 +08:00
|
|
|
|
|
|
|
bio->bi_private = bbio;
|
2013-05-18 06:30:14 +08:00
|
|
|
btrfs_io_bio(bio)->stripe_index = dev_nr;
|
2012-10-20 04:50:56 +08:00
|
|
|
bio->bi_end_io = btrfs_end_bio;
|
2013-10-12 06:44:27 +08:00
|
|
|
bio->bi_iter.bi_sector = physical >> 9;
|
2018-08-02 15:19:07 +08:00
|
|
|
btrfs_debug_in_rcu(fs_info,
|
|
|
|
"btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
|
|
|
|
bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector,
|
|
|
|
(u_long)dev->bdev->bd_dev, rcu_str_deref(dev->name), dev->devid,
|
|
|
|
bio->bi_iter.bi_size);
|
2017-08-24 01:10:32 +08:00
|
|
|
bio_set_dev(bio, dev->bdev);
|
2014-01-30 16:46:55 +08:00
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
btrfs_bio_counter_inc_noblocked(fs_info);
|
2014-01-30 16:46:55 +08:00
|
|
|
|
2012-10-20 04:50:56 +08:00
|
|
|
if (async)
|
2016-06-23 06:54:24 +08:00
|
|
|
btrfs_schedule_bio(dev, bio);
|
2012-10-20 04:50:56 +08:00
|
|
|
else
|
2016-06-06 03:31:41 +08:00
|
|
|
btrfsic_submit_bio(bio);
|
2012-10-20 04:50:56 +08:00
|
|
|
}
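The commit message quoted above describes a global @bio_counter that is incremented before a bio is submitted and decremented when the bio ends, so that device removal can wait until nothing is in flight. The following is a minimal user-space sketch of that idea using C11 atomics. The type `struct inflight_counter` and the `inflight_*` helpers are hypothetical names chosen for illustration, not the btrfs implementation; in particular, the kernel's "blocked" variant may wait rather than refuse, and that waiting is not modeled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-filesystem bio counter. */
struct inflight_counter {
	atomic_long in_flight;	/* bios submitted but not yet ended */
	atomic_bool blocked;	/* set while the source device is being removed */
};

void inflight_init(struct inflight_counter *c)
{
	atomic_init(&c->in_flight, 0);
	atomic_init(&c->blocked, false);
}

/* Start refusing new top-level requests (removal is about to begin). */
void inflight_block(struct inflight_counter *c)
{
	atomic_store(&c->blocked, true);
}

/* Count a new request unless removal is in progress; false means refused. */
bool inflight_try_inc(struct inflight_counter *c)
{
	if (atomic_load(&c->blocked))
		return false;
	atomic_fetch_add(&c->in_flight, 1);
	/* Re-check: removal may have started between the test and the add. */
	if (atomic_load(&c->blocked)) {
		atomic_fetch_sub(&c->in_flight, 1);
		return false;
	}
	return true;
}

/* Count a bio that belongs to an already-counted request; never refused. */
void inflight_inc_noblocked(struct inflight_counter *c)
{
	atomic_fetch_add(&c->in_flight, 1);
}

/* Called when a bio ends; returns true if it was the last one in flight. */
bool inflight_dec(struct inflight_counter *c)
{
	long prev = atomic_fetch_sub(&c->in_flight, 1);

	assert(prev > 0);
	return prev == 1;
}

/* True when the remover may safely free the source device. */
bool inflight_drained(struct inflight_counter *c)
{
	return atomic_load(&c->in_flight) == 0;
}
```

In this sketch the remover would call `inflight_block()` and then poll or wait until `inflight_drained()` returns true before freeing the device.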
|
|
|
|
|
|
|
|
static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
|
|
|
|
{
|
|
|
|
atomic_inc(&bbio->error);
|
|
|
|
if (atomic_dec_and_test(&bbio->stripes_pending)) {
|
2016-05-20 09:18:45 +08:00
|
|
|
/* Should be the original bio. */
|
2014-06-19 10:42:55 +08:00
|
|
|
WARN_ON(bio != bbio->orig_bio);
|
|
|
|
|
2013-05-18 06:30:14 +08:00
|
|
|
btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
|
2013-10-12 06:44:27 +08:00
|
|
|
bio->bi_iter.bi_sector = logical >> 9;
|
2017-10-14 08:34:02 +08:00
|
|
|
if (atomic_read(&bbio->error) > bbio->max_errors)
|
|
|
|
bio->bi_status = BLK_STS_IOERR;
|
|
|
|
else
|
|
|
|
bio->bi_status = BLK_STS_OK;
|
2015-07-20 21:29:37 +08:00
|
|
|
btrfs_end_bbio(bbio, bio);
|
2012-10-20 04:50:56 +08:00
|
|
|
}
|
|
|
|
}
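bbio_error() above records one failed stripe and, once the last pending stripe has completed, marks the original bio as failed only if the error count exceeds max_errors. A simplified, single-threaded sketch of that decision follows. `struct stripe_tracker`, `enum io_status` and `stripe_done()` are hypothetical stand-ins, and plain ints replace the kernel's atomic_t.

```c
#include <assert.h>
#include <stdbool.h>

enum io_status { IO_OK, IO_ERR, IO_PENDING };

/* Hypothetical, non-atomic model of the counters kept per mapped bio. */
struct stripe_tracker {
	int errors;		/* stripes that failed so far */
	int max_errors;		/* failures the profile can tolerate */
	int stripes_pending;	/* stripes that have not completed yet */
};

/*
 * Account for one completed stripe. Returns IO_PENDING while other stripes
 * are outstanding; the final stripe decides the overall status.
 */
enum io_status stripe_done(struct stripe_tracker *t, bool failed)
{
	assert(t->stripes_pending > 0);

	if (failed)
		t->errors++;
	if (--t->stripes_pending > 0)
		return IO_PENDING;
	return t->errors > t->max_errors ? IO_ERR : IO_OK;
}
```

With two stripes and max_errors of 1 (as a mirrored profile might allow), a single failure still yields IO_OK; with max_errors of 0 the same failure yields IO_ERR.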
|
|
|
|
|
2017-08-23 14:45:59 +08:00
|
|
|
blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
|
|
|
|
int mirror_num, int async_submit)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *dev;
|
2008-04-04 04:29:03 +08:00
|
|
|
struct bio *first_bio = bio;
|
2013-10-12 06:44:27 +08:00
|
|
|
u64 logical = (u64)bio->bi_iter.bi_sector << 9;
|
2008-03-25 03:01:56 +08:00
|
|
|
u64 length = 0;
|
|
|
|
u64 map_length;
|
|
|
|
int ret;
|
2015-02-12 15:42:16 +08:00
|
|
|
int dev_nr;
|
|
|
|
int total_devs;
|
2011-08-04 23:15:33 +08:00
|
|
|
struct btrfs_bio *bbio = NULL;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-10-12 06:44:27 +08:00
|
|
|
length = bio->bi_iter.bi_size;
|
2008-03-25 03:01:56 +08:00
|
|
|
map_length = length;
|
2008-04-10 04:28:12 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_bio_counter_inc_blocked(fs_info);
|
2017-09-20 07:50:09 +08:00
|
|
|
ret = __btrfs_map_block(fs_info, btrfs_op(bio), logical,
|
2016-06-06 03:31:52 +08:00
|
|
|
&map_length, &bbio, mirror_num, 1);
|
2014-01-30 16:46:55 +08:00
|
|
|
if (ret) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_bio_counter_dec(fs_info);
|
2017-08-23 14:45:59 +08:00
|
|
|
return errno_to_blk_status(ret);
|
2014-01-30 16:46:55 +08:00
|
|
|
}
|
2008-04-10 04:28:12 +08:00
|
|
|
|
2011-08-04 23:15:33 +08:00
|
|
|
total_devs = bbio->num_stripes;
|
2013-01-30 07:40:14 +08:00
|
|
|
bbio->orig_bio = first_bio;
|
|
|
|
bbio->private = first_bio->bi_private;
|
|
|
|
bbio->end_io = first_bio->bi_end_io;
|
2016-06-23 06:54:23 +08:00
|
|
|
bbio->fs_info = fs_info;
|
2013-01-30 07:40:14 +08:00
|
|
|
atomic_set(&bbio->stripes_pending, bbio->num_stripes);
|
|
|
|
|
2015-12-15 18:18:09 +08:00
|
|
|
if ((bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) &&
|
2016-06-06 03:31:52 +08:00
|
|
|
((bio_op(bio) == REQ_OP_WRITE) || (mirror_num > 1))) {
|
2013-01-30 07:40:14 +08:00
|
|
|
/* In this case, map_length has been set to the length of
|
|
|
|
a single stripe; not the whole write */
|
2016-06-06 03:31:52 +08:00
|
|
|
if (bio_op(bio) == REQ_OP_WRITE) {
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = raid56_parity_write(fs_info, bio, bbio,
|
|
|
|
map_length);
|
2013-01-30 07:40:14 +08:00
|
|
|
} else {
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = raid56_parity_recover(fs_info, bio, bbio,
|
|
|
|
map_length, mirror_num, 1);
|
2013-01-30 07:40:14 +08:00
|
|
|
}
|
2014-11-25 16:39:28 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_bio_counter_dec(fs_info);
|
2017-08-23 14:45:59 +08:00
|
|
|
return errno_to_blk_status(ret);
|
2013-01-30 07:40:14 +08:00
|
|
|
}
|
|
|
|
|
2008-04-10 04:28:12 +08:00
|
|
|
if (map_length < length) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_crit(fs_info,
|
2016-09-20 22:05:00 +08:00
|
|
|
"mapping failed logical %llu bio len %llu len %llu",
|
|
|
|
logical, length, map_length);
|
2008-04-10 04:28:12 +08:00
|
|
|
BUG();
|
|
|
|
}
|
2011-08-04 23:15:33 +08:00
|
|
|
|
2015-02-12 15:42:16 +08:00
|
|
|
for (dev_nr = 0; dev_nr < total_devs; dev_nr++) {
|
2012-10-20 04:50:56 +08:00
|
|
|
dev = bbio->stripes[dev_nr].dev;
|
2018-11-08 22:16:38 +08:00
|
|
|
if (!dev || !dev->bdev || test_bit(BTRFS_DEV_STATE_MISSING,
|
|
|
|
&dev->dev_state) ||
|
2017-12-04 12:54:52 +08:00
|
|
|
(bio_op(first_bio) == REQ_OP_WRITE &&
|
|
|
|
!test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))) {
|
2012-10-20 04:50:56 +08:00
|
|
|
bbio_error(bbio, first_bio, logical);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2017-06-02 23:38:30 +08:00
|
|
|
if (dev_nr < total_devs - 1)
|
2017-06-02 23:48:13 +08:00
|
|
|
bio = btrfs_bio_clone(first_bio);
|
2017-06-02 23:38:30 +08:00
|
|
|
else
|
2011-08-04 23:15:33 +08:00
|
|
|
bio = first_bio;
|
2012-10-20 04:50:56 +08:00
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
submit_stripe_bio(bbio, bio, bbio->stripes[dev_nr].physical,
|
|
|
|
dev_nr, async_submit);
|
2008-04-04 04:29:03 +08:00
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_bio_counter_dec(fs_info);
|
2017-08-23 14:45:59 +08:00
|
|
|
return BLK_STS_OK;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
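btrfs_map_bio() turns the bio's starting sector into a byte address with `<< 9`, and bbio_error() converts back with `>> 9`, since block-layer sectors are 512 bytes. The sketch below shows both conversions; the helper names are made up for illustration. The source casts to u64 before shifting, and the 32-bit input type here is an assumption chosen only to show why widening first matters.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT_BITS 9	/* 2^9 = 512-byte sectors */

/* Widen before shifting so a narrow sector number cannot overflow. */
uint64_t sector_to_bytes(uint32_t sector)
{
	return (uint64_t)sector << SECTOR_SHIFT_BITS;
}

/* Byte offset to sector number; any partial sector is rounded down. */
uint64_t bytes_to_sector(uint64_t bytes)
{
	return bytes >> SECTOR_SHIFT_BITS;
}
```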
|
|
|
|
|
2012-11-06 00:03:39 +08:00
|
|
|
struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info, u64 devid,
|
2008-11-18 10:11:30 +08:00
|
|
|
u8 *uuid, u8 *fsid)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_fs_devices *cur_devices;
|
|
|
|
|
2012-11-06 00:03:39 +08:00
|
|
|
cur_devices = fs_info->fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
while (cur_devices) {
|
|
|
|
if (!fsid ||
|
2017-07-29 17:50:09 +08:00
|
|
|
!memcmp(cur_devices->fsid, fsid, BTRFS_FSID_SIZE)) {
|
2017-06-16 01:51:51 +08:00
|
|
|
device = find_device(cur_devices, devid, uuid);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (device)
|
|
|
|
return device;
|
|
|
|
}
|
|
|
|
cur_devices = cur_devices->seed;
|
|
|
|
}
|
|
|
|
return NULL;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
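btrfs_find_device() above walks the chain of seed filesystems through the `seed` pointer and searches each one whose fsid matches, or every one when no fsid is given. A self-contained sketch of that traversal follows. The structs are heavily reduced, hypothetical versions of the kernel ones, and the per-device UUID comparison done by find_device() is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16

/* Reduced, hypothetical models of a device and a device set. */
struct dev {
	uint64_t devid;
	struct dev *next;
};

struct fs_devs {
	uint8_t fsid[FSID_SIZE];
	struct dev *devices;
	struct fs_devs *seed;	/* next filesystem in the seed chain */
};

/* Search the chain starting at cur; a NULL fsid matches every filesystem. */
struct dev *lookup_dev(struct fs_devs *cur, uint64_t devid,
		       const uint8_t *fsid)
{
	for (; cur; cur = cur->seed) {
		if (fsid && memcmp(cur->fsid, fsid, FSID_SIZE) != 0)
			continue;
		for (struct dev *d = cur->devices; d; d = d->next)
			if (d->devid == devid)
				return d;
	}
	return NULL;
}
```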
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static struct btrfs_device *add_missing_dev(struct btrfs_fs_devices *fs_devices,
|
2008-05-14 01:46:40 +08:00
|
|
|
u64 devid, u8 *dev_uuid)
|
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
|
|
|
|
2013-08-23 18:20:17 +08:00
|
|
|
device = btrfs_alloc_device(NULL, &devid, dev_uuid);
|
|
|
|
if (IS_ERR(device))
|
2017-10-11 12:46:18 +08:00
|
|
|
return device;
|
2013-08-23 18:20:17 +08:00
|
|
|
|
|
|
|
list_add(&device->dev_list, &fs_devices->devices);
|
2008-12-12 23:03:26 +08:00
|
|
|
device->fs_devices = fs_devices;
|
2008-05-14 01:46:40 +08:00
|
|
|
fs_devices->num_devices++;
|
2013-08-23 18:20:17 +08:00
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
|
2010-12-14 03:56:23 +08:00
|
|
|
fs_devices->missing_devices++;
|
2013-08-23 18:20:17 +08:00
|
|
|
|
2008-05-14 01:46:40 +08:00
|
|
|
return device;
|
|
|
|
}
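add_missing_dev() above passes the result of btrfs_alloc_device() straight back to its caller when IS_ERR() is true, relying on the kernel convention of encoding a negative errno in the pointer value itself. The following user-space sketch imitates that convention. The lowercase names are deliberately different from the kernel macros to make clear this is an illustration, and it assumes, as the kernel does, that the top 4095 addresses are never valid pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr lies in the top MAX_ERRNO addresses, i.e. encodes an error. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

A caller checks `is_err(p)` before dereferencing and propagates `ptr_err(p)` as the error code. Note that NULL is not an error value under this scheme.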
|
|
|
|
|
2013-08-23 18:20:17 +08:00
|
|
|
/**
|
|
|
|
* btrfs_alloc_device - allocate struct btrfs_device
|
|
|
|
* @fs_info: used only for generating a new devid, can be NULL if
|
|
|
|
* devid is provided (i.e. @devid != NULL).
|
|
|
|
* @devid: a pointer to devid for this device. If NULL a new devid
|
|
|
|
* is generated.
|
|
|
|
* @uuid: a pointer to UUID for this device. If NULL a new UUID
|
|
|
|
* is generated.
|
|
|
|
*
|
|
|
|
* Return: a pointer to a new &struct btrfs_device on success; ERR_PTR()
|
2017-10-31 01:10:25 +08:00
|
|
|
* on error. Returned struct is not linked onto any lists and must be
|
2018-03-20 22:47:33 +08:00
|
|
|
* destroyed with btrfs_free_device.
|
2013-08-23 18:20:17 +08:00
|
|
|
*/
|
|
|
|
struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
|
|
|
|
const u64 *devid,
|
|
|
|
const u8 *uuid)
|
|
|
|
{
|
|
|
|
struct btrfs_device *dev;
|
|
|
|
u64 tmp;
|
|
|
|
|
2013-10-31 13:00:08 +08:00
|
|
|
if (WARN_ON(!devid && !fs_info))
|
2013-08-23 18:20:17 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
|
|
|
dev = __alloc_device();
|
|
|
|
if (IS_ERR(dev))
|
|
|
|
return dev;
|
|
|
|
|
|
|
|
if (devid)
|
|
|
|
tmp = *devid;
|
|
|
|
else {
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = find_next_devid(fs_info, &tmp);
|
|
|
|
if (ret) {
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(dev);
|
2013-08-23 18:20:17 +08:00
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
dev->devid = tmp;
|
|
|
|
|
|
|
|
if (uuid)
|
|
|
|
memcpy(dev->uuid, uuid, BTRFS_UUID_SIZE);
|
|
|
|
else
|
|
|
|
generate_random_uuid(dev->uuid);
|
|
|
|
|
Btrfs: fix task hang under heavy compressed write
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.
Btrfs has migrated to the kernel workqueue, but that introduced this hang.
Btrfs has a kind of work that is queued in order, which means that its
ordered_func() must be processed FIFO, so it usually looks like --
normal_work_helper(arg)
work = container_of(arg, struct btrfs_work, normal_work);
work->func() <---- (we name it work X)
for ordered_work in wq->ordered_list
ordered_work->ordered_func()
ordered_work->ordered_free()
The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will
file a readahead request
btrfs_readpages()
for page that is not in page cache
__do_readpage()
submit_extent_page()
btrfs_submit_bio_hook()
btrfs_bio_wq_end_io()
submit_bio()
end_workqueue_bio() <--(ret by the 1st endio)
queue a work(named work Y) for the 2nd
also the real endio()
So the hang occurs when work Y's work_struct and work X's work_struct happen
to share the same address.
A bit more explanation,
A,B,C -- struct btrfs_work
arg -- struct work_struct
kthread:
worker_thread()
pick up a work_struct from @worklist
process_one_work(arg)
worker->current_work = arg; <-- arg is A->normal_work
worker->current_func(arg)
normal_work_helper(arg)
A = container_of(arg, struct btrfs_work, normal_work);
A->func()
A->ordered_func()
A->ordered_free() <-- A gets freed
B->ordered_func()
submit_compressed_extents()
find_free_extent()
load_free_space_inode()
... <-- (the above readahead stack)
end_workqueue_bio()
btrfs_queue_work(work C)
B->ordered_free()
If work A has a high priority in wq->ordered_list and more ordered
works are queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that the kernel workqueue
code worker_thread() still has worker->current_work pointing to work
A->normal_work's, i.e. arg's, address.
Meanwhile, work C is allocated after work A is freed, so work C->normal_work
and work A->normal_work are likely to share the same address (I confirmed this
with ftrace output, so I'm not just guessing; it's rare, though).
When another kthread picks up work C->normal_work to process and finds that our
kthread is processing it (see find_worker_executing_work()), it'll treat
work C as a collision and skip it, so nobody ends up processing work C.
So the situation is that our kthread is waiting forever on work C.
Besides, there are other cases that can lead to deadlock, but the real problem
is that all btrfs workqueues share one work->func, normal_work_helper,
so this patch gives each workqueue its own helper function, which is only a
wrapper of normal_work_helper.
With this patch, I no longer hit the above hang.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 23:36:53 +08:00
|
|
|
btrfs_init_work(&dev->work, btrfs_submit_helper,
|
|
|
|
pending_bios_fn, NULL, NULL);
|
2013-08-23 18:20:17 +08:00
|
|
|
|
|
|
|
return dev;
|
|
|
|
}
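The workqueue commit message quoted inside btrfs_alloc_device() above relies on `container_of(arg, struct btrfs_work, normal_work)` to get from an embedded work item back to the structure that contains it. Because the inner item lives inside the outer structure, freeing the outer structure also releases the inner item's address, which is how a later allocation can end up with the same address. Below is a user-space sketch of the pointer arithmetic. The macro and struct names are hypothetical, and the kernel macro adds type checking that is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Step back from a member pointer to the start of the enclosing struct. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct inner_work {
	int pending;
};

struct outer_work {
	int id;
	struct inner_work normal_work;	/* embedded, not a pointer */
};

/* Recover the outer structure from a pointer to its embedded member. */
struct outer_work *outer_from_inner(struct inner_work *w)
{
	return container_of_sketch(w, struct outer_work, normal_work);
}
```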
|
|
|
|
|
2016-06-04 03:05:15 +08:00
|
|
|
/* Return -EIO if any error, otherwise return 0. */
|
2016-06-23 06:54:24 +08:00
|
|
|
static int btrfs_check_chunk_valid(struct btrfs_fs_info *fs_info,
|
2016-06-04 03:05:15 +08:00
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk, u64 logical)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
u64 length;
|
2015-12-15 09:14:37 +08:00
|
|
|
u64 stripe_len;
|
2016-06-04 03:05:15 +08:00
|
|
|
u16 num_stripes;
|
|
|
|
u16 sub_stripes;
|
|
|
|
u64 type;
|
2018-07-04 18:16:39 +08:00
|
|
|
u64 features;
|
|
|
|
bool mixed = false;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-04-16 03:41:47 +08:00
|
|
|
length = btrfs_chunk_length(leaf, chunk);
|
2015-12-15 09:14:37 +08:00
|
|
|
stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
|
|
|
|
num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
2016-06-04 03:05:15 +08:00
|
|
|
sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
|
|
|
|
type = btrfs_chunk_type(leaf, chunk);
|
|
|
|
|
2015-12-15 09:14:37 +08:00
|
|
|
if (!num_stripes) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_err(fs_info, "invalid chunk num_stripes: %u",
|
2015-12-15 09:14:37 +08:00
|
|
|
num_stripes);
|
|
|
|
return -EIO;
|
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
if (!IS_ALIGNED(logical, fs_info->sectorsize)) {
|
|
|
|
btrfs_err(fs_info, "invalid chunk logical %llu", logical);
|
2015-12-15 09:14:37 +08:00
|
|
|
return -EIO;
|
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
if (btrfs_chunk_sector_size(leaf, chunk) != fs_info->sectorsize) {
|
|
|
|
btrfs_err(fs_info, "invalid chunk sectorsize %u",
|
2016-06-04 03:05:15 +08:00
|
|
|
btrfs_chunk_sector_size(leaf, chunk));
|
|
|
|
return -EIO;
|
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
if (!length || !IS_ALIGNED(length, fs_info->sectorsize)) {
|
|
|
|
btrfs_err(fs_info, "invalid chunk length %llu", length);
|
2015-12-15 09:14:37 +08:00
|
|
|
return -EIO;
|
|
|
|
}
|
2016-04-27 08:53:31 +08:00
|
|
|
if (!is_power_of_2(stripe_len) || stripe_len != BTRFS_STRIPE_LEN) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_err(fs_info, "invalid chunk stripe length: %llu",
|
2015-12-15 09:14:37 +08:00
|
|
|
stripe_len);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
|
2016-06-04 03:05:15 +08:00
|
|
|
type) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_err(fs_info, "unrecognized chunk type: %llu",
|
2015-12-15 09:14:37 +08:00
|
|
|
~(BTRFS_BLOCK_GROUP_TYPE_MASK |
|
|
|
|
BTRFS_BLOCK_GROUP_PROFILE_MASK) &
|
|
|
|
btrfs_chunk_type(leaf, chunk));
|
|
|
|
return -EIO;
|
|
|
|
}
|
2018-07-04 18:16:39 +08:00
|
|
|
|
|
|
|
if ((type & BTRFS_BLOCK_GROUP_TYPE_MASK) == 0) {
|
|
|
|
btrfs_err(fs_info, "missing chunk type flag: 0x%llx", type);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((type & BTRFS_BLOCK_GROUP_SYSTEM) &&
|
|
|
|
(type & (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA))) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"system chunk with data or metadata type: 0x%llx", type);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
features = btrfs_super_incompat_flags(fs_info->super_copy);
|
|
|
|
if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS)
|
|
|
|
mixed = true;
|
|
|
|
|
|
|
|
if (!mixed) {
|
|
|
|
if ((type & BTRFS_BLOCK_GROUP_METADATA) &&
|
|
|
|
(type & BTRFS_BLOCK_GROUP_DATA)) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"mixed chunk type in non-mixed mode: 0x%llx", type);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-04 03:05:15 +08:00
|
|
|
if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes != 2) ||
|
|
|
|
(type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
|
|
|
|
(type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
|
|
|
|
(type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
|
|
|
|
(type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||
|
|
|
|
((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
|
|
|
|
num_stripes != 1)) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_err(fs_info,
|
2016-06-04 03:05:15 +08:00
|
|
|
"invalid num_stripes:sub_stripes %u:%u for profile %llu",
|
|
|
|
num_stripes, sub_stripes,
|
|
|
|
type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
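Among its checks, btrfs_check_chunk_valid() above rejects a chunk whose type has no data/system/metadata flag, a system chunk that also carries a data or metadata flag, and a chunk that is both data and metadata unless the mixed-groups feature is enabled. A sketch of just those three rules follows. The `BG_*` constants are local to this example and are assumed to mirror the on-disk flag bits, so treat the values as illustrative. `chunk_type_ok()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values, assumed to match the on-disk layout. */
#define BG_DATA		(1ULL << 0)
#define BG_SYSTEM	(1ULL << 1)
#define BG_METADATA	(1ULL << 2)
#define BG_TYPE_MASK	(BG_DATA | BG_SYSTEM | BG_METADATA)

/* Returns true when the type flags pass the three rules described above. */
bool chunk_type_ok(uint64_t type, bool mixed_groups)
{
	/* At least one of data, system or metadata must be set. */
	if ((type & BG_TYPE_MASK) == 0)
		return false;
	/* A system chunk may not also be data or metadata. */
	if ((type & BG_SYSTEM) && (type & (BG_METADATA | BG_DATA)))
		return false;
	/* Data plus metadata is only valid in mixed-groups mode. */
	if (!mixed_groups && (type & BG_METADATA) && (type & BG_DATA))
		return false;
	return true;
}
```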
|
|
|
|
|
2017-10-09 11:07:45 +08:00
|
|
|
static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
|
2017-10-09 11:07:46 +08:00
|
|
|
u64 devid, u8 *uuid, bool error)
|
2017-10-09 11:07:45 +08:00
|
|
|
{
|
2017-10-09 11:07:46 +08:00
|
|
|
if (error)
|
|
|
|
btrfs_err_rl(fs_info, "devid %llu uuid %pU is missing",
|
|
|
|
devid, uuid);
|
|
|
|
else
|
|
|
|
btrfs_warn_rl(fs_info, "devid %llu uuid %pU is missing",
|
|
|
|
devid, uuid);
|
2017-10-09 11:07:45 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
|
2016-06-04 03:05:15 +08:00
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
|
2016-06-04 03:05:15 +08:00
|
|
|
struct map_lookup *map;
|
|
|
|
struct extent_map *em;
|
|
|
|
u64 logical;
|
|
|
|
u64 length;
|
|
|
|
u64 devid;
|
|
|
|
u8 uuid[BTRFS_UUID_SIZE];
|
|
|
|
int num_stripes;
|
|
|
|
int ret;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
logical = key->offset;
|
|
|
|
length = btrfs_chunk_length(leaf, chunk);
|
|
|
|
num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_check_chunk_valid(fs_info, leaf, chunk, logical);
|
2016-06-04 03:05:15 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2009-09-03 04:24:52 +08:00
|
|
|
read_lock(&map_tree->map_tree.lock);
|
2008-03-25 03:01:56 +08:00
|
|
|
em = lookup_extent_mapping(&map_tree->map_tree, logical, 1);
|
2009-09-03 04:24:52 +08:00
|
|
|
read_unlock(&map_tree->map_tree.lock);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
/* already mapped? */
|
|
|
|
if (em && em->start <= logical && em->start + em->len > logical) {
|
|
|
|
free_extent_map(em);
|
|
|
|
return 0;
|
|
|
|
} else if (em) {
|
|
|
|
free_extent_map(em);
|
|
|
|
}
|
|
|
|
|
2011-04-21 06:48:27 +08:00
|
|
|
em = alloc_extent_map();
|
2008-03-25 03:01:56 +08:00
|
|
|
if (!em)
|
|
|
|
return -ENOMEM;
|
2008-03-26 04:50:33 +08:00
|
|
|
map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
|
2008-03-25 03:01:56 +08:00
|
|
|
if (!map) {
|
|
|
|
free_extent_map(em);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2014-06-19 10:42:52 +08:00
|
|
|
set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
|
2015-06-03 22:55:48 +08:00
|
|
|
em->map_lookup = map;
|
2008-03-25 03:01:56 +08:00
|
|
|
em->start = logical;
|
|
|
|
em->len = length;
|
2012-10-12 04:54:30 +08:00
|
|
|
em->orig_start = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
em->block_start = 0;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 02:49:59 +08:00
|
|
|
em->block_len = em->len;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-03-26 04:50:33 +08:00
|
|
|
map->num_stripes = num_stripes;
|
|
|
|
map->io_width = btrfs_chunk_io_width(leaf, chunk);
|
|
|
|
map->io_align = btrfs_chunk_io_align(leaf, chunk);
|
|
|
|
map->stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
|
|
|
|
map->type = btrfs_chunk_type(leaf, chunk);
|
2008-04-16 22:49:51 +08:00
|
|
|
map->sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
|
2018-08-01 10:37:19 +08:00
|
|
|
map->verified_stripes = 0;
|
2008-03-26 04:50:33 +08:00
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
map->stripes[i].physical =
|
|
|
|
btrfs_stripe_offset_nr(leaf, chunk, i);
|
|
|
|
devid = btrfs_stripe_devid_nr(leaf, chunk, i);
|
2008-04-18 22:29:38 +08:00
|
|
|
read_extent_buffer(leaf, uuid, (unsigned long)
|
|
|
|
btrfs_stripe_dev_uuid_nr(chunk, i),
|
|
|
|
BTRFS_UUID_SIZE);
|
2016-06-23 06:54:23 +08:00
|
|
|
map->stripes[i].dev = btrfs_find_device(fs_info, devid,
|
2012-11-06 00:03:39 +08:00
|
|
|
uuid, NULL);
|
2016-06-10 09:38:35 +08:00
|
|
|
if (!map->stripes[i].dev &&
|
2016-06-23 06:54:23 +08:00
|
|
|
!btrfs_test_opt(fs_info, DEGRADED)) {
|
2008-03-26 04:50:33 +08:00
|
|
|
free_extent_map(em);
|
2017-10-09 11:07:46 +08:00
|
|
|
btrfs_report_missing_device(fs_info, devid, uuid, true);
|
2017-10-09 11:07:44 +08:00
|
|
|
return -ENOENT;
|
2008-03-26 04:50:33 +08:00
|
|
|
}
|
2008-05-14 01:46:40 +08:00
|
|
|
if (!map->stripes[i].dev) {
|
|
|
|
map->stripes[i].dev =
|
2016-06-23 06:54:24 +08:00
|
|
|
add_missing_dev(fs_info->fs_devices, devid,
|
|
|
|
uuid);
|
2017-10-11 12:46:18 +08:00
|
|
|
if (IS_ERR(map->stripes[i].dev)) {
|
2008-05-14 01:46:40 +08:00
|
|
|
free_extent_map(em);
|
2017-10-11 12:46:18 +08:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"failed to init missing dev %llu: %ld",
|
|
|
|
devid, PTR_ERR(map->stripes[i].dev));
|
|
|
|
return PTR_ERR(map->stripes[i].dev);
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2017-10-09 11:07:46 +08:00
|
|
|
btrfs_report_missing_device(fs_info, devid, uuid, false);
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2017-12-04 12:54:53 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
|
|
|
|
&(map->stripes[i].dev->dev_state));
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
2009-09-03 04:24:52 +08:00
|
|
|
write_lock(&map_tree->map_tree.lock);
|
2013-04-06 04:51:15 +08:00
|
|
|
ret = add_extent_mapping(&map_tree->map_tree, em, 0);
|
2009-09-03 04:24:52 +08:00
|
|
|
write_unlock(&map_tree->map_tree.lock);
|
2018-08-01 10:37:20 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"failed to add chunk map, start=%llu len=%llu: %d",
|
|
|
|
em->start, em->len, ret);
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
free_extent_map(em);
|
|
|
|
|
2018-08-01 10:37:20 +08:00
|
|
|
return ret;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
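read_one_chunk() above allocates its map with `kmalloc(map_lookup_size(num_stripes), GFP_NOFS)` and then fills `map->stripes[i]` for each stripe, which suggests a fixed header followed by one array entry per stripe. The sketch below shows that allocation pattern in user space with a C99 flexible array member. The struct layout and helper names are assumptions made for illustration and are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-stripe record. */
struct stripe {
	uint64_t physical;
};

/* Hypothetical header with a trailing, variable-length stripe array. */
struct lookup {
	int num_stripes;
	struct stripe stripes[];
};

/* Bytes needed for the header plus n stripe entries. */
size_t lookup_size(int n)
{
	return sizeof(struct lookup) + sizeof(struct stripe) * (size_t)n;
}

/* Allocate a zeroed lookup with room for n stripes; NULL on failure. */
struct lookup *lookup_alloc(int n)
{
	struct lookup *m;

	if (n <= 0)
		return NULL;
	m = calloc(1, lookup_size(n));
	if (!m)
		return NULL;
	m->num_stripes = n;
	return m;
}
```

The caller releases the whole object, header and stripes together, with a single `free()`.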
|
|
|
|
|
2012-03-01 21:56:26 +08:00
|
|
|
static void fill_device_from_item(struct extent_buffer *leaf,
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_dev_item *dev_item,
|
|
|
|
struct btrfs_device *device)
|
|
|
|
{
|
|
|
|
unsigned long ptr;
|
|
|
|
|
|
|
|
device->devid = btrfs_device_id(leaf, dev_item);
|
2009-04-27 19:29:03 +08:00
|
|
|
device->disk_total_bytes = btrfs_device_total_bytes(leaf, dev_item);
|
|
|
|
device->total_bytes = device->disk_total_bytes;
|
2014-09-03 21:35:33 +08:00
|
|
|
device->commit_total_bytes = device->disk_total_bytes;
|
2008-03-25 03:01:56 +08:00
|
|
|
device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
|
2014-09-03 21:35:34 +08:00
|
|
|
device->commit_bytes_used = device->bytes_used;
|
2008-03-25 03:01:56 +08:00
|
|
|
device->type = btrfs_device_type(leaf, dev_item);
|
|
|
|
device->io_align = btrfs_device_io_align(leaf, dev_item);
|
|
|
|
device->io_width = btrfs_device_io_width(leaf, dev_item);
|
|
|
|
device->sector_size = btrfs_device_sector_size(leaf, dev_item);
|
2012-11-06 20:15:27 +08:00
|
|
|
WARN_ON(device->devid == BTRFS_DEV_REPLACE_DEVID);
|
2017-12-04 12:54:55 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-08-20 19:20:11 +08:00
|
|
|
ptr = btrfs_device_uuid(dev_item);
|
2008-04-16 03:41:47 +08:00
|
|
|
read_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static struct btrfs_fs_devices *open_seed_devices(struct btrfs_fs_info *fs_info,
|
2014-09-03 21:35:46 +08:00
|
|
|
u8 *fsid)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
int ret;
|
|
|
|
|
2018-03-16 09:21:22 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
2017-06-14 08:48:07 +08:00
|
|
|
ASSERT(fsid);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
fs_devices = fs_info->fs_devices->seed;
|
2008-11-18 10:11:30 +08:00
|
|
|
while (fs_devices) {
|
2017-07-29 17:50:09 +08:00
|
|
|
if (!memcmp(fs_devices->fsid, fsid, BTRFS_FSID_SIZE))
|
2014-09-03 21:35:46 +08:00
|
|
|
return fs_devices;
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices = fs_devices->seed;
|
|
|
|
}
|
|
|
|
|
|
|
|
fs_devices = find_fsid(fsid);
|
|
|
|
if (!fs_devices) {
|
2016-06-23 06:54:23 +08:00
|
|
|
if (!btrfs_test_opt(fs_info, DEGRADED))
|
2014-09-03 21:35:46 +08:00
|
|
|
return ERR_PTR(-ENOENT);
|
|
|
|
|
|
|
|
fs_devices = alloc_fs_devices(fsid);
|
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return fs_devices;
|
|
|
|
|
|
|
|
fs_devices->seeding = 1;
|
|
|
|
fs_devices->opened = 1;
|
|
|
|
return fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
|
|
|
|
fs_devices = clone_fs_devices(fs_devices);
|
2014-09-03 21:35:46 +08:00
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2018-04-12 10:29:28 +08:00
|
|
|
ret = open_fs_devices(fs_devices, FMODE_READ, fs_info->bdev_holder);
|
2012-04-14 17:24:33 +08:00
|
|
|
if (ret) {
|
|
|
|
free_fs_devices(fs_devices);
|
2014-09-03 21:35:46 +08:00
|
|
|
fs_devices = ERR_PTR(ret);
|
2008-11-18 10:11:30 +08:00
|
|
|
goto out;
|
2012-04-14 17:24:33 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
if (!fs_devices->seeding) {
|
2018-04-12 10:29:27 +08:00
|
|
|
close_fs_devices(fs_devices);
|
2008-12-12 23:03:26 +08:00
|
|
|
free_fs_devices(fs_devices);
|
2014-09-03 21:35:46 +08:00
|
|
|
fs_devices = ERR_PTR(-EINVAL);
|
2008-11-18 10:11:30 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
fs_devices->seed = fs_info->fs_devices->seed;
|
|
|
|
fs_info->fs_devices->seed = fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
out:
|
2014-09-03 21:35:46 +08:00
|
|
|
return fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static int read_one_dev(struct btrfs_fs_info *fs_info,
|
2008-03-25 03:01:56 +08:00
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_dev_item *dev_item)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_device *device;
|
|
|
|
u64 devid;
|
|
|
|
int ret;
|
2017-07-29 17:50:09 +08:00
|
|
|
u8 fs_uuid[BTRFS_FSID_SIZE];
|
2008-04-18 22:29:38 +08:00
|
|
|
u8 dev_uuid[BTRFS_UUID_SIZE];
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
devid = btrfs_device_id(leaf, dev_item);
|
2013-08-20 19:20:11 +08:00
|
|
|
read_extent_buffer(leaf, dev_uuid, btrfs_device_uuid(dev_item),
|
2008-04-18 22:29:38 +08:00
|
|
|
BTRFS_UUID_SIZE);
|
2013-08-20 19:20:12 +08:00
|
|
|
read_extent_buffer(leaf, fs_uuid, btrfs_device_fsid(dev_item),
|
2017-07-29 17:50:09 +08:00
|
|
|
BTRFS_FSID_SIZE);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-07-29 17:50:09 +08:00
|
|
|
if (memcmp(fs_uuid, fs_info->fsid, BTRFS_FSID_SIZE)) {
|
2016-06-23 06:54:24 +08:00
|
|
|
fs_devices = open_seed_devices(fs_info, fs_uuid);
|
2014-09-03 21:35:46 +08:00
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return PTR_ERR(fs_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
device = btrfs_find_device(fs_info, devid, dev_uuid, fs_uuid);
|
2014-09-03 21:35:46 +08:00
|
|
|
if (!device) {
|
2017-03-09 09:34:42 +08:00
|
|
|
if (!btrfs_test_opt(fs_info, DEGRADED)) {
|
2017-10-09 11:07:46 +08:00
|
|
|
btrfs_report_missing_device(fs_info, devid,
|
|
|
|
dev_uuid, true);
|
2017-10-09 11:07:44 +08:00
|
|
|
return -ENOENT;
|
2017-03-09 09:34:42 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
device = add_missing_dev(fs_devices, devid, dev_uuid);
|
2017-10-11 12:46:18 +08:00
|
|
|
if (IS_ERR(device)) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"failed to add missing dev %llu: %ld",
|
|
|
|
devid, PTR_ERR(device));
|
|
|
|
return PTR_ERR(device);
|
|
|
|
}
|
2017-10-09 11:07:46 +08:00
|
|
|
btrfs_report_missing_device(fs_info, devid, dev_uuid, false);
|
2014-09-03 21:35:46 +08:00
|
|
|
} else {
|
2017-03-09 09:34:42 +08:00
|
|
|
if (!device->bdev) {
|
2017-10-09 11:07:46 +08:00
|
|
|
if (!btrfs_test_opt(fs_info, DEGRADED)) {
|
|
|
|
btrfs_report_missing_device(fs_info,
|
|
|
|
devid, dev_uuid, true);
|
2017-10-09 11:07:44 +08:00
|
|
|
return -ENOENT;
|
2017-10-09 11:07:46 +08:00
|
|
|
}
|
|
|
|
btrfs_report_missing_device(fs_info, devid,
|
|
|
|
dev_uuid, false);
|
2017-03-09 09:34:42 +08:00
|
|
|
}
|
2014-09-03 21:35:46 +08:00
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
if (!device->bdev &&
|
|
|
|
!test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
|
2010-12-14 03:56:23 +08:00
|
|
|
/*
|
|
|
|
* This happens when a device that was properly set up
|
|
|
|
* in the device info lists suddenly goes bad.
|
|
|
|
* device->bdev is NULL, and so we have to set
|
|
|
|
* BTRFS_DEV_STATE_MISSING in device->dev_state here.
|
|
|
|
*/
|
2014-09-03 21:35:46 +08:00
|
|
|
device->fs_devices->missing_devices++;
|
2017-12-04 12:54:54 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2014-09-03 21:35:46 +08:00
|
|
|
|
|
|
|
/* Move the device to its own fs_devices */
|
|
|
|
if (device->fs_devices != fs_devices) {
|
2017-12-04 12:54:54 +08:00
|
|
|
ASSERT(test_bit(BTRFS_DEV_STATE_MISSING,
|
|
|
|
&device->dev_state));
|
2014-09-03 21:35:46 +08:00
|
|
|
|
|
|
|
list_move(&device->dev_list, &fs_devices->devices);
|
|
|
|
device->fs_devices->num_devices--;
|
|
|
|
fs_devices->num_devices++;
|
|
|
|
|
|
|
|
device->fs_devices->missing_devices--;
|
|
|
|
fs_devices->missing_devices++;
|
|
|
|
|
|
|
|
device->fs_devices = fs_devices;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
if (device->fs_devices != fs_info->fs_devices) {
|
2017-12-04 12:54:52 +08:00
|
|
|
BUG_ON(test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state));
|
2008-11-18 10:11:30 +08:00
|
|
|
if (device->generation !=
|
|
|
|
btrfs_device_generation(leaf, dev_item))
|
|
|
|
return -EINVAL;
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
fill_device_from_item(leaf, dev_item, device);
|
2017-12-04 12:54:53 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
|
2017-12-04 12:54:55 +08:00
|
|
|
!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
|
2008-11-18 10:11:30 +08:00
|
|
|
device->fs_devices->total_rw_bytes += device->total_bytes;
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_add(device->total_bytes - device->bytes_used,
|
|
|
|
&fs_info->free_chunk_space);
|
2011-09-27 05:12:22 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = 0;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2016-06-22 09:16:51 +08:00
|
|
|
struct btrfs_root *root = fs_info->tree_root;
|
2016-09-20 22:05:02 +08:00
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2008-05-07 23:43:44 +08:00
|
|
|
struct extent_buffer *sb;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_disk_key *disk_key;
|
|
|
|
struct btrfs_chunk *chunk;
|
2014-11-01 02:02:42 +08:00
|
|
|
u8 *array_ptr;
|
|
|
|
unsigned long sb_array_offset;
|
2008-04-25 21:04:37 +08:00
|
|
|
int ret = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
u32 num_stripes;
|
|
|
|
u32 array_size;
|
|
|
|
u32 len = 0;
|
2014-11-01 02:02:42 +08:00
|
|
|
u32 cur_offset;
|
2016-06-04 03:05:15 +08:00
|
|
|
u64 type;
|
2008-04-25 21:04:37 +08:00
|
|
|
struct btrfs_key key;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ASSERT(BTRFS_SUPER_INFO_SIZE <= fs_info->nodesize);
|
2014-06-15 08:39:54 +08:00
|
|
|
/*
|
|
|
|
* This will create extent buffer of nodesize, superblock size is
|
|
|
|
* fixed to BTRFS_SUPER_INFO_SIZE. If nodesize > sb size, this will
|
|
|
|
* overallocate but we can keep it as-is, only the first page is used.
|
|
|
|
*/
|
2016-06-23 06:54:24 +08:00
|
|
|
sb = btrfs_find_create_tree_block(fs_info, BTRFS_SUPER_INFO_OFFSET);
|
2016-06-07 03:01:23 +08:00
|
|
|
if (IS_ERR(sb))
|
|
|
|
return PTR_ERR(sb);
|
2015-12-03 20:06:46 +08:00
|
|
|
set_extent_buffer_uptodate(sb);
|
2011-07-27 04:11:19 +08:00
|
|
|
btrfs_set_buffer_lockdep_class(root->root_key.objectid, sb, 0);
|
2011-10-08 00:06:13 +08:00
|
|
|
/*
|
2016-05-20 09:18:45 +08:00
|
|
|
* The sb extent buffer is artificial and just used to read the system array.
|
2015-12-03 20:06:46 +08:00
|
|
|
* set_extent_buffer_uptodate() call does not properly mark all its
|
2011-10-08 00:06:13 +08:00
|
|
|
* pages up-to-date when the page is larger: extent does not cover the
|
|
|
|
* whole page and consequently check_page_uptodate does not find all
|
|
|
|
* the page's extents up-to-date (the hole beyond sb),
|
|
|
|
* write_extent_buffer then triggers a WARN_ON.
|
|
|
|
*
|
|
|
|
* Regular short extents go through mark_extent_buffer_dirty/writeback cycle,
|
|
|
|
* but sb spans only this function. Add an explicit SetPageUptodate call
|
|
|
|
* to silence the warning, e.g. on PowerPC 64.
|
|
|
|
*/
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
if (PAGE_SIZE > BTRFS_SUPER_INFO_SIZE)
|
2010-08-07 01:21:20 +08:00
|
|
|
SetPageUptodate(sb->pages[0]);
|
2009-02-13 03:09:45 +08:00
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
write_extent_buffer(sb, super_copy, 0, BTRFS_SUPER_INFO_SIZE);
|
2008-03-25 03:01:56 +08:00
|
|
|
array_size = btrfs_super_sys_array_size(super_copy);
|
|
|
|
|
2014-11-01 02:02:42 +08:00
|
|
|
array_ptr = super_copy->sys_chunk_array;
|
|
|
|
sb_array_offset = offsetof(struct btrfs_super_block, sys_chunk_array);
|
|
|
|
cur_offset = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2014-11-01 02:02:42 +08:00
|
|
|
while (cur_offset < array_size) {
|
|
|
|
disk_key = (struct btrfs_disk_key *)array_ptr;
|
2014-11-05 22:24:51 +08:00
|
|
|
len = sizeof(*disk_key);
|
|
|
|
if (cur_offset + len > array_size)
|
|
|
|
goto out_short_read;
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_disk_key_to_cpu(&key, disk_key);
|
|
|
|
|
2014-11-01 02:02:42 +08:00
|
|
|
array_ptr += len;
|
|
|
|
sb_array_offset += len;
|
|
|
|
cur_offset += len;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
if (key.type == BTRFS_CHUNK_ITEM_KEY) {
|
2014-11-01 02:02:42 +08:00
|
|
|
chunk = (struct btrfs_chunk *)sb_array_offset;
|
2014-11-05 22:24:51 +08:00
|
|
|
/*
|
|
|
|
* At least one btrfs_chunk with one stripe must be
|
|
|
|
* present, exact stripe count check comes afterwards
|
|
|
|
*/
|
|
|
|
len = btrfs_chunk_item_size(1);
|
|
|
|
if (cur_offset + len > array_size)
|
|
|
|
goto out_short_read;
|
|
|
|
|
|
|
|
num_stripes = btrfs_chunk_num_stripes(sb, chunk);
|
2015-12-01 00:27:06 +08:00
|
|
|
if (!num_stripes) {
|
2016-09-20 22:05:02 +08:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"invalid number of stripes %u in sys_array at offset %u",
|
2015-12-01 00:27:06 +08:00
|
|
|
num_stripes, cur_offset);
|
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2016-06-04 03:05:15 +08:00
|
|
|
type = btrfs_chunk_type(sb, chunk);
|
|
|
|
if ((type & BTRFS_BLOCK_GROUP_SYSTEM) == 0) {
|
2016-09-20 22:05:02 +08:00
|
|
|
btrfs_err(fs_info,
|
2016-06-04 03:05:15 +08:00
|
|
|
"invalid chunk type %llu in sys_array at offset %u",
|
|
|
|
type, cur_offset);
|
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2014-11-05 22:24:51 +08:00
|
|
|
len = btrfs_chunk_item_size(num_stripes);
|
|
|
|
if (cur_offset + len > array_size)
|
|
|
|
goto out_short_read;
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = read_one_chunk(fs_info, &key, sb, chunk);
|
2008-04-25 21:04:37 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
} else {
|
2016-09-20 22:05:02 +08:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"unexpected item type %u in sys_array at offset %u",
|
|
|
|
(u32)key.type, cur_offset);
|
2008-04-25 21:04:37 +08:00
|
|
|
ret = -EIO;
|
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
2014-11-01 02:02:42 +08:00
|
|
|
array_ptr += len;
|
|
|
|
sb_array_offset += len;
|
|
|
|
cur_offset += len;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
2016-06-04 08:41:42 +08:00
|
|
|
clear_extent_buffer_uptodate(sb);
|
2016-05-14 08:06:59 +08:00
|
|
|
free_extent_buffer_stale(sb);
|
2008-04-25 21:04:37 +08:00
|
|
|
return ret;
|
2014-11-05 22:24:51 +08:00
|
|
|
|
|
|
|
out_short_read:
|
2016-09-20 22:05:02 +08:00
|
|
|
btrfs_err(fs_info, "sys_array too short to read %u bytes at offset %u",
|
2014-11-05 22:24:51 +08:00
|
|
|
len, cur_offset);
|
2016-06-04 08:41:42 +08:00
|
|
|
clear_extent_buffer_uptodate(sb);
|
2016-05-14 08:06:59 +08:00
|
|
|
free_extent_buffer_stale(sb);
|
2014-11-05 22:24:51 +08:00
|
|
|
return -EIO;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
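The two-stage length check used by btrfs_read_sys_array() above (validate the fixed-size part first, read the stripe count from it, then validate the full item size before consuming it) can be sketched in user space as follows. This is only an illustration under stated assumptions: the record layout (a 2-byte header followed by 4-byte stripes), the constants, and the name walk_array_sketch() are hypothetical and do not match the btrfs on-disk format or any kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout, for illustration only (not the btrfs format). */
enum {
	SKETCH_HDR_SIZE = 2,	/* byte 0: item type, byte 1: stripe count */
	SKETCH_STRIPE_SIZE = 4,	/* bytes occupied by each stripe */
	SKETCH_TYPE_CHUNK = 1,	/* the only item type accepted */
};

/*
 * Walk a packed array of variable-length records.
 * Returns the number of records parsed, or -1 if a record is truncated,
 * has an unexpected type, or claims zero stripes.
 */
int walk_array_sketch(const uint8_t *buf, uint32_t size)
{
	uint32_t cur = 0;
	int count = 0;

	assert(buf != NULL || size == 0);

	while (cur < size) {
		uint32_t len = SKETCH_HDR_SIZE;
		uint8_t type, num_stripes;

		/* Stage 1: the fixed-size header must fit before it is read. */
		if (len > size - cur)
			return -1;
		type = buf[cur];
		num_stripes = buf[cur + 1];
		if (type != SKETCH_TYPE_CHUNK || num_stripes == 0)
			return -1;

		/* Stage 2: now that the count is known, check the full size. */
		len = SKETCH_HDR_SIZE + (uint32_t)num_stripes * SKETCH_STRIPE_SIZE;
		if (len > size - cur)
			return -1;

		cur += len;
		count++;
	}
	return count;
}
```

Writing the bound as `len > size - cur` rather than `cur + len > size` is a choice made for this sketch: because the loop guarantees `cur < size`, the subtraction cannot wrap around.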
|
|
|
|
|
btrfs: Introduce a function to check if all chunks are OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs are OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount checks, rather than the old one-size-fits-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do a global
check for tolerated missing devices.
Although the one-size-fits-all solution is quite safe, it's too strict
if data and metadata have different duplication levels.
For example, if one uses Single data and RAID1 metadata on 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, sometimes all single chunks may be on the existing
device and in that case, we should allow it to be rw degraded mounted.
Such a case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only on sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allowing the above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
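The per-chunk idea described in the commit message above can be sketched in user space roughly as follows. This is a simplified illustration, not the kernel implementation: struct chunk_sketch, rw_degradable_sketch(), and the use of plain device indices are hypothetical stand-ins, and the tolerance values are supplied by the caller rather than derived from a chunk profile.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a chunk mapping (not a kernel type). */
struct chunk_sketch {
	int max_tolerated;	/* missing devices this chunk can tolerate */
	int num_stripes;	/* number of entries in stripe_dev */
	const int *stripe_dev;	/* index of the device backing each stripe */
};

/*
 * Return true only if every chunk has no more missing stripe devices than
 * that chunk itself tolerates. An empty chunk list yields false, mirroring
 * the "no chunk at all" case in the function that follows.
 */
bool rw_degradable_sketch(const struct chunk_sketch *chunks, size_t nr_chunks,
			  const bool *dev_missing)
{
	size_t c;
	int i;

	assert(nr_chunks == 0 || (chunks != NULL && dev_missing != NULL));

	if (nr_chunks == 0)
		return false;

	for (c = 0; c < nr_chunks; c++) {
		int missing = 0;

		/* Count the missing devices among this chunk's stripes only. */
		for (i = 0; i < chunks[c].num_stripes; i++)
			if (dev_missing[chunks[c].stripe_dev[i]])
				missing++;

		if (missing > chunks[c].max_tolerated)
			return false;
	}
	return true;
}
```

With two devices where device 1 is missing, a mirrored metadata chunk on both devices (tolerating one missing device) plus a single-copy data chunk on device 0 (tolerating none) passes this check, while the same layout with the data chunk on device 1 fails. A global per-filesystem tolerance would reject both cases.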
|
|
|
/*
|
|
|
|
* Check if all chunks in the fs are OK for read-write degraded mount
|
|
|
|
*
|
2017-12-18 17:08:59 +08:00
|
|
|
* If the @failing_dev is specified, it's accounted as missing.
|
|
|
|
*
|
2017-03-09 09:34:36 +08:00
|
|
|
* Return true if all chunks meet the minimal RW mount requirements.
|
|
|
|
* Return false if any chunk doesn't meet the minimal RW mount requirements.
|
|
|
|
*/
|
2017-12-18 17:08:59 +08:00
|
|
|
bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_device *failing_dev)
|
2017-03-09 09:34:36 +08:00
|
|
|
{
|
|
|
|
struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
|
|
|
|
struct extent_map *em;
|
|
|
|
u64 next_start = 0;
|
|
|
|
bool ret = true;
|
|
|
|
|
|
|
|
read_lock(&map_tree->map_tree.lock);
|
|
|
|
em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)-1);
|
|
|
|
read_unlock(&map_tree->map_tree.lock);
|
|
|
|
/* No chunk at all? Return false anyway */
|
|
|
|
if (!em) {
|
|
|
|
ret = false;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
while (em) {
|
|
|
|
struct map_lookup *map;
|
|
|
|
int missing = 0;
|
|
|
|
int max_tolerated;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
map = em->map_lookup;
|
|
|
|
max_tolerated =
|
|
|
|
btrfs_get_num_tolerated_disk_barrier_failures(
|
|
|
|
map->type);
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
struct btrfs_device *dev = map->stripes[i].dev;
|
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
if (!dev || !dev->bdev ||
|
|
|
|
test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) ||
|
2017-03-09 09:34:36 +08:00
|
|
|
dev->last_flush_error)
|
|
|
|
missing++;
|
2017-12-18 17:08:59 +08:00
|
|
|
else if (failing_dev && failing_dev == dev)
|
|
|
|
missing++;
|
2017-03-09 09:34:36 +08:00
|
|
|
}
|
|
|
|
if (missing > max_tolerated) {
|
2017-12-18 17:08:59 +08:00
|
|
|
if (!failing_dev)
|
|
|
|
btrfs_warn(fs_info,
|
2017-03-09 09:34:36 +08:00
|
|
|
"chunk %llu missing %d devices, max tolerance is %d for writeable mount",
|
|
|
|
em->start, missing, max_tolerated);
|
|
|
|
free_extent_map(em);
|
|
|
|
ret = false;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
next_start = extent_map_end(em);
|
|
|
|
free_extent_map(em);
|
|
|
|
|
|
|
|
read_lock(&map_tree->map_tree.lock);
|
|
|
|
em = lookup_extent_mapping(&map_tree->map_tree, next_start,
|
|
|
|
(u64)(-1) - next_start);
|
|
|
|
read_unlock(&map_tree->map_tree.lock);
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-21 22:40:19 +08:00
|
|
|
int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
int ret;
|
|
|
|
int slot;
|
2016-06-04 03:05:14 +08:00
|
|
|
u64 total_dev = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2018-04-12 10:29:32 +08:00
|
|
|
/*
|
|
|
|
* uuid_mutex is needed only if we are mounting a sprout FS
|
|
|
|
* otherwise we don't need it.
|
|
|
|
*/
|
2011-12-07 11:38:24 +08:00
|
|
|
mutex_lock(&uuid_mutex);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2011-12-07 11:38:24 +08:00
|
|
|
|
2013-07-30 19:03:04 +08:00
|
|
|
/*
|
|
|
|
* Read all device items, and then all the chunk items. All
|
|
|
|
* device items are found before any chunk item (their object id
|
|
|
|
* is smaller than the lowest possible object id for a chunk
|
|
|
|
* item - BTRFS_FIRST_CHUNK_TREE_OBJECTID).
|
2008-03-25 03:01:56 +08:00
|
|
|
*/
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.offset = 0;
|
|
|
|
key.type = 0;
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
2010-03-25 20:34:49 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2008-03-25 03:01:56 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
if (slot >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret == 0)
|
|
|
|
continue;
|
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, slot);
|
2013-07-30 19:03:04 +08:00
|
|
|
if (found_key.type == BTRFS_DEV_ITEM_KEY) {
|
|
|
|
struct btrfs_dev_item *dev_item;
|
|
|
|
dev_item = btrfs_item_ptr(leaf, slot,
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_dev_item);
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = read_one_dev(fs_info, leaf, dev_item);
|
2013-07-30 19:03:04 +08:00
|
|
|
if (ret)
|
|
|
|
goto error;
|
2016-06-04 03:05:14 +08:00
|
|
|
total_dev++;
|
2008-03-25 03:01:56 +08:00
|
|
|
} else if (found_key.type == BTRFS_CHUNK_ITEM_KEY) {
|
|
|
|
struct btrfs_chunk *chunk;
|
|
|
|
chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = read_one_chunk(fs_info, &found_key, leaf, chunk);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (ret)
|
|
|
|
goto error;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
path->slots[0]++;
|
|
|
|
}
|
2016-06-04 03:05:14 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* After loading chunk tree, we've got all device information,
|
|
|
|
* do another round of validation checks.
|
|
|
|
*/
|
2016-06-23 06:54:23 +08:00
|
|
|
if (total_dev != fs_info->fs_devices->total_devices) {
|
|
|
|
btrfs_err(fs_info,
|
2016-06-04 03:05:14 +08:00
|
|
|
"super_num_devices %llu mismatch with num_devices %llu found here",
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_super_num_devices(fs_info->super_copy),
|
2016-06-04 03:05:14 +08:00
|
|
|
total_dev);
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto error;
|
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
if (btrfs_super_total_bytes(fs_info->super_copy) <
|
|
|
|
fs_info->fs_devices->total_rw_bytes) {
|
|
|
|
btrfs_err(fs_info,
|
2016-06-04 03:05:14 +08:00
|
|
|
"super_total_bytes %llu mismatch with fs_devices total_rw_bytes %llu",
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_super_total_bytes(fs_info->super_copy),
|
|
|
|
fs_info->fs_devices->total_rw_bytes);
|
2016-06-04 03:05:14 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto error;
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = 0;
|
|
|
|
error:
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2011-12-07 11:38:24 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_free_path(path);
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2012-05-25 22:06:08 +08:00
|
|
|
|
2013-05-15 15:48:19 +08:00
|
|
|
void btrfs_init_devices_late(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
|
2014-05-11 23:14:59 +08:00
|
|
|
while (fs_devices) {
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list)
|
2016-06-23 06:54:56 +08:00
|
|
|
device->fs_info = fs_info;
|
2014-05-11 23:14:59 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
fs_devices = fs_devices->seed;
|
|
|
|
}
|
2013-05-15 15:48:19 +08:00
|
|
|
}
|
|
|
|
|
2012-05-25 22:06:10 +08:00
|
|
|
static void __btrfs_reset_dev_stats(struct btrfs_device *dev)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
|
|
|
|
btrfs_dev_stat_reset(dev, i);
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct btrfs_root *dev_root = fs_info->dev_root;
|
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
int slot;
|
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_path *path = NULL;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list) {
|
|
|
|
int item_size;
|
|
|
|
struct btrfs_dev_stats_item *ptr;
|
|
|
|
|
2016-01-26 00:51:31 +08:00
|
|
|
key.objectid = BTRFS_DEV_STATS_OBJECTID;
|
|
|
|
key.type = BTRFS_PERSISTENT_ITEM_KEY;
|
2012-05-25 22:06:10 +08:00
|
|
|
key.offset = device->devid;
|
|
|
|
ret = btrfs_search_slot(NULL, dev_root, &key, path, 0, 0);
|
|
|
|
if (ret) {
|
|
|
|
__btrfs_reset_dev_stats(device);
|
|
|
|
device->dev_stats_valid = 1;
|
|
|
|
btrfs_release_path(path);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
slot = path->slots[0];
|
|
|
|
eb = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(eb, &found_key, slot);
|
|
|
|
item_size = btrfs_item_size_nr(eb, slot);
|
|
|
|
|
|
|
|
ptr = btrfs_item_ptr(eb, slot,
|
|
|
|
struct btrfs_dev_stats_item);
|
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++) {
|
|
|
|
if (item_size >= (1 + i) * sizeof(__le64))
|
|
|
|
btrfs_dev_stat_set(device, i,
|
|
|
|
btrfs_dev_stats_value(eb, ptr, i));
|
|
|
|
else
|
|
|
|
btrfs_dev_stat_reset(device, i);
|
|
|
|
}
|
|
|
|
|
|
|
|
device->dev_stats_valid = 1;
|
|
|
|
btrfs_dev_stat_print_on_load(device);
|
|
|
|
btrfs_release_path(path);
|
|
|
|
}
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret < 0 ? ret : 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int update_dev_stat_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_device *device)
|
|
|
|
{
|
2018-07-21 00:37:49 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2016-06-22 09:16:51 +08:00
|
|
|
struct btrfs_root *dev_root = fs_info->dev_root;
|
2012-05-25 22:06:10 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
struct btrfs_dev_stats_item *ptr;
|
|
|
|
int ret;
|
|
|
|
int i;
|
|
|
|
|
2016-01-26 00:51:31 +08:00
|
|
|
key.objectid = BTRFS_DEV_STATS_OBJECTID;
|
|
|
|
key.type = BTRFS_PERSISTENT_ITEM_KEY;
|
2012-05-25 22:06:10 +08:00
|
|
|
key.offset = device->devid;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
2017-02-15 16:35:01 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2012-05-25 22:06:10 +08:00
|
|
|
ret = btrfs_search_slot(trans, dev_root, &key, path, -1, 1);
|
|
|
|
if (ret < 0) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
2015-10-08 15:01:03 +08:00
|
|
|
"error %d while searching for dev_stats item for device %s",
|
2012-06-05 02:03:51 +08:00
|
|
|
ret, rcu_str_deref(device->name));
|
2012-05-25 22:06:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ret == 0 &&
|
|
|
|
btrfs_item_size_nr(path->nodes[0], path->slots[0]) < sizeof(*ptr)) {
|
|
|
|
/* need to delete old one and insert a new one */
|
|
|
|
ret = btrfs_del_item(trans, dev_root, path);
|
|
|
|
if (ret != 0) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
2015-10-08 15:01:03 +08:00
|
|
|
"delete too small dev_stats item for device %s failed %d",
|
2012-06-05 02:03:51 +08:00
|
|
|
rcu_str_deref(device->name), ret);
|
2012-05-25 22:06:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
ret = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ret == 1) {
|
|
|
|
/* need to insert a new item */
|
|
|
|
btrfs_release_path(path);
|
|
|
|
ret = btrfs_insert_empty_item(trans, dev_root, path,
|
|
|
|
&key, sizeof(*ptr));
|
|
|
|
if (ret < 0) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
2015-10-08 15:01:03 +08:00
|
|
|
"insert dev_stats item for device %s failed %d",
|
|
|
|
rcu_str_deref(device->name), ret);
|
2012-05-25 22:06:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
eb = path->nodes[0];
|
|
|
|
ptr = btrfs_item_ptr(eb, path->slots[0], struct btrfs_dev_stats_item);
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
|
|
|
|
btrfs_set_dev_stats_value(eb, ptr, i,
|
|
|
|
btrfs_dev_stat_read(device, i));
|
|
|
|
btrfs_mark_buffer_dirty(eb);
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Called from commit_transaction. Writes all changed device stats to disk.
|
|
|
|
*/
|
|
|
|
int btrfs_run_dev_stats(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
|
|
|
struct btrfs_device *device;
|
2014-07-24 11:37:11 +08:00
|
|
|
int stats_cnt;
|
2012-05-25 22:06:10 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list) {
|
2017-10-24 18:47:37 +08:00
|
|
|
stats_cnt = atomic_read(&device->dev_stats_ccnt);
|
|
|
|
if (!device->dev_stats_valid || stats_cnt == 0)
|
2012-05-25 22:06:10 +08:00
|
|
|
continue;
|
|
|
|
|
2017-10-24 18:47:37 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* There is a LOAD-LOAD control dependency between the value of
|
|
|
|
* dev_stats_ccnt and updating the on-disk values which requires
|
|
|
|
* reading the in-memory counters. Such control dependencies
|
|
|
|
* require explicit read memory barriers.
|
|
|
|
*
|
|
|
|
* This memory barrier pairs with smp_mb__before_atomic in
|
|
|
|
* btrfs_dev_stat_inc/btrfs_dev_stat_set and with the full
|
|
|
|
* barrier implied by atomic_xchg in
|
|
|
|
* btrfs_dev_stat_read_and_reset
|
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
|
2018-07-21 00:37:49 +08:00
|
|
|
ret = update_dev_stat_item(trans, device);
|
2012-05-25 22:06:10 +08:00
|
|
|
if (!ret)
|
2014-07-24 11:37:11 +08:00
|
|
|
atomic_sub(stats_cnt, &device->dev_stats_ccnt);
|
2012-05-25 22:06:10 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-05-25 22:06:08 +08:00
|
|
|
void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)
|
|
|
|
{
|
|
|
|
btrfs_dev_stat_inc(dev, index);
|
|
|
|
btrfs_dev_stat_print_on_error(dev);
|
|
|
|
}
|
|
|
|
|
2013-04-26 04:41:01 +08:00
|
|
|
static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev)
|
2012-05-25 22:06:08 +08:00
|
|
|
{
|
2012-05-25 22:06:10 +08:00
|
|
|
if (!dev->dev_stats_valid)
|
|
|
|
return;
|
2016-06-23 06:54:56 +08:00
|
|
|
btrfs_err_rl_in_rcu(dev->fs_info,
|
2015-10-08 16:43:10 +08:00
|
|
|
"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
|
2012-06-05 02:03:51 +08:00
|
|
|
rcu_str_deref(dev->name),
|
2012-05-25 22:06:08 +08:00
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
|
2013-12-21 00:37:06 +08:00
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_GENERATION_ERRS));
|
2012-05-25 22:06:08 +08:00
|
|
|
}
|
2012-05-25 22:06:09 +08:00
|
|
|
|
2012-05-25 22:06:10 +08:00
|
|
|
static void btrfs_dev_stat_print_on_load(struct btrfs_device *dev)
|
|
|
|
{
|
2012-07-17 23:02:11 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
|
|
|
|
if (btrfs_dev_stat_read(dev, i) != 0)
|
|
|
|
break;
|
|
|
|
if (i == BTRFS_DEV_STAT_VALUES_MAX)
|
|
|
|
return; /* all values == 0, suppress message */
|
|
|
|
|
2016-06-23 06:54:56 +08:00
|
|
|
btrfs_info_in_rcu(dev->fs_info,
|
2015-10-08 15:01:03 +08:00
|
|
|
"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
|
2012-06-05 02:03:51 +08:00
|
|
|
rcu_str_deref(dev->name),
|
2012-05-25 22:06:10 +08:00
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_GENERATION_ERRS));
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
|
2012-06-22 20:30:39 +08:00
|
|
|
struct btrfs_ioctl_get_dev_stats *stats)
|
2012-05-25 22:06:09 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *dev;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2012-05-25 22:06:09 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2016-06-23 06:54:23 +08:00
|
|
|
dev = btrfs_find_device(fs_info, stats->devid, NULL, NULL);
|
2012-05-25 22:06:09 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
if (!dev) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info, "get dev_stats failed, device not found");
|
2012-05-25 22:06:09 +08:00
|
|
|
return -ENODEV;
|
2012-05-25 22:06:10 +08:00
|
|
|
} else if (!dev->dev_stats_valid) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info, "get dev_stats failed, not yet valid");
|
2012-05-25 22:06:10 +08:00
|
|
|
return -ENODEV;
|
2012-06-22 20:30:39 +08:00
|
|
|
} else if (stats->flags & BTRFS_DEV_STATS_RESET) {
|
2012-05-25 22:06:09 +08:00
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++) {
|
|
|
|
if (stats->nr_items > i)
|
|
|
|
stats->values[i] =
|
|
|
|
btrfs_dev_stat_read_and_reset(dev, i);
|
|
|
|
else
|
|
|
|
btrfs_dev_stat_reset(dev, i);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
|
|
|
|
if (stats->nr_items > i)
|
|
|
|
stats->values[i] = btrfs_dev_stat_read(dev, i);
|
|
|
|
}
|
|
|
|
if (stats->nr_items > BTRFS_DEV_STAT_VALUES_MAX)
|
|
|
|
stats->nr_items = BTRFS_DEV_STAT_VALUES_MAX;
|
|
|
|
return 0;
|
|
|
|
}
|
2012-11-05 22:50:14 +08:00
|
|
|
|
2017-02-15 00:55:53 +08:00
|
|
|
void btrfs_scratch_superblocks(struct block_device *bdev, const char *device_path)
|
2012-11-05 22:50:14 +08:00
|
|
|
{
|
|
|
|
struct buffer_head *bh;
|
|
|
|
struct btrfs_super_block *disk_super;
|
2015-08-14 18:32:59 +08:00
|
|
|
int copy_num;
|
2012-11-05 22:50:14 +08:00
|
|
|
|
2015-08-14 18:32:59 +08:00
|
|
|
if (!bdev)
|
|
|
|
return;
|
2012-11-05 22:50:14 +08:00
|
|
|
|
2015-08-14 18:32:59 +08:00
|
|
|
for (copy_num = 0; copy_num < BTRFS_SUPER_MIRROR_MAX;
|
|
|
|
copy_num++) {
|
2012-11-05 22:50:14 +08:00
|
|
|
|
2015-08-14 18:32:59 +08:00
|
|
|
if (btrfs_read_dev_one_super(bdev, copy_num, &bh))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
disk_super = (struct btrfs_super_block *)bh->b_data;
|
|
|
|
|
|
|
|
memset(&disk_super->magic, 0, sizeof(disk_super->magic));
|
|
|
|
set_buffer_dirty(bh);
|
|
|
|
sync_dirty_buffer(bh);
|
|
|
|
brelse(bh);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Notify udev that device has changed */
|
|
|
|
btrfs_kobject_uevent(bdev, KOBJ_CHANGE);
|
|
|
|
|
|
|
|
/* Update ctime/mtime for device path for libblkid */
|
|
|
|
update_dev_time(device_path);
|
2012-11-05 22:50:14 +08:00
|
|
|
}
|
2014-09-03 21:35:33 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Update the size of all devices, which is used for writing out the
|
|
|
|
* super blocks.
|
|
|
|
*/
|
|
|
|
void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
|
|
|
struct btrfs_device *curr, *next;
|
|
|
|
|
|
|
|
if (list_empty(&fs_devices->resized_devices))
|
|
|
|
return;
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:33 +08:00
|
|
|
list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
|
|
|
|
resized_list) {
|
|
|
|
list_del_init(&curr->resized_list);
|
|
|
|
curr->commit_total_bytes = curr->disk_total_bytes;
|
|
|
|
}
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:33 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
}
|
2014-09-03 21:35:34 +08:00
|
|
|
|
|
|
|
/* Must be invoked during the transaction commit */
|
2018-02-07 23:55:49 +08:00
|
|
|
void btrfs_update_commit_device_bytes_used(struct btrfs_transaction *trans)
|
2014-09-03 21:35:34 +08:00
|
|
|
{
|
2018-02-07 23:55:49 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2014-09-03 21:35:34 +08:00
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
|
|
|
struct btrfs_device *dev;
|
|
|
|
int i;
|
|
|
|
|
2018-02-07 23:55:49 +08:00
|
|
|
if (list_empty(&trans->pending_chunks))
|
2014-09-03 21:35:34 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
/* In order to kick the device replace finish process */
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2018-02-07 23:55:49 +08:00
|
|
|
list_for_each_entry(em, &trans->pending_chunks, list) {
|
2015-06-03 22:55:48 +08:00
|
|
|
map = em->map_lookup;
|
2014-09-03 21:35:34 +08:00
|
|
|
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
dev = map->stripes[i].dev;
|
|
|
|
dev->commit_bytes_used = dev->bytes_used;
|
|
|
|
}
|
|
|
|
}
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:34 +08:00
|
|
|
}
|
2015-03-10 06:38:31 +08:00
|
|
|
|
|
|
|
void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
|
|
|
while (fs_devices) {
|
|
|
|
fs_devices->fs_info = fs_info;
|
|
|
|
fs_devices = fs_devices->seed;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
|
|
|
while (fs_devices) {
|
|
|
|
fs_devices->fs_info = NULL;
|
|
|
|
fs_devices = fs_devices->seed;
|
|
|
|
}
|
|
|
|
}
|
2018-07-14 02:46:30 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Multiplicity factor for simple profiles: DUP, RAID1-like and RAID10.
|
|
|
|
*/
|
|
|
|
int btrfs_bg_type_to_factor(u64 flags)
|
|
|
|
{
|
|
|
|
if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
return 2;
|
|
|
|
return 1;
|
|
|
|
}
|
2018-08-01 10:37:19 +08:00
|
|
|
|
|
|
|
|
|
|
|
static u64 calc_stripe_length(u64 type, u64 chunk_len, int num_stripes)
|
|
|
|
{
|
|
|
|
int index = btrfs_bg_flags_to_raid_index(type);
|
|
|
|
int ncopies = btrfs_raid_array[index].ncopies;
|
|
|
|
int data_stripes;
|
|
|
|
|
|
|
|
switch (type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID5:
|
|
|
|
data_stripes = num_stripes - 1;
|
|
|
|
break;
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID6:
|
|
|
|
data_stripes = num_stripes - 2;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
data_stripes = num_stripes / ncopies;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return div_u64(chunk_len, data_stripes);
|
|
|
|
}
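/*
 * Illustrative worked example for calc_stripe_length() above. This is an
 * editorial sketch, not part of the original source; the chunk sizes are
 * made-up values chosen to keep the arithmetic simple.
 *
 *   RAID5, num_stripes = 4, chunk_len = 3 GiB:
 *       data_stripes = 4 - 1 = 3, so each dev extent is 3 GiB / 3 = 1 GiB.
 *   RAID6, num_stripes = 4, chunk_len = 2 GiB:
 *       data_stripes = 4 - 2 = 2, so each dev extent is 2 GiB / 2 = 1 GiB.
 *   DUP, num_stripes = 2, ncopies = 2, chunk_len = 1 GiB:
 *       data_stripes = 2 / 2 = 1, so each dev extent is 1 GiB / 1 = 1 GiB.
 *
 * verify_one_dev_extent() compares this value against the on-disk dev
 * extent length and reports -EUCLEAN on a mismatch.
 */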
|
|
|
|
|
|
|
|
static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 chunk_offset, u64 devid,
|
|
|
|
u64 physical_offset, u64 physical_len)
|
|
|
|
{
|
|
|
|
struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
|
|
|
|
struct extent_map *em;
|
|
|
|
struct map_lookup *map;
|
2018-10-05 17:45:55 +08:00
|
|
|
struct btrfs_device *dev;
|
2018-08-01 10:37:19 +08:00
|
|
|
u64 stripe_len;
|
|
|
|
bool found = false;
|
|
|
|
int ret = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
read_lock(&em_tree->lock);
|
|
|
|
em = lookup_extent_mapping(em_tree, chunk_offset, 1);
|
|
|
|
read_unlock(&em_tree->lock);
|
|
|
|
|
|
|
|
if (!em) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent physical offset %llu on devid %llu doesn't have corresponding chunk",
|
|
|
|
physical_offset, devid);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
map = em->map_lookup;
|
|
|
|
stripe_len = calc_stripe_length(map->type, em->len, map->num_stripes);
|
|
|
|
if (physical_len != stripe_len) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent physical offset %llu on devid %llu length doesn't match chunk %llu, have %llu expect %llu",
|
|
|
|
physical_offset, devid, em->start, physical_len,
|
|
|
|
stripe_len);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
if (map->stripes[i].dev->devid == devid &&
|
|
|
|
map->stripes[i].physical == physical_offset) {
|
|
|
|
found = true;
|
|
|
|
if (map->verified_stripes >= map->num_stripes) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"too many dev extents for chunk %llu found",
|
|
|
|
em->start);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
map->verified_stripes++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!found) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent physical offset %llu devid %llu has no corresponding chunk",
|
|
|
|
physical_offset, devid);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
}
|
2018-10-05 17:45:55 +08:00
|
|
|
|
|
|
|
/* Make sure no dev extent is beyond device boundary */
|
|
|
|
dev = btrfs_find_device(fs_info, devid, NULL, NULL);
|
|
|
|
if (!dev) {
|
|
|
|
btrfs_err(fs_info, "failed to find devid %llu", devid);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (physical_offset + physical_len > dev->disk_total_bytes) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent devid %llu physical offset %llu len %llu is beyond device boundary %llu",
|
|
|
|
devid, physical_offset, physical_len,
|
|
|
|
dev->disk_total_bytes);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
2018-08-01 10:37:19 +08:00
|
|
|
out:
|
|
|
|
free_extent_map(em);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int verify_chunk_dev_extent_mapping(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
|
|
|
|
struct extent_map *em;
|
|
|
|
struct rb_node *node;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
read_lock(&em_tree->lock);
|
2018-08-23 03:51:52 +08:00
|
|
|
for (node = rb_first_cached(&em_tree->map); node; node = rb_next(node)) {
|
2018-08-01 10:37:19 +08:00
|
|
|
em = rb_entry(node, struct extent_map, rb_node);
|
|
|
|
if (em->map_lookup->num_stripes !=
|
|
|
|
em->map_lookup->verified_stripes) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"chunk %llu has missing dev extent, have %d expect %d",
|
|
|
|
em->start, em->map_lookup->verified_stripes,
|
|
|
|
em->map_lookup->num_stripes);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
read_unlock(&em_tree->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure that all dev extents are mapped to the correct chunk, otherwise
|
|
|
|
* later chunk allocation/free would cause unexpected behavior.
|
|
|
|
*
|
|
|
|
* NOTE: This will iterate through the whole device tree, which should be of
|
|
|
|
* the same size level as the chunk tree. This slightly increases mount time.
|
|
|
|
*/
|
|
|
|
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
|
|
|
struct btrfs_key key;
|
btrfs: volumes: Make sure there is no overlap of dev extents at mount time
Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.
Analysis from Hans:
"Imagine allocating a DATA|DUP chunk.
In the chunk allocator, we first set...
max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.
Then...
/* we don't want a chunk larger than 10% of writeable space */
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
max_chunk_size);
Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.
The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
stripe_size is now 789MiB
Next we do...
data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.
Next there's a check...
if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.
We go into the if code, and next is...
stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB
Next is a fun one...
/* bump the answer up to a 16MB boundary */
stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.
We're not done changing stripe_size yet...
/* But don't go higher than the limits we found while searching
* for free extents
*/
stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);
This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).
The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
The last shown line of code relies on a situation where there's twice
the value of stripe_size present as value for the variable stripe_size
when it's DUP. This was actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
"[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."
In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?
Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?
What is the main purpose of this round_up?
The most straightforward thing to do seems something like...
stripe_size = min(
div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
stripe_size
)
..just putting half of the max_avail into stripe_size."
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-05 17:45:54 +08:00
|
|
|
u64 prev_devid = 0;
|
|
|
|
u64 prev_dev_ext_end = 0;
|
2018-08-01 10:37:19 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
key.objectid = 1;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
path->reada = READA_FORWARD;
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
|
|
|
|
ret = btrfs_next_item(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
/* No dev extents at all? Not good */
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
while (1) {
|
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
struct btrfs_dev_extent *dext;
|
|
|
|
int slot = path->slots[0];
|
|
|
|
u64 chunk_offset;
|
|
|
|
u64 physical_offset;
|
|
|
|
u64 physical_len;
|
|
|
|
u64 devid;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
if (key.type != BTRFS_DEV_EXTENT_KEY)
|
|
|
|
break;
|
|
|
|
devid = key.objectid;
|
|
|
|
physical_offset = key.offset;
|
|
|
|
|
|
|
|
dext = btrfs_item_ptr(leaf, slot, struct btrfs_dev_extent);
|
|
|
|
chunk_offset = btrfs_dev_extent_chunk_offset(leaf, dext);
|
|
|
|
physical_len = btrfs_dev_extent_length(leaf, dext);
|
|
|
|
|
2018-10-05 17:45:54 +08:00
|
|
|
/* Check if this dev extent overlaps with the previous one */
|
|
|
|
if (devid == prev_devid && physical_offset < prev_dev_ext_end) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent devid %llu physical offset %llu overlap with previous dev extent end %llu",
|
|
|
|
devid, physical_offset, prev_dev_ext_end);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2018-08-01 10:37:19 +08:00
|
|
|
ret = verify_one_dev_extent(fs_info, chunk_offset, devid,
|
|
|
|
physical_offset, physical_len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2018-10-05 17:45:54 +08:00
|
|
|
prev_devid = devid;
|
|
|
|
prev_dev_ext_end = physical_offset + physical_len;
|
|
|
|
|
2018-08-01 10:37:19 +08:00
|
|
|
ret = btrfs_next_item(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Ensure all chunks have corresponding dev extents */
|
|
|
|
ret = verify_chunk_dev_extent_mapping(fs_info);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check whether the given block group or device is pinned by any inode being
|
|
|
|
* used as a swapfile.
|
|
|
|
*/
|
|
|
|
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
|
|
|
|
{
|
|
|
|
struct btrfs_swapfile_pin *sp;
|
|
|
|
struct rb_node *node;
|
|
|
|
|
|
|
|
spin_lock(&fs_info->swapfile_pins_lock);
|
|
|
|
node = fs_info->swapfile_pins.rb_node;
|
|
|
|
while (node) {
|
|
|
|
sp = rb_entry(node, struct btrfs_swapfile_pin, node);
|
|
|
|
if (ptr < sp->ptr)
|
|
|
|
node = node->rb_left;
|
|
|
|
else if (ptr > sp->ptr)
|
|
|
|
node = node->rb_right;
|
|
|
|
else
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->swapfile_pins_lock);
|
|
|
|
return node != NULL;
|
|
|
|
}
|