2018-04-04 01:23:33 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2008-03-25 03:01:56 +08:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2007 Oracle. All rights reserved.
|
|
|
|
*/
|
2018-04-04 01:23:33 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
#include <linux/sched.h>
|
2020-08-31 22:52:42 +08:00
|
|
|
#include <linux/sched/mm.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
2012-05-25 22:06:08 +08:00
|
|
|
#include <linux/ratelimit.h>
|
2012-01-17 04:04:48 +08:00
|
|
|
#include <linux/kthread.h>
|
2013-08-15 23:11:21 +08:00
|
|
|
#include <linux/semaphore.h>
|
2016-05-21 08:01:00 +08:00
|
|
|
#include <linux/uuid.h>
|
2018-01-23 06:49:36 +08:00
|
|
|
#include <linux/list_sort.h>
|
2021-10-15 01:11:01 +08:00
|
|
|
#include <linux/namei.h>
|
2019-08-22 00:54:28 +08:00
|
|
|
#include "misc.h"
|
2008-03-25 03:01:56 +08:00
|
|
|
#include "ctree.h"
|
|
|
|
#include "disk-io.h"
|
|
|
|
#include "transaction.h"
|
|
|
|
#include "volumes.h"
|
2013-01-30 07:40:14 +08:00
|
|
|
#include "raid56.h"
|
2012-06-05 02:03:51 +08:00
|
|
|
#include "rcu-string.h"
|
2012-11-06 20:15:27 +08:00
|
|
|
#include "dev-replace.h"
|
2014-06-03 11:36:00 +08:00
|
|
|
#include "sysfs.h"
|
2019-03-20 13:16:42 +08:00
|
|
|
#include "tree-checker.h"
|
2019-06-19 04:09:16 +08:00
|
|
|
#include "space-info.h"
|
2019-06-21 03:37:44 +08:00
|
|
|
#include "block-group.h"
|
2019-12-14 08:22:14 +08:00
|
|
|
#include "discard.h"
|
2020-11-10 19:26:07 +08:00
|
|
|
#include "zoned.h"
|
2022-10-19 22:50:47 +08:00
|
|
|
#include "fs.h"
|
2022-10-19 22:51:00 +08:00
|
|
|
#include "accessors.h"
|
2022-10-27 03:08:28 +08:00
|
|
|
#include "uuid-tree.h"
|
2022-10-27 03:08:29 +08:00
|
|
|
#include "ioctl.h"
|
2022-10-27 03:08:34 +08:00
|
|
|
#include "relocation.h"
|
2022-10-27 03:08:35 +08:00
|
|
|
#include "scrub.h"
|
2022-10-27 03:08:40 +08:00
|
|
|
#include "super.h"
|
2023-09-15 00:07:00 +08:00
|
|
|
#include "raid-stripe-tree.h"
|
2008-03-25 03:01:56 +08:00
|
|
|
|
btrfs: don't check stripe length if the profile is not stripe based
[BUG]
When debugging calc_bio_boundaries(), I found that even for RAID1
metadata, we're following stripe length to calculate stripe boundary.
# mkfs.btrfs -m raid1 -d raid1 /dev/test/scratch[12]
# mount /dev/test/scratch /mnt/btrfs
# xfs_io -f -c "pwrite 0 64K" /mnt/btrfs/file
# umount
Above very basic operations will make calc_bio_boundaries() to report
the following result:
submit_extent_page: r/i=1/1 file_offset=22036480 len_to_stripe_boundary=49152
submit_extent_page: r/i=1/1 file_offset=30474240 len_to_stripe_boundary=65536
...
submit_extent_page: r/i=1/1 file_offset=30523392 len_to_stripe_boundary=16384
submit_extent_page: r/i=1/1 file_offset=30457856 len_to_stripe_boundary=16384
submit_extent_page: r/i=5/257 file_offset=0 len_to_stripe_boundary=65536
submit_extent_page: r/i=5/257 file_offset=65536 len_to_stripe_boundary=65536
submit_extent_page: r/i=1/1 file_offset=30490624 len_to_stripe_boundary=49152
submit_extent_page: r/i=1/1 file_offset=30507008 len_to_stripe_boundary=32768
Where "r/i" is the rootid and inode, 1/1 means they metadata.
The remaining names match the member used in kernel.
Even all data/metadata are using RAID1, we're still following stripe
length.
[CAUSE]
This behavior is caused by a wrong condition in btrfs_get_io_geometry():
if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
/* Fill using stripe_len */
len = min_t(u64, em->len - offset, max_len);
} else {
len = em->len - offset;
}
This means, only for SINGLE we will not follow stripe_len.
However for profiles like RAID1*, DUP, they don't need to bother
stripe_len.
This can lead to unnecessary bio split for RAID1*/DUP profiles, and can
even be a blockage for future zoned RAID support.
[FIX]
Introduce one single-use macro, BTRFS_BLOCK_GROUP_STRIPE_MASK, and
change the condition to only calculate the length using stripe length
for stripe based profiles.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-19 14:19:33 +08:00
|
|
|
#define BTRFS_BLOCK_GROUP_STRIPE_MASK (BTRFS_BLOCK_GROUP_RAID0 | \
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10 | \
|
|
|
|
BTRFS_BLOCK_GROUP_RAID56_MASK)
|
|
|
|
|
2023-12-13 22:42:57 +08:00
|
|
|
struct btrfs_io_geometry {
|
|
|
|
u32 stripe_index;
|
|
|
|
u32 stripe_nr;
|
|
|
|
int mirror_num;
|
|
|
|
int num_stripes;
|
|
|
|
u64 stripe_offset;
|
|
|
|
u64 raid56_full_stripe_start;
|
|
|
|
int max_errors;
|
|
|
|
enum btrfs_map_op op;
|
|
|
|
};
|
|
|
|
|
2015-09-15 21:08:06 +08:00
|
|
|
const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
|
|
|
|
[BTRFS_RAID_RAID10] = {
|
|
|
|
.sub_stripes = 2,
|
|
|
|
.dev_stripes = 1,
|
|
|
|
.devs_max = 0, /* 0 == as many as possible */
|
btrfs: allow degenerate raid0/raid10
The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.
Change this to allow raid0 on one device and raid10 on two devices. This
works as expected eg. when converting or removing devices.
The only difference is when raid0 on two devices gets one device
removed. Unpatched would silently create a single profile, while newly
it would be raid0.
The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same.
Unpatched kernel will mount and use the degenerate profiles just fine
but won't allow any operation that would not satisfy the stricter device
number constraints, eg. not allowing to go from 3 to 2 devices for
raid10 or various profile conversions.
Example output:
# btrfs fi us -T .
Overall:
Device size: 10.00GiB
Device allocated: 1.01GiB
Device unallocated: 8.99GiB
Device missing: 0.00B
Used: 200.61MiB
Free (estimated): 9.79GiB (min: 9.79GiB)
Free (statfs, df): 9.79GiB
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 3.25MiB (used: 0.00B)
Multiple profiles: no
Data Metadata System
Id Path RAID0 single single Unallocated
-- ---------- --------- --------- -------- -----------
1 /dev/sda10 1.00GiB 8.00MiB 1.00MiB 8.99GiB
-- ---------- --------- --------- -------- -----------
Total 1.00GiB 8.00MiB 1.00MiB 8.99GiB
Used 200.25MiB 352.00KiB 16.00KiB
# btrfs dev us .
/dev/sda10, ID: 1
Device size: 10.00GiB
Device slack: 0.00B
Data,RAID0/1: 1.00GiB
Metadata,single: 8.00MiB
System,single: 1.00MiB
Unallocated: 8.99GiB
Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
profile is printed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-23 02:54:37 +08:00
|
|
|
.devs_min = 2,
|
2015-09-15 21:08:07 +08:00
|
|
|
.tolerated_failures = 1,
|
2015-09-15 21:08:06 +08:00
|
|
|
.devs_increment = 2,
|
|
|
|
.ncopies = 2,
|
2018-10-05 05:24:42 +08:00
|
|
|
.nparity = 0,
|
2018-04-25 19:01:42 +08:00
|
|
|
.raid_name = "raid10",
|
2018-04-25 19:01:43 +08:00
|
|
|
.bg_flag = BTRFS_BLOCK_GROUP_RAID10,
|
2018-04-25 19:01:44 +08:00
|
|
|
.mindev_error = BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
|
2015-09-15 21:08:06 +08:00
|
|
|
},
|
|
|
|
[BTRFS_RAID_RAID1] = {
|
|
|
|
.sub_stripes = 1,
|
|
|
|
.dev_stripes = 1,
|
|
|
|
.devs_max = 2,
|
|
|
|
.devs_min = 2,
|
2015-09-15 21:08:07 +08:00
|
|
|
.tolerated_failures = 1,
|
2015-09-15 21:08:06 +08:00
|
|
|
.devs_increment = 2,
|
|
|
|
.ncopies = 2,
|
2018-10-05 05:24:42 +08:00
|
|
|
.nparity = 0,
|
2018-04-25 19:01:42 +08:00
|
|
|
.raid_name = "raid1",
|
2018-04-25 19:01:43 +08:00
|
|
|
.bg_flag = BTRFS_BLOCK_GROUP_RAID1,
|
2018-04-25 19:01:44 +08:00
|
|
|
.mindev_error = BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
|
2015-09-15 21:08:06 +08:00
|
|
|
},
|
2018-03-03 05:56:53 +08:00
|
|
|
[BTRFS_RAID_RAID1C3] = {
|
|
|
|
.sub_stripes = 1,
|
|
|
|
.dev_stripes = 1,
|
2019-11-27 23:10:54 +08:00
|
|
|
.devs_max = 3,
|
2018-03-03 05:56:53 +08:00
|
|
|
.devs_min = 3,
|
|
|
|
.tolerated_failures = 2,
|
|
|
|
.devs_increment = 3,
|
|
|
|
.ncopies = 3,
|
2019-11-25 22:34:48 +08:00
|
|
|
.nparity = 0,
|
2018-03-03 05:56:53 +08:00
|
|
|
.raid_name = "raid1c3",
|
|
|
|
.bg_flag = BTRFS_BLOCK_GROUP_RAID1C3,
|
|
|
|
.mindev_error = BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
|
|
|
|
},
|
2018-03-03 05:56:53 +08:00
|
|
|
[BTRFS_RAID_RAID1C4] = {
|
|
|
|
.sub_stripes = 1,
|
|
|
|
.dev_stripes = 1,
|
2019-11-27 23:10:54 +08:00
|
|
|
.devs_max = 4,
|
2018-03-03 05:56:53 +08:00
|
|
|
.devs_min = 4,
|
|
|
|
.tolerated_failures = 3,
|
|
|
|
.devs_increment = 4,
|
|
|
|
.ncopies = 4,
|
2019-11-25 22:34:48 +08:00
|
|
|
.nparity = 0,
|
2018-03-03 05:56:53 +08:00
|
|
|
.raid_name = "raid1c4",
|
|
|
|
.bg_flag = BTRFS_BLOCK_GROUP_RAID1C4,
|
|
|
|
.mindev_error = BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
|
|
|
|
},
|
2015-09-15 21:08:06 +08:00
|
|
|
[BTRFS_RAID_DUP] = {
|
|
|
|
.sub_stripes = 1,
|
|
|
|
.dev_stripes = 2,
|
|
|
|
.devs_max = 1,
|
|
|
|
.devs_min = 1,
|
2015-09-15 21:08:07 +08:00
|
|
|
.tolerated_failures = 0,
|
2015-09-15 21:08:06 +08:00
|
|
|
.devs_increment = 1,
|
|
|
|
.ncopies = 2,
|
2018-10-05 05:24:42 +08:00
|
|
|
.nparity = 0,
|
2018-04-25 19:01:42 +08:00
|
|
|
.raid_name = "dup",
|
2018-04-25 19:01:43 +08:00
|
|
|
.bg_flag = BTRFS_BLOCK_GROUP_DUP,
|
2018-04-25 19:01:44 +08:00
|
|
|
.mindev_error = 0,
|
2015-09-15 21:08:06 +08:00
|
|
|
},
|
|
|
|
[BTRFS_RAID_RAID0] = {
|
|
|
|
.sub_stripes = 1,
|
|
|
|
.dev_stripes = 1,
|
|
|
|
.devs_max = 0,
|
btrfs: allow degenerate raid0/raid10
The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.
Change this to allow raid0 on one device and raid10 on two devices. This
works as expected eg. when converting or removing devices.
The only difference is when raid0 on two devices gets one device
removed. Unpatched would silently create a single profile, while newly
it would be raid0.
The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same.
Unpatched kernel will mount and use the degenerate profiles just fine
but won't allow any operation that would not satisfy the stricter device
number constraints, eg. not allowing to go from 3 to 2 devices for
raid10 or various profile conversions.
Example output:
# btrfs fi us -T .
Overall:
Device size: 10.00GiB
Device allocated: 1.01GiB
Device unallocated: 8.99GiB
Device missing: 0.00B
Used: 200.61MiB
Free (estimated): 9.79GiB (min: 9.79GiB)
Free (statfs, df): 9.79GiB
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 3.25MiB (used: 0.00B)
Multiple profiles: no
Data Metadata System
Id Path RAID0 single single Unallocated
-- ---------- --------- --------- -------- -----------
1 /dev/sda10 1.00GiB 8.00MiB 1.00MiB 8.99GiB
-- ---------- --------- --------- -------- -----------
Total 1.00GiB 8.00MiB 1.00MiB 8.99GiB
Used 200.25MiB 352.00KiB 16.00KiB
# btrfs dev us .
/dev/sda10, ID: 1
Device size: 10.00GiB
Device slack: 0.00B
Data,RAID0/1: 1.00GiB
Metadata,single: 8.00MiB
System,single: 1.00MiB
Unallocated: 8.99GiB
Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
profile is printed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-23 02:54:37 +08:00
|
|
|
.devs_min = 1,
|
2015-09-15 21:08:07 +08:00
|
|
|
.tolerated_failures = 0,
|
2015-09-15 21:08:06 +08:00
|
|
|
.devs_increment = 1,
|
|
|
|
.ncopies = 1,
|
2018-10-05 05:24:42 +08:00
|
|
|
.nparity = 0,
|
2018-04-25 19:01:42 +08:00
|
|
|
.raid_name = "raid0",
|
2018-04-25 19:01:43 +08:00
|
|
|
.bg_flag = BTRFS_BLOCK_GROUP_RAID0,
|
2018-04-25 19:01:44 +08:00
|
|
|
.mindev_error = 0,
|
2015-09-15 21:08:06 +08:00
|
|
|
},
|
|
|
|
[BTRFS_RAID_SINGLE] = {
|
|
|
|
.sub_stripes = 1,
|
|
|
|
.dev_stripes = 1,
|
|
|
|
.devs_max = 1,
|
|
|
|
.devs_min = 1,
|
2015-09-15 21:08:07 +08:00
|
|
|
.tolerated_failures = 0,
|
2015-09-15 21:08:06 +08:00
|
|
|
.devs_increment = 1,
|
|
|
|
.ncopies = 1,
|
2018-10-05 05:24:42 +08:00
|
|
|
.nparity = 0,
|
2018-04-25 19:01:42 +08:00
|
|
|
.raid_name = "single",
|
2018-04-25 19:01:43 +08:00
|
|
|
.bg_flag = 0,
|
2018-04-25 19:01:44 +08:00
|
|
|
.mindev_error = 0,
|
2015-09-15 21:08:06 +08:00
|
|
|
},
|
|
|
|
[BTRFS_RAID_RAID5] = {
|
|
|
|
.sub_stripes = 1,
|
|
|
|
.dev_stripes = 1,
|
|
|
|
.devs_max = 0,
|
|
|
|
.devs_min = 2,
|
2015-09-15 21:08:07 +08:00
|
|
|
.tolerated_failures = 1,
|
2015-09-15 21:08:06 +08:00
|
|
|
.devs_increment = 1,
|
2018-10-05 05:24:41 +08:00
|
|
|
.ncopies = 1,
|
2018-10-05 05:24:42 +08:00
|
|
|
.nparity = 1,
|
2018-04-25 19:01:42 +08:00
|
|
|
.raid_name = "raid5",
|
2018-04-25 19:01:43 +08:00
|
|
|
.bg_flag = BTRFS_BLOCK_GROUP_RAID5,
|
2018-04-25 19:01:44 +08:00
|
|
|
.mindev_error = BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
|
2015-09-15 21:08:06 +08:00
|
|
|
},
|
|
|
|
[BTRFS_RAID_RAID6] = {
|
|
|
|
.sub_stripes = 1,
|
|
|
|
.dev_stripes = 1,
|
|
|
|
.devs_max = 0,
|
|
|
|
.devs_min = 3,
|
2015-09-15 21:08:07 +08:00
|
|
|
.tolerated_failures = 2,
|
2015-09-15 21:08:06 +08:00
|
|
|
.devs_increment = 1,
|
2018-10-05 05:24:41 +08:00
|
|
|
.ncopies = 1,
|
2018-10-05 05:24:42 +08:00
|
|
|
.nparity = 2,
|
2018-04-25 19:01:42 +08:00
|
|
|
.raid_name = "raid6",
|
2018-04-25 19:01:43 +08:00
|
|
|
.bg_flag = BTRFS_BLOCK_GROUP_RAID6,
|
2018-04-25 19:01:44 +08:00
|
|
|
.mindev_error = BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
|
2015-09-15 21:08:06 +08:00
|
|
|
},
|
|
|
|
};
|
|
|
|
|
2021-07-26 20:15:19 +08:00
|
|
|
/*
|
|
|
|
* Convert block group flags (BTRFS_BLOCK_GROUP_*) to btrfs_raid_types, which
|
|
|
|
* can be used as index to access btrfs_raid_array[].
|
|
|
|
*/
|
|
|
|
enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags)
|
|
|
|
{
|
2022-04-20 16:08:28 +08:00
|
|
|
const u64 profile = (flags & BTRFS_BLOCK_GROUP_PROFILE_MASK);
|
2021-07-26 20:15:19 +08:00
|
|
|
|
2022-04-20 16:08:28 +08:00
|
|
|
if (!profile)
|
|
|
|
return BTRFS_RAID_SINGLE;
|
|
|
|
|
|
|
|
return BTRFS_BG_FLAG_TO_INDEX(profile);
|
2021-07-26 20:15:19 +08:00
|
|
|
}
|
|
|
|
|
2019-05-17 17:43:41 +08:00
|
|
|
const char *btrfs_bg_type_to_raid_name(u64 flags)
|
2018-04-25 19:01:42 +08:00
|
|
|
{
|
2019-05-17 17:43:41 +08:00
|
|
|
const int index = btrfs_bg_flags_to_raid_index(flags);
|
|
|
|
|
|
|
|
if (index >= BTRFS_NR_RAID_TYPES)
|
2018-04-25 19:01:42 +08:00
|
|
|
return NULL;
|
|
|
|
|
2019-05-17 17:43:41 +08:00
|
|
|
return btrfs_raid_array[index].raid_name;
|
2018-04-25 19:01:42 +08:00
|
|
|
}
|
|
|
|
|
2022-05-13 16:34:30 +08:00
|
|
|
int btrfs_nr_parity_stripes(u64 type)
|
|
|
|
{
|
|
|
|
enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(type);
|
|
|
|
|
|
|
|
return btrfs_raid_array[index].nparity;
|
|
|
|
}
|
|
|
|
|
2018-11-20 16:12:55 +08:00
|
|
|
/*
|
|
|
|
* Fill @buf with textual description of @bg_flags, no more than @size_buf
|
|
|
|
* bytes including terminating null byte.
|
|
|
|
*/
|
|
|
|
void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
int ret;
|
|
|
|
char *bp = buf;
|
|
|
|
u64 flags = bg_flags;
|
|
|
|
u32 size_bp = size_buf;
|
|
|
|
|
|
|
|
if (!flags) {
|
|
|
|
strcpy(bp, "NONE");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
#define DESCRIBE_FLAG(flag, desc) \
|
|
|
|
do { \
|
|
|
|
if (flags & (flag)) { \
|
|
|
|
ret = snprintf(bp, size_bp, "%s|", (desc)); \
|
|
|
|
if (ret < 0 || ret >= size_bp) \
|
|
|
|
goto out_overflow; \
|
|
|
|
size_bp -= ret; \
|
|
|
|
bp += ret; \
|
|
|
|
flags &= ~(flag); \
|
|
|
|
} \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
|
|
|
|
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
|
|
|
|
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
|
|
|
|
|
|
|
|
DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
|
|
|
|
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
|
|
|
|
DESCRIBE_FLAG(btrfs_raid_array[i].bg_flag,
|
|
|
|
btrfs_raid_array[i].raid_name);
|
|
|
|
#undef DESCRIBE_FLAG
|
|
|
|
|
|
|
|
if (flags) {
|
|
|
|
ret = snprintf(bp, size_bp, "0x%llx|", flags);
|
|
|
|
size_bp -= ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (size_bp < size_buf)
|
|
|
|
buf[size_buf - size_bp - 1] = '\0'; /* remove last | */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The text is trimmed, it's up to the caller to provide sufficiently
|
|
|
|
* large buffer
|
|
|
|
*/
|
|
|
|
out_overflow:;
|
|
|
|
}
|
|
|
|
|
2019-03-20 23:29:13 +08:00
|
|
|
static int init_first_rw_device(struct btrfs_trans_handle *trans);
|
2016-06-23 06:54:24 +08:00
|
|
|
static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
|
2012-05-25 22:06:10 +08:00
|
|
|
static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-06-17 04:30:00 +08:00
|
|
|
/*
|
|
|
|
* Device locking
|
|
|
|
* ==============
|
|
|
|
*
|
|
|
|
* There are several mutexes that protect manipulation of devices and low-level
|
|
|
|
* structures like chunks but not block groups, extents or files
|
|
|
|
*
|
|
|
|
* uuid_mutex (global lock)
|
|
|
|
* ------------------------
|
|
|
|
* protects the fs_uuids list that tracks all per-fs fs_devices, resulting from
|
|
|
|
* the SCAN_DEV ioctl registration or from mount either implicitly (the first
|
|
|
|
* device) or requested by the device= mount option
|
|
|
|
*
|
|
|
|
* the mutex can be very coarse and can cover long-running operations
|
|
|
|
*
|
|
|
|
* protects: updates to fs_devices counters like missing devices, rw devices,
|
2018-11-28 19:05:13 +08:00
|
|
|
* seeding, structure cloning, opening/closing devices at mount/umount time
|
2017-06-17 04:30:00 +08:00
|
|
|
*
|
|
|
|
* global::fs_devs - add, remove, updates to the global list
|
|
|
|
*
|
btrfs: open device without device_list_mutex
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-18 03:12:27 +08:00
|
|
|
* does not protect: manipulation of the fs_devices::devices list in general
|
|
|
|
* but in mount context it could be used to exclude list modifications by eg.
|
|
|
|
* scan ioctl
|
2017-06-17 04:30:00 +08:00
|
|
|
*
|
|
|
|
* btrfs_device::name - renames (write side), read is RCU
|
|
|
|
*
|
|
|
|
* fs_devices::device_list_mutex (per-fs, with RCU)
|
|
|
|
* ------------------------------------------------
|
|
|
|
* protects updates to fs_devices::devices, ie. adding and deleting
|
|
|
|
*
|
|
|
|
* simple list traversal with read-only actions can be done with RCU protection
|
|
|
|
*
|
|
|
|
* may be used to exclude some operations from running concurrently without any
|
|
|
|
* modifications to the list (see write_all_supers)
|
|
|
|
*
|
btrfs: open device without device_list_mutex
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-18 03:12:27 +08:00
|
|
|
* Is not required at mount and close times, because our device list is
|
|
|
|
* protected by the uuid_mutex at that point.
|
|
|
|
*
|
2017-06-17 04:30:00 +08:00
|
|
|
* balance_mutex
|
|
|
|
* -------------
|
|
|
|
* protects balance structures (status, state) and context accessed from
|
|
|
|
* several places (internally, ioctl)
|
|
|
|
*
|
|
|
|
* chunk_mutex
|
|
|
|
* -----------
|
|
|
|
* protects chunks, adding or removing during allocation, trim or when a new
|
2019-05-09 23:11:11 +08:00
|
|
|
* device is added/removed. Additionally it also protects post_commit_list of
|
|
|
|
* individual devices, since they can be added to the transaction's
|
|
|
|
* post_commit_list only with chunk_mutex held.
|
2017-06-17 04:30:00 +08:00
|
|
|
*
|
|
|
|
* cleaner_mutex
|
|
|
|
* -------------
|
|
|
|
* a big lock that is held by the cleaner thread and prevents running subvolume
|
|
|
|
* cleaning together with relocation or delayed iputs
|
|
|
|
*
|
|
|
|
*
|
|
|
|
* Lock nesting
|
|
|
|
* ============
|
|
|
|
*
|
|
|
|
* uuid_mutex
|
2020-05-14 01:42:45 +08:00
|
|
|
* device_list_mutex
|
|
|
|
* chunk_mutex
|
|
|
|
* balance_mutex
|
2018-04-18 14:59:25 +08:00
|
|
|
*
|
|
|
|
*
|
2020-08-25 23:02:32 +08:00
|
|
|
* Exclusive operations
|
|
|
|
* ====================
|
2018-04-18 14:59:25 +08:00
|
|
|
*
|
|
|
|
* Maintains the exclusivity of the following operations that apply to the
|
|
|
|
* whole filesystem and cannot run in parallel.
|
|
|
|
*
|
|
|
|
* - Balance (*)
|
|
|
|
* - Device add
|
|
|
|
* - Device remove
|
|
|
|
* - Device replace (*)
|
|
|
|
* - Resize
|
|
|
|
*
|
|
|
|
* The device operations (as above) can be in one of the following states:
|
|
|
|
*
|
|
|
|
* - Running state
|
|
|
|
* - Paused state
|
|
|
|
* - Completed state
|
|
|
|
*
|
|
|
|
* Only device operations marked with (*) can go into the Paused state for the
|
|
|
|
* following reasons:
|
|
|
|
*
|
|
|
|
* - ioctl (only Balance can be Paused through ioctl)
|
|
|
|
* - filesystem remounted as read-only
|
|
|
|
* - filesystem unmounted and mounted as read-only
|
|
|
|
* - system power-cycle and filesystem mounted as read-only
|
|
|
|
* - filesystem or device errors leading to forced read-only
|
|
|
|
*
|
2020-08-25 23:02:32 +08:00
|
|
|
* The status of exclusive operation is set and cleared atomically.
|
|
|
|
* During the course of Paused state, fs_info::exclusive_operation remains set.
|
2018-04-18 14:59:25 +08:00
|
|
|
* A device operation in Paused or Running state can be canceled or resumed
|
|
|
|
* either by ioctl (Balance only) or when remounted as read-write.
|
2020-08-25 23:02:32 +08:00
|
|
|
* The exclusive status is cleared when the device operation is canceled or
|
2018-04-18 14:59:25 +08:00
|
|
|
* completed.
|
2017-06-17 04:30:00 +08:00
|
|
|
*/
|
|
|
|
|
2014-09-03 21:35:43 +08:00
|
|
|
DEFINE_MUTEX(uuid_mutex);
|
2008-03-25 03:02:07 +08:00
|
|
|
static LIST_HEAD(fs_uuids);
|
2019-10-02 01:57:37 +08:00
|
|
|
struct list_head * __attribute_const__ btrfs_get_fs_uuids(void)
|
2015-03-10 06:38:30 +08:00
|
|
|
{
|
|
|
|
return &fs_uuids;
|
|
|
|
}
|
2008-03-25 03:02:07 +08:00
|
|
|
|
2017-06-14 08:48:07 +08:00
|
|
|
/*
|
2023-08-23 22:52:13 +08:00
|
|
|
* Allocate new btrfs_fs_devices structure identified by a fsid.
|
|
|
|
*
|
|
|
|
* @fsid: if not NULL, copy the UUID to fs_devices::fsid and to
|
|
|
|
* fs_devices::metadata_fsid
|
2017-06-14 08:48:07 +08:00
|
|
|
*
|
|
|
|
* Return a pointer to a new struct btrfs_fs_devices on success, or ERR_PTR().
|
|
|
|
* The returned struct is not linked onto any lists and can be destroyed with
|
|
|
|
* kfree() right away.
|
|
|
|
*/
|
2023-08-23 22:52:13 +08:00
|
|
|
static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid)
|
2013-08-12 19:33:03 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devs;
|
|
|
|
|
2016-02-11 21:25:38 +08:00
|
|
|
fs_devs = kzalloc(sizeof(*fs_devs), GFP_KERNEL);
|
2013-08-12 19:33:03 +08:00
|
|
|
if (!fs_devs)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
|
|
|
mutex_init(&fs_devs->device_list_mutex);
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&fs_devs->devices);
|
|
|
|
INIT_LIST_HEAD(&fs_devs->alloc_list);
|
2018-04-12 10:29:25 +08:00
|
|
|
INIT_LIST_HEAD(&fs_devs->fs_list);
|
2020-07-16 15:25:33 +08:00
|
|
|
INIT_LIST_HEAD(&fs_devs->seed_list);
|
2013-08-12 19:33:03 +08:00
|
|
|
|
2023-05-24 20:02:36 +08:00
|
|
|
if (fsid) {
|
|
|
|
memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
|
2023-08-23 22:52:13 +08:00
|
|
|
memcpy(fs_devs->metadata_uuid, fsid, BTRFS_FSID_SIZE);
|
2023-05-24 20:02:36 +08:00
|
|
|
}
|
2018-10-30 22:43:23 +08:00
|
|
|
|
2013-08-12 19:33:03 +08:00
|
|
|
return fs_devs;
|
|
|
|
}
|
|
|
|
|
2023-04-27 01:13:01 +08:00
|
|
|
static void btrfs_free_device(struct btrfs_device *device)
|
2017-10-31 01:10:25 +08:00
|
|
|
{
|
2019-03-25 20:31:22 +08:00
|
|
|
WARN_ON(!list_empty(&device->post_commit_list));
|
2017-10-31 01:10:25 +08:00
|
|
|
rcu_string_free(device->name);
|
2023-04-27 01:13:00 +08:00
|
|
|
extent_io_tree_release(&device->alloc_state);
|
2020-11-10 19:26:07 +08:00
|
|
|
btrfs_destroy_dev_zone_info(device);
|
2017-10-31 01:10:25 +08:00
|
|
|
kfree(device);
|
|
|
|
}
|
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
|
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
2023-01-20 21:47:16 +08:00
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
WARN_ON(fs_devices->opened);
|
|
|
|
while (!list_empty(&fs_devices->devices)) {
|
|
|
|
device = list_entry(fs_devices->devices.next,
|
|
|
|
struct btrfs_device, dev_list);
|
|
|
|
list_del(&device->dev_list);
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(device);
|
2008-12-12 23:03:26 +08:00
|
|
|
}
|
|
|
|
kfree(fs_devices);
|
|
|
|
}
|
|
|
|
|
2018-02-20 00:24:15 +08:00
|
|
|
void __exit btrfs_cleanup_fs_uuids(void)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
while (!list_empty(&fs_uuids)) {
|
|
|
|
fs_devices = list_entry(fs_uuids.next,
|
2018-04-12 10:29:25 +08:00
|
|
|
struct btrfs_fs_devices, fs_list);
|
|
|
|
list_del(&fs_devices->fs_list);
|
2008-12-12 23:03:26 +08:00
|
|
|
free_fs_devices(fs_devices);
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-05-24 20:02:40 +08:00
|
|
|
static bool match_fsid_fs_devices(const struct btrfs_fs_devices *fs_devices,
|
|
|
|
const u8 *fsid, const u8 *metadata_fsid)
|
|
|
|
{
|
|
|
|
if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) != 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (!metadata_fsid)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
if (memcmp(metadata_fsid, fs_devices->metadata_uuid, BTRFS_FSID_SIZE) != 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2018-10-30 22:43:23 +08:00
|
|
|
static noinline struct btrfs_fs_devices *find_fsid(
|
|
|
|
const u8 *fsid, const u8 *metadata_fsid)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
|
2018-10-30 22:43:23 +08:00
|
|
|
ASSERT(fsid);
|
|
|
|
|
2018-10-30 22:43:27 +08:00
|
|
|
/* Handle non-split brain cases */
|
2018-04-12 10:29:25 +08:00
|
|
|
list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
|
2023-05-24 20:02:40 +08:00
|
|
|
if (match_fsid_fs_devices(fs_devices, fsid, metadata_fsid))
|
|
|
|
return fs_devices;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2012-11-12 21:03:45 +08:00
|
|
|
static int
|
2023-06-08 19:02:55 +08:00
|
|
|
btrfs_get_bdev_and_sb(const char *device_path, blk_mode_t flags, void *holder,
|
2024-01-23 21:26:36 +08:00
|
|
|
int flush, struct file **bdev_file,
|
2020-02-13 23:24:32 +08:00
|
|
|
struct btrfs_super_block **disk_super)
|
2012-11-12 21:03:45 +08:00
|
|
|
{
|
2023-09-27 17:34:26 +08:00
|
|
|
struct block_device *bdev;
|
2012-11-12 21:03:45 +08:00
|
|
|
int ret;
|
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
*bdev_file = bdev_file_open_by_path(device_path, flags, holder, NULL);
|
2012-11-12 21:03:45 +08:00
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
if (IS_ERR(*bdev_file)) {
|
|
|
|
ret = PTR_ERR(*bdev_file);
|
2024-07-18 00:58:54 +08:00
|
|
|
btrfs_err(NULL, "failed to open device for path %s with flags 0x%x: %d",
|
|
|
|
device_path, flags, ret);
|
2012-11-12 21:03:45 +08:00
|
|
|
goto error;
|
|
|
|
}
|
2024-01-23 21:26:36 +08:00
|
|
|
bdev = file_bdev(*bdev_file);
|
2012-11-12 21:03:45 +08:00
|
|
|
|
|
|
|
if (flush)
|
2023-09-27 17:34:26 +08:00
|
|
|
sync_blockdev(bdev);
|
2024-04-18 12:21:25 +08:00
|
|
|
if (holder) {
|
2024-04-18 12:34:31 +08:00
|
|
|
ret = set_blocksize(*bdev_file, BTRFS_BDEV_BLOCKSIZE);
|
2024-04-18 12:21:25 +08:00
|
|
|
if (ret) {
|
|
|
|
fput(*bdev_file);
|
|
|
|
goto error;
|
|
|
|
}
|
2012-11-12 21:03:45 +08:00
|
|
|
}
|
2023-09-27 17:34:26 +08:00
|
|
|
invalidate_bdev(bdev);
|
|
|
|
*disk_super = btrfs_read_dev_super(bdev);
|
2020-02-13 23:24:32 +08:00
|
|
|
if (IS_ERR(*disk_super)) {
|
|
|
|
ret = PTR_ERR(*disk_super);
|
2024-01-23 21:26:36 +08:00
|
|
|
fput(*bdev_file);
|
2012-11-12 21:03:45 +08:00
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
error:
|
2024-04-18 12:21:25 +08:00
|
|
|
*disk_super = NULL;
|
2024-01-23 21:26:36 +08:00
|
|
|
*bdev_file = NULL;
|
2012-11-12 21:03:45 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* Search and remove all stale devices (which are not mounted). When both
|
|
|
|
* inputs are NULL, it will search and release all stale devices.
|
2022-01-12 13:06:00 +08:00
|
|
|
*
|
2022-10-27 20:21:42 +08:00
|
|
|
* @devt: Optional. When provided will it release all unmounted devices
|
|
|
|
* matching this devt only.
|
2022-01-12 13:06:00 +08:00
|
|
|
* @skip_device: Optional. Will skip this device when searching for the stale
|
2022-10-27 20:21:42 +08:00
|
|
|
* devices.
|
2022-01-12 13:06:00 +08:00
|
|
|
*
|
|
|
|
* Return: 0 for success or if @devt is 0.
|
|
|
|
* -EBUSY if @devt is a mounted device.
|
|
|
|
* -ENOENT if @devt does not match any device in the list.
|
2018-01-18 22:00:37 +08:00
|
|
|
*/
|
2022-01-12 13:06:00 +08:00
|
|
|
static int btrfs_free_stale_devices(dev_t devt, struct btrfs_device *skip_device)
|
2015-06-17 21:10:48 +08:00
|
|
|
{
|
2018-05-29 15:33:08 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices, *tmp_fs_devices;
|
|
|
|
struct btrfs_device *device, *tmp_device;
|
2023-09-09 00:31:55 +08:00
|
|
|
int ret;
|
|
|
|
bool freed = false;
|
2019-01-04 13:31:53 +08:00
|
|
|
|
2021-08-31 09:21:28 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
|
|
|
|
2023-09-09 00:31:55 +08:00
|
|
|
/* Return good status if there is no instance of devt. */
|
|
|
|
ret = 0;
|
2018-05-29 15:33:08 +08:00
|
|
|
list_for_each_entry_safe(fs_devices, tmp_fs_devices, &fs_uuids, fs_list) {
|
2015-06-17 21:10:48 +08:00
|
|
|
|
2019-01-04 13:31:53 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2018-05-29 15:33:08 +08:00
|
|
|
list_for_each_entry_safe(device, tmp_device,
|
|
|
|
&fs_devices->devices, dev_list) {
|
|
|
|
if (skip_device && skip_device == device)
|
2018-01-18 22:00:37 +08:00
|
|
|
continue;
|
2022-01-12 13:06:02 +08:00
|
|
|
if (devt && devt != device->devt)
|
2018-01-18 22:00:34 +08:00
|
|
|
continue;
|
2019-01-04 13:31:53 +08:00
|
|
|
if (fs_devices->opened) {
|
2023-09-09 00:31:55 +08:00
|
|
|
if (devt)
|
2019-01-04 13:31:53 +08:00
|
|
|
ret = -EBUSY;
|
|
|
|
break;
|
|
|
|
}
|
2015-06-17 21:10:48 +08:00
|
|
|
|
|
|
|
/* delete the stale device */
|
2018-05-29 17:23:20 +08:00
|
|
|
fs_devices->num_devices--;
|
|
|
|
list_del(&device->dev_list);
|
|
|
|
btrfs_free_device(device);
|
|
|
|
|
2023-09-09 00:31:55 +08:00
|
|
|
freed = true;
|
2018-05-29 17:23:20 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2019-01-04 13:31:53 +08:00
|
|
|
|
2018-05-29 17:23:20 +08:00
|
|
|
if (fs_devices->num_devices == 0) {
|
|
|
|
btrfs_sysfs_remove_fsid(fs_devices);
|
|
|
|
list_del(&fs_devices->fs_list);
|
|
|
|
free_fs_devices(fs_devices);
|
2015-06-17 21:10:48 +08:00
|
|
|
}
|
|
|
|
}
|
2019-01-04 13:31:53 +08:00
|
|
|
|
2023-09-09 00:31:55 +08:00
|
|
|
/* If there is at least one freed device return 0. */
|
|
|
|
if (freed)
|
|
|
|
return 0;
|
|
|
|
|
2019-01-04 13:31:53 +08:00
|
|
|
return ret;
|
2015-06-17 21:10:48 +08:00
|
|
|
}
|
|
|
|
|
2023-09-28 09:09:46 +08:00
|
|
|
static struct btrfs_fs_devices *find_fsid_by_device(
|
2023-09-28 09:09:47 +08:00
|
|
|
struct btrfs_super_block *disk_super,
|
|
|
|
dev_t devt, bool *same_fsid_diff_dev)
|
2023-09-28 09:09:46 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fsid_fs_devices;
|
2023-09-28 09:09:47 +08:00
|
|
|
struct btrfs_fs_devices *devt_fs_devices;
|
2023-09-28 09:09:46 +08:00
|
|
|
const bool has_metadata_uuid = (btrfs_super_incompat_flags(disk_super) &
|
|
|
|
BTRFS_FEATURE_INCOMPAT_METADATA_UUID);
|
2023-09-28 09:09:47 +08:00
|
|
|
bool found_by_devt = false;
|
2023-09-28 09:09:46 +08:00
|
|
|
|
|
|
|
/* Find the fs_device by the usual method, if found use it. */
|
|
|
|
fsid_fs_devices = find_fsid(disk_super->fsid,
|
|
|
|
has_metadata_uuid ? disk_super->metadata_uuid : NULL);
|
|
|
|
|
2023-09-28 09:09:47 +08:00
|
|
|
/* The temp_fsid feature is supported only with single device filesystem. */
|
|
|
|
if (btrfs_super_num_devices(disk_super) != 1)
|
|
|
|
return fsid_fs_devices;
|
|
|
|
|
2023-10-04 23:00:25 +08:00
|
|
|
/*
|
|
|
|
* A seed device is an integral component of the sprout device, which
|
|
|
|
* functions as a multi-device filesystem. So, temp-fsid feature is
|
|
|
|
* not supported.
|
|
|
|
*/
|
|
|
|
if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING)
|
|
|
|
return fsid_fs_devices;
|
|
|
|
|
2023-09-28 09:09:47 +08:00
|
|
|
/* Try to find a fs_devices by matching devt. */
|
|
|
|
list_for_each_entry(devt_fs_devices, &fs_uuids, fs_list) {
|
|
|
|
struct btrfs_device *device;
|
|
|
|
|
|
|
|
list_for_each_entry(device, &devt_fs_devices->devices, dev_list) {
|
|
|
|
if (device->devt == devt) {
|
|
|
|
found_by_devt = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (found_by_devt)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (found_by_devt) {
|
|
|
|
/* Existing device. */
|
|
|
|
if (fsid_fs_devices == NULL) {
|
|
|
|
if (devt_fs_devices->opened == 0) {
|
|
|
|
/* Stale device. */
|
|
|
|
return NULL;
|
|
|
|
} else {
|
|
|
|
/* temp_fsid is mounting a subvol. */
|
|
|
|
return devt_fs_devices;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/* Regular or temp_fsid device mounting a subvol. */
|
|
|
|
return devt_fs_devices;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/* New device. */
|
|
|
|
if (fsid_fs_devices == NULL) {
|
|
|
|
return NULL;
|
|
|
|
} else {
|
|
|
|
/* sb::fsid is already used create a new temp_fsid. */
|
|
|
|
*same_fsid_diff_dev = true;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Not reached. */
|
2023-09-28 09:09:46 +08:00
|
|
|
}
|
|
|
|
|
btrfs: open device without device_list_mutex
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-18 03:12:27 +08:00
|
|
|
/*
|
|
|
|
* This is only used on mount, and we are protected from competing things
|
|
|
|
* messing with our fs_devices by the uuid_mutex, thus we do not need the
|
|
|
|
* fs_devices->device_list_mutex here.
|
|
|
|
*/
|
2017-11-09 23:45:24 +08:00
|
|
|
static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
|
2023-06-08 19:02:55 +08:00
|
|
|
struct btrfs_device *device, blk_mode_t flags,
|
2017-11-09 23:45:24 +08:00
|
|
|
void *holder)
|
|
|
|
{
|
2024-01-23 21:26:36 +08:00
|
|
|
struct file *bdev_file;
|
2017-11-09 23:45:24 +08:00
|
|
|
struct btrfs_super_block *disk_super;
|
|
|
|
u64 devid;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (device->bdev)
|
|
|
|
return -EINVAL;
|
|
|
|
if (!device->name)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
ret = btrfs_get_bdev_and_sb(device->name->str, flags, holder, 1,
|
2024-01-23 21:26:36 +08:00
|
|
|
&bdev_file, &disk_super);
|
2017-11-09 23:45:24 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
devid = btrfs_stack_device_id(&disk_super->dev_item);
|
|
|
|
if (devid != device->devid)
|
2020-02-13 23:24:32 +08:00
|
|
|
goto error_free_page;
|
2017-11-09 23:45:24 +08:00
|
|
|
|
|
|
|
if (memcmp(device->uuid, disk_super->dev_item.uuid, BTRFS_UUID_SIZE))
|
2020-02-13 23:24:32 +08:00
|
|
|
goto error_free_page;
|
2017-11-09 23:45:24 +08:00
|
|
|
|
|
|
|
device->generation = btrfs_super_generation(disk_super);
|
|
|
|
|
|
|
|
if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
|
2018-10-30 22:43:23 +08:00
|
|
|
if (btrfs_super_incompat_flags(disk_super) &
|
|
|
|
BTRFS_FEATURE_INCOMPAT_METADATA_UUID) {
|
|
|
|
pr_err(
|
|
|
|
"BTRFS: Invalid seeding and uuid-changed device detected\n");
|
2020-02-13 23:24:32 +08:00
|
|
|
goto error_free_page;
|
2018-10-30 22:43:23 +08:00
|
|
|
}
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
2019-11-13 18:27:27 +08:00
|
|
|
fs_devices->seeding = true;
|
2017-11-09 23:45:24 +08:00
|
|
|
} else {
|
2024-01-23 21:26:36 +08:00
|
|
|
if (bdev_read_only(file_bdev(bdev_file)))
|
2017-12-04 12:54:52 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
|
|
|
else
|
|
|
|
set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
2017-11-09 23:45:24 +08:00
|
|
|
}
|
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
if (!bdev_nonrot(file_bdev(bdev_file)))
|
2019-11-13 18:27:28 +08:00
|
|
|
fs_devices->rotating = true;
|
2017-11-09 23:45:24 +08:00
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
if (bdev_max_discard_sectors(file_bdev(bdev_file)))
|
2022-07-27 02:54:10 +08:00
|
|
|
fs_devices->discardable = true;
|
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
device->bdev_file = bdev_file;
|
|
|
|
device->bdev = file_bdev(bdev_file);
|
2017-12-04 12:54:53 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
|
2017-11-09 23:45:24 +08:00
|
|
|
|
2024-03-01 08:42:13 +08:00
|
|
|
if (device->devt != device->bdev->bd_dev) {
|
|
|
|
btrfs_warn(NULL,
|
|
|
|
"device %s maj:min changed from %d:%d to %d:%d",
|
|
|
|
device->name->str, MAJOR(device->devt),
|
|
|
|
MINOR(device->devt), MAJOR(device->bdev->bd_dev),
|
|
|
|
MINOR(device->bdev->bd_dev));
|
|
|
|
|
|
|
|
device->devt = device->bdev->bd_dev;
|
|
|
|
}
|
|
|
|
|
2017-11-09 23:45:24 +08:00
|
|
|
fs_devices->open_devices++;
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
|
|
|
|
device->devid != BTRFS_DEV_REPLACE_DEVID) {
|
2017-11-09 23:45:24 +08:00
|
|
|
fs_devices->rw_devices++;
|
2018-01-23 06:49:37 +08:00
|
|
|
list_add_tail(&device->dev_alloc_list, &fs_devices->alloc_list);
|
2017-11-09 23:45:24 +08:00
|
|
|
}
|
2020-02-13 23:24:32 +08:00
|
|
|
btrfs_release_disk_super(disk_super);
|
2017-11-09 23:45:24 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
2020-02-13 23:24:32 +08:00
|
|
|
error_free_page:
|
|
|
|
btrfs_release_disk_super(disk_super);
|
2024-01-23 21:26:36 +08:00
|
|
|
fput(bdev_file);
|
2017-11-09 23:45:24 +08:00
|
|
|
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2024-05-31 01:14:12 +08:00
|
|
|
const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb)
|
2023-07-31 19:16:32 +08:00
|
|
|
{
|
|
|
|
bool has_metadata_uuid = (btrfs_super_incompat_flags(sb) &
|
|
|
|
BTRFS_FEATURE_INCOMPAT_METADATA_UUID);
|
|
|
|
|
|
|
|
return has_metadata_uuid ? sb->metadata_uuid : sb->fsid;
|
|
|
|
}
|
|
|
|
|
2014-03-27 01:26:36 +08:00
|
|
|
/*
|
|
|
|
* Add new device to list of registered devices
|
|
|
|
*
|
|
|
|
* Returns:
|
2018-01-18 22:02:35 +08:00
|
|
|
* device pointer which was just added or updated when successful
|
|
|
|
* error pointer when failed
|
2014-03-27 01:26:36 +08:00
|
|
|
*/
|
2018-01-18 22:02:35 +08:00
|
|
|
static noinline struct btrfs_device *device_list_add(const char *path,
|
2018-05-29 12:28:37 +08:00
|
|
|
struct btrfs_super_block *disk_super,
|
|
|
|
bool *new_device_added)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
2018-10-30 22:43:27 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = NULL;
|
2012-06-05 02:03:51 +08:00
|
|
|
struct rcu_string *name;
|
2008-03-25 03:02:07 +08:00
|
|
|
u64 found_transid = btrfs_super_generation(disk_super);
|
2018-01-18 22:02:36 +08:00
|
|
|
u64 devid = btrfs_stack_device_id(&disk_super->dev_item);
|
2022-01-12 13:06:01 +08:00
|
|
|
dev_t path_devt;
|
|
|
|
int error;
|
2023-09-28 09:09:47 +08:00
|
|
|
bool same_fsid_diff_dev = false;
|
2018-10-30 22:43:23 +08:00
|
|
|
bool has_metadata_uuid = (btrfs_super_incompat_flags(disk_super) &
|
|
|
|
BTRFS_FEATURE_INCOMPAT_METADATA_UUID);
|
|
|
|
|
2023-09-21 05:51:14 +08:00
|
|
|
if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_CHANGING_FSID_V2) {
|
btrfs: reject devices with CHANGING_FSID_V2
The BTRFS_SUPER_FLAG_CHANGING_FSID_V2 flag indicates a transient state
where the device in the userspace btrfstune -m|-M operation failed to
complete changing the fsid.
This flag makes the kernel to automatically determine the other
partner devices to which a given device can be associated, based on the
fsid, metadata_uuid and generation values.
btrfstune -m|M feature is especially useful in virtual cloud setups, where
compute instances (disk images) are quickly copied, fsid changed, and
launched. Given numerous disk images with the same metadata_uuid but
different fsid, there's no clear way a device can be correctly assembled
with the proper partners when the CHANGING_FSID_V2 flag is set. So, the
disk could be assembled incorrectly, as in the example below:
Before this patch:
Consider the following two filesystems:
/dev/loop[2-3] are raw copies of /dev/loop[0-1] and the btrsftune -m
operation fails.
In this scenario, as the /dev/loop0's fsid change is interrupted, and the
CHANGING_FSID_V2 flag is set as shown below.
$ p="device|devid|^metadata_uuid|^fsid|^incom|^generation|^flags"
$ btrfs inspect dump-super /dev/loop0 | egrep '$p'
superblock: bytenr=65536, device=/dev/loop0
flags 0x1000000001
fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b
generation 9
num_devices 2
incompat_flags 0x741
dev_item.devid 1
$ btrfs inspect dump-super /dev/loop1 | egrep '$p'
superblock: bytenr=65536, device=/dev/loop1
flags 0x1
fsid 11d2af4d-1b71-45a9-83f6-f2100766939d
metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b
generation 10
num_devices 2
incompat_flags 0x741
dev_item.devid 2
$ btrfs inspect dump-super /dev/loop2 | egrep '$p'
superblock: bytenr=65536, device=/dev/loop2
flags 0x1
fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b
generation 8
num_devices 2
incompat_flags 0x741
dev_item.devid 1
$ btrfs inspect dump-super /dev/loop3 | egrep '$p'
superblock: bytenr=65536, device=/dev/loop3
flags 0x1
fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b
generation 8
num_devices 2
incompat_flags 0x741
dev_item.devid 2
It is normal that some devices aren't instantly discovered during
system boot or iSCSI discovery. The controlled scan below demonstrates
this.
$ btrfs device scan --forget
$ btrfs device scan /dev/loop0
Scanning for btrfs filesystems on '/dev/loop0'
$ mount /dev/loop3 /btrfs
$ btrfs filesystem show -m
Label: none uuid: 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
Total devices 2 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 48.00MiB path /dev/loop0
devid 2 size 300.00MiB used 40.00MiB path /dev/loop3
/dev/loop0 and /dev/loop3 are incorrectly partnered.
This kernel patch removes functions and code connected to the
CHANGING_FSID_V2 flag.
With this patch, now devices with the CHANGING_FSID_V2 flag are rejected.
And its partner will fail to mount with the extra -o degraded option.
The check is removed from open_ctree(), devices are rejected during
scanning which in turn fails the mount.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-21 05:51:13 +08:00
|
|
|
btrfs_err(NULL,
|
|
|
|
"device %s has incomplete metadata_uuid change, please use btrfstune to complete",
|
|
|
|
path);
|
|
|
|
return ERR_PTR(-EAGAIN);
|
|
|
|
}
|
2018-10-30 22:43:23 +08:00
|
|
|
|
2022-01-12 13:06:01 +08:00
|
|
|
error = lookup_bdev(path, &path_devt);
|
2022-12-12 10:19:37 +08:00
|
|
|
if (error) {
|
|
|
|
btrfs_err(NULL, "failed to lookup block device for path %s: %d",
|
|
|
|
path, error);
|
2022-01-12 13:06:01 +08:00
|
|
|
return ERR_PTR(error);
|
2022-12-12 10:19:37 +08:00
|
|
|
}
|
2022-01-12 13:06:01 +08:00
|
|
|
|
2023-09-28 09:09:47 +08:00
|
|
|
fs_devices = find_fsid_by_device(disk_super, path_devt, &same_fsid_diff_dev);
|
2008-03-25 03:02:07 +08:00
|
|
|
|
|
|
|
if (!fs_devices) {
|
2023-08-23 22:52:13 +08:00
|
|
|
fs_devices = alloc_fs_devices(disk_super->fsid);
|
2023-10-30 19:54:23 +08:00
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return ERR_CAST(fs_devices);
|
|
|
|
|
2023-08-23 22:52:13 +08:00
|
|
|
if (has_metadata_uuid)
|
|
|
|
memcpy(fs_devices->metadata_uuid,
|
|
|
|
disk_super->metadata_uuid, BTRFS_FSID_SIZE);
|
|
|
|
|
2023-09-28 09:09:47 +08:00
|
|
|
if (same_fsid_diff_dev) {
|
|
|
|
generate_random_uuid(fs_devices->fsid);
|
|
|
|
fs_devices->temp_fsid = true;
|
2024-02-25 14:58:31 +08:00
|
|
|
pr_info("BTRFS: device %s (%d:%d) using temp-fsid %pU\n",
|
|
|
|
path, MAJOR(path_devt), MINOR(path_devt),
|
|
|
|
fs_devices->fsid);
|
2023-09-28 09:09:47 +08:00
|
|
|
}
|
2019-01-27 12:58:00 +08:00
|
|
|
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2018-04-12 10:29:25 +08:00
|
|
|
list_add(&fs_devices->fs_list, &fs_uuids);
|
2013-08-12 19:33:03 +08:00
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
device = NULL;
|
|
|
|
} else {
|
2021-10-06 04:12:42 +08:00
|
|
|
struct btrfs_dev_lookup_args args = {
|
|
|
|
.devid = devid,
|
|
|
|
.uuid = disk_super->dev_item.uuid,
|
|
|
|
};
|
|
|
|
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2021-10-06 04:12:42 +08:00
|
|
|
device = btrfs_find_device(fs_devices, &args);
|
2018-10-30 22:43:27 +08:00
|
|
|
|
2023-09-21 05:51:14 +08:00
|
|
|
if (found_transid > fs_devices->latest_generation) {
|
2018-10-30 22:43:27 +08:00
|
|
|
memcpy(fs_devices->fsid, disk_super->fsid,
|
|
|
|
BTRFS_FSID_SIZE);
|
2023-07-31 19:16:33 +08:00
|
|
|
memcpy(fs_devices->metadata_uuid,
|
|
|
|
btrfs_sb_fsid_ptr(disk_super), BTRFS_FSID_SIZE);
|
2018-10-30 22:43:27 +08:00
|
|
|
}
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2014-07-24 11:37:15 +08:00
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
if (!device) {
|
2022-11-07 23:07:17 +08:00
|
|
|
unsigned int nofs_flag;
|
|
|
|
|
2018-05-29 14:10:20 +08:00
|
|
|
if (fs_devices->opened) {
|
2022-12-12 10:19:37 +08:00
|
|
|
btrfs_err(NULL,
|
2024-02-25 14:58:31 +08:00
|
|
|
"device %s (%d:%d) belongs to fsid %pU, and the fs is already mounted, scanned by %s (%d)",
|
|
|
|
path, MAJOR(path_devt), MINOR(path_devt),
|
|
|
|
fs_devices->fsid, current->comm,
|
2023-07-27 21:53:03 +08:00
|
|
|
task_pid_nr(current));
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2018-01-18 22:02:35 +08:00
|
|
|
return ERR_PTR(-EBUSY);
|
2018-05-29 14:10:20 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2022-11-07 23:07:17 +08:00
|
|
|
nofs_flag = memalloc_nofs_save();
|
2013-08-23 18:20:17 +08:00
|
|
|
device = btrfs_alloc_device(NULL, &devid,
|
2022-11-07 23:07:17 +08:00
|
|
|
disk_super->dev_item.uuid, path);
|
|
|
|
memalloc_nofs_restore(nofs_flag);
|
2013-08-23 18:20:17 +08:00
|
|
|
if (IS_ERR(device)) {
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2008-03-25 03:02:07 +08:00
|
|
|
/* we can safely leave the fs_devices entry around */
|
2018-01-18 22:02:35 +08:00
|
|
|
return device;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2012-06-05 02:03:51 +08:00
|
|
|
|
2022-01-12 13:06:01 +08:00
|
|
|
device->devt = path_devt;
|
2011-05-23 20:30:00 +08:00
|
|
|
|
2011-04-20 18:09:16 +08:00
|
|
|
list_add_rcu(&device->dev_list, &fs_devices->devices);
|
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
number of devices before acquiring the device list mutex.
This could lead to inconsistent results because the update of
the device list and the number of devices counter (amongst other
counters related to the device list) are updated in volumes.c
while holding the device list mutex - except for 2 places, one
was volumes.c:btrfs_prepare_sprout() and the other was
volumes.c:device_list_add().
For example, if we have 2 devices, with IDs 1 and 2 and then add
a new device, with ID 3, and while adding the device is in progress
an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
devices of 2 and a max dev id of 3. This would be incorrect.
Also, this ioctl handler was reading the fsid while it can be
updated concurrently. This can happen when while a new device is
being added and the current filesystem is in seeding mode.
Example:
$ mkfs.btrfs -f /dev/sdb1
$ mkfs.btrfs -f /dev/sdb2
$ btrfstune -S 1 /dev/sdb1
$ mount /dev/sdb1 /mnt/test
$ btrfs device add /dev/sdb2 /mnt/test
If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
could read an fsid that was never valid (some bits part of the old
fsid and others part of the new fsid). Also, it could read a number
of devices that doesn't match the number of devices in the list and
the max device id, as explained before.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-13 03:56:58 +08:00
|
|
|
fs_devices->num_devices++;
|
2009-06-11 03:17:02 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
device->fs_devices = fs_devices;
|
2018-05-29 12:28:37 +08:00
|
|
|
*new_device_added = true;
|
2018-01-18 22:02:33 +08:00
|
|
|
|
|
|
|
if (disk_super->label[0])
|
2019-10-02 18:30:48 +08:00
|
|
|
pr_info(
|
2024-02-25 14:58:31 +08:00
|
|
|
"BTRFS: device label %s devid %llu transid %llu %s (%d:%d) scanned by %s (%d)\n",
|
2019-10-02 18:30:48 +08:00
|
|
|
disk_super->label, devid, found_transid, path,
|
2024-02-25 14:58:31 +08:00
|
|
|
MAJOR(path_devt), MINOR(path_devt),
|
2019-10-02 18:30:48 +08:00
|
|
|
current->comm, task_pid_nr(current));
|
2018-01-18 22:02:33 +08:00
|
|
|
else
|
2019-10-02 18:30:48 +08:00
|
|
|
pr_info(
|
2024-02-25 14:58:31 +08:00
|
|
|
"BTRFS: device fsid %pU devid %llu transid %llu %s (%d:%d) scanned by %s (%d)\n",
|
2019-10-02 18:30:48 +08:00
|
|
|
disk_super->fsid, devid, found_transid, path,
|
2024-02-25 14:58:31 +08:00
|
|
|
MAJOR(path_devt), MINOR(path_devt),
|
2019-10-02 18:30:48 +08:00
|
|
|
current->comm, task_pid_nr(current));
|
2018-01-18 22:02:33 +08:00
|
|
|
|
2012-06-05 02:03:51 +08:00
|
|
|
} else if (!device->name || strcmp(device->name->str, path)) {
|
2014-07-03 18:22:05 +08:00
|
|
|
/*
|
|
|
|
* When FS is already mounted.
|
|
|
|
* 1. If you are here and if the device->name is NULL that
|
|
|
|
* means this device was missing at time of FS mount.
|
|
|
|
* 2. If you are here and if the device->name is different
|
|
|
|
* from 'path' that means either
|
|
|
|
* a. The same device disappeared and reappeared with
|
|
|
|
* different name. or
|
|
|
|
* b. The missing-disk-which-was-replaced, has
|
|
|
|
* reappeared now.
|
|
|
|
*
|
|
|
|
* We must allow 1 and 2a above. But 2b would be a spurious
|
|
|
|
* and unintentional.
|
|
|
|
*
|
|
|
|
* Further in case of 1 and 2a above, the disk at 'path'
|
|
|
|
* would have missed some transaction when it was away and
|
|
|
|
* in case of 2a the stale bdev has to be updated as well.
|
|
|
|
* 2b must not be allowed at all time.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
2014-09-18 22:49:05 +08:00
|
|
|
* For now, we do allow update to btrfs_fs_device through the
|
|
|
|
* btrfs dev scan cli after FS has been mounted. We're still
|
|
|
|
* tracking a problem where systems fail mount by subvolume id
|
|
|
|
* when we reject replacement on a mounted FS.
|
2014-07-03 18:22:05 +08:00
|
|
|
*/
|
2014-09-18 22:49:05 +08:00
|
|
|
if (!fs_devices->opened && found_transid < device->generation) {
|
2014-07-03 18:22:06 +08:00
|
|
|
/*
|
|
|
|
* That is if the FS is _not_ mounted and if you
|
|
|
|
* are here, that means there is more than one
|
|
|
|
* disk with same uuid and devid.We keep the one
|
|
|
|
* with larger generation number or the last-in if
|
|
|
|
* generation are equal.
|
|
|
|
*/
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2022-12-12 10:19:37 +08:00
|
|
|
btrfs_err(NULL,
|
|
|
|
"device %s already registered with a higher generation, found %llu expect %llu",
|
|
|
|
path, found_transid, device->generation);
|
2018-01-18 22:02:35 +08:00
|
|
|
return ERR_PTR(-EEXIST);
|
2014-07-03 18:22:06 +08:00
|
|
|
}
|
2014-07-03 18:22:05 +08:00
|
|
|
|
2018-10-15 10:45:17 +08:00
|
|
|
/*
|
|
|
|
* We are going to replace the device path for a given devid,
|
|
|
|
* make sure it's the same device if the device is mounted
|
2022-03-03 22:40:27 +08:00
|
|
|
*
|
|
|
|
* NOTE: the device->fs_info may not be reliable here so pass
|
|
|
|
* in a NULL to message helpers instead. This avoids a possible
|
|
|
|
* use-after-free when the fs_info and fs_info->sb are already
|
|
|
|
* torn down.
|
2018-10-15 10:45:17 +08:00
|
|
|
*/
|
|
|
|
if (device->bdev) {
|
2022-01-12 13:06:01 +08:00
|
|
|
if (device->devt != path_devt) {
|
2018-10-15 10:45:17 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2020-11-18 17:03:26 +08:00
|
|
|
btrfs_warn_in_rcu(NULL,
|
2020-09-03 21:30:12 +08:00
|
|
|
"duplicate device %s devid %llu generation %llu scanned by %s (%d)",
|
|
|
|
path, devid, found_transid,
|
|
|
|
current->comm,
|
|
|
|
task_pid_nr(current));
|
2018-10-15 10:45:17 +08:00
|
|
|
return ERR_PTR(-EEXIST);
|
|
|
|
}
|
2022-03-03 22:40:27 +08:00
|
|
|
btrfs_info_in_rcu(NULL,
|
2020-09-03 21:30:12 +08:00
|
|
|
"devid %llu device path %s changed to %s scanned by %s (%d)",
|
2022-11-13 09:32:07 +08:00
|
|
|
devid, btrfs_dev_name(device),
|
2020-09-03 21:30:12 +08:00
|
|
|
path, current->comm,
|
|
|
|
task_pid_nr(current));
|
2018-10-15 10:45:17 +08:00
|
|
|
}
|
|
|
|
|
2012-06-05 02:03:51 +08:00
|
|
|
name = rcu_string_strdup(path, GFP_NOFS);
|
2018-05-29 14:10:20 +08:00
|
|
|
if (!name) {
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2018-01-18 22:02:35 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2018-05-29 14:10:20 +08:00
|
|
|
}
|
2012-06-05 02:03:51 +08:00
|
|
|
rcu_string_free(device->name);
|
|
|
|
rcu_assign_pointer(device->name, name);
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
|
2010-12-14 03:56:23 +08:00
|
|
|
fs_devices->missing_devices--;
|
2017-12-04 12:54:54 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
|
2010-12-14 03:56:23 +08:00
|
|
|
}
|
2022-01-12 13:06:01 +08:00
|
|
|
device->devt = path_devt;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
|
|
|
|
2014-07-03 18:22:06 +08:00
|
|
|
/*
|
|
|
|
* Unmount does not free the btrfs_device struct but would zero
|
|
|
|
* generation along with most of the other members. So just update
|
|
|
|
* it back. We need it to pick the disk with largest generation
|
|
|
|
* (as above).
|
|
|
|
*/
|
2018-10-30 22:43:26 +08:00
|
|
|
if (!fs_devices->opened) {
|
2014-07-03 18:22:06 +08:00
|
|
|
device->generation = found_transid;
|
2018-10-30 22:43:26 +08:00
|
|
|
fs_devices->latest_generation = max_t(u64, found_transid,
|
|
|
|
fs_devices->latest_generation);
|
|
|
|
}
|
2014-07-03 18:22:06 +08:00
|
|
|
|
2018-01-18 22:02:34 +08:00
|
|
|
fs_devices->total_devices = btrfs_super_num_devices(disk_super);
|
|
|
|
|
2018-05-29 14:10:20 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2018-01-18 22:02:35 +08:00
|
|
|
return device;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_device *orig_dev;
|
2019-08-27 15:40:45 +08:00
|
|
|
int ret = 0;
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2021-08-31 09:21:28 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
|
|
|
|
2023-08-23 22:52:13 +08:00
|
|
|
fs_devices = alloc_fs_devices(orig->fsid);
|
2013-08-12 19:33:03 +08:00
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return fs_devices;
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2012-06-22 04:03:58 +08:00
|
|
|
fs_devices->total_devices = orig->total_devices;
|
2008-12-12 23:03:26 +08:00
|
|
|
|
|
|
|
list_for_each_entry(orig_dev, &orig->devices, dev_list) {
|
2022-11-07 23:07:17 +08:00
|
|
|
const char *dev_path = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is ok to do without RCU read locked because we hold the
|
|
|
|
* uuid mutex so nothing we touch in here is going to disappear.
|
|
|
|
*/
|
|
|
|
if (orig_dev->name)
|
|
|
|
dev_path = orig_dev->name->str;
|
2012-06-05 02:03:51 +08:00
|
|
|
|
2013-08-23 18:20:17 +08:00
|
|
|
device = btrfs_alloc_device(NULL, &orig_dev->devid,
|
2022-11-07 23:07:17 +08:00
|
|
|
orig_dev->uuid, dev_path);
|
2019-08-27 15:40:45 +08:00
|
|
|
if (IS_ERR(device)) {
|
|
|
|
ret = PTR_ERR(device);
|
2008-12-12 23:03:26 +08:00
|
|
|
goto error;
|
2019-08-27 15:40:45 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2022-11-04 22:12:33 +08:00
|
|
|
if (orig_dev->zone_info) {
|
|
|
|
struct btrfs_zoned_device_info *zone_info;
|
|
|
|
|
|
|
|
zone_info = btrfs_clone_dev_zone_info(orig_dev);
|
|
|
|
if (!zone_info) {
|
|
|
|
btrfs_free_device(device);
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
device->zone_info = zone_info;
|
|
|
|
}
|
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
list_add(&device->dev_list, &fs_devices->devices);
|
|
|
|
device->fs_devices = fs_devices;
|
|
|
|
fs_devices->num_devices++;
|
|
|
|
}
|
|
|
|
return fs_devices;
|
|
|
|
error:
|
|
|
|
free_fs_devices(fs_devices);
|
2019-08-27 15:40:45 +08:00
|
|
|
return ERR_PTR(ret);
|
2008-12-12 23:03:26 +08:00
|
|
|
}
|
|
|
|
|
2020-07-16 15:17:04 +08:00
|
|
|
static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
|
2020-11-06 16:06:33 +08:00
|
|
|
struct btrfs_device **latest_dev)
|
2008-05-14 01:46:40 +08:00
|
|
|
{
|
2009-01-21 23:59:08 +08:00
|
|
|
struct btrfs_device *device, *next;
|
2012-02-21 09:53:43 +08:00
|
|
|
|
2011-04-20 18:08:47 +08:00
|
|
|
/* This is the initialized path, it is safe to release the devices. */
|
2009-01-21 23:59:08 +08:00
|
|
|
list_for_each_entry_safe(device, next, &fs_devices->devices, dev_list) {
|
2020-07-16 15:17:04 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state)) {
|
2017-12-04 12:54:55 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
|
2020-07-16 15:17:04 +08:00
|
|
|
&device->dev_state) &&
|
2020-05-05 02:58:25 +08:00
|
|
|
!test_bit(BTRFS_DEV_STATE_MISSING,
|
|
|
|
&device->dev_state) &&
|
2020-07-16 15:17:04 +08:00
|
|
|
(!*latest_dev ||
|
|
|
|
device->generation > (*latest_dev)->generation)) {
|
|
|
|
*latest_dev = device;
|
2012-02-21 09:53:43 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
continue;
|
2012-02-21 09:53:43 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2020-10-30 06:53:56 +08:00
|
|
|
/*
|
|
|
|
* We have already validated the presence of BTRFS_DEV_REPLACE_DEVID,
|
|
|
|
* in btrfs_init_dev_replace() so just continue.
|
|
|
|
*/
|
|
|
|
if (device->devid == BTRFS_DEV_REPLACE_DEVID)
|
|
|
|
continue;
|
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
if (device->bdev_file) {
|
|
|
|
fput(device->bdev_file);
|
2008-11-18 10:11:30 +08:00
|
|
|
device->bdev = NULL;
|
2024-01-23 21:26:36 +08:00
|
|
|
device->bdev_file = NULL;
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->open_devices--;
|
|
|
|
}
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2008-11-18 10:11:30 +08:00
|
|
|
list_del_init(&device->dev_alloc_list);
|
2017-12-04 12:54:52 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
btrfs: fix rw device counting in __btrfs_free_extra_devids
When removing a writeable device in __btrfs_free_extra_devids, the rw
device count should be decremented.
This error was caught by Syzbot which reported a warning in
close_fs_devices:
WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
Modules linked in:
CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
FS: 00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
btrfs_fill_super fs/btrfs/super.c:1382 [inline]
btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1498
fc_mount fs/namespace.c:993 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0x196f/0x2be0 fs/namespace.c:3235
do_mount fs/namespace.c:3248 [inline]
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
Because fs_devices->rw_devices was not 0 after
closing all devices. Here is the call trace that was observed:
btrfs_mount_root():
btrfs_scan_one_device():
device_list_add(); <---------------- device added
btrfs_open_devices():
open_fs_devices():
btrfs_open_one_device(); <-------- writable device opened,
rw device count ++
btrfs_fill_super():
open_ctree():
btrfs_free_extra_devids():
__btrfs_free_extra_devids(); <--- writable device removed,
rw device count not decremented
fail_tree_roots:
btrfs_close_devices():
close_fs_devices(); <------- rw device count off by 1
As a note, prior to commit cf89af146b7e ("btrfs: dev-replace: fail
mount if we don't have replace item with target device"), rw_devices
was decremented on removing a writable device in
__btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
was not set for the device. However, this check does not need to be
reinstated as it is now redundant and incorrect.
In __btrfs_free_extra_devids, we skip removing the device if it is the
target for replacement. This is done by checking whether device->devid
== BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
and so it's redundant to test for that bit.
Additionally, following commit 82372bc816d7 ("Btrfs: make
the logic of source device removing more clear"), rw_devices is
incremented whenever a writeable device is added to the alloc
list (including the target device in btrfs_dev_replace_finishing), so
all removals of writable devices from the alloc list should also be
accompanied by a decrement to rw_devices.
Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
CC: stable@vger.kernel.org # 5.10+
Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-27 15:13:03 +08:00
|
|
|
fs_devices->rw_devices--;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
list_del_init(&device->dev_list);
|
|
|
|
fs_devices->num_devices--;
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(device);
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2020-07-16 15:17:04 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After we have read the system tree and know devids belonging to this
|
|
|
|
* filesystem, remove the device which does not belong there.
|
|
|
|
*/
|
2020-11-06 16:06:33 +08:00
|
|
|
void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices)
|
2020-07-16 15:17:04 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *latest_dev = NULL;
|
2020-07-16 15:25:33 +08:00
|
|
|
struct btrfs_fs_devices *seed_dev;
|
2020-07-16 15:17:04 +08:00
|
|
|
|
|
|
|
mutex_lock(&uuid_mutex);
|
2020-11-06 16:06:33 +08:00
|
|
|
__btrfs_free_extra_devids(fs_devices, &latest_dev);
|
2020-07-16 15:25:33 +08:00
|
|
|
|
|
|
|
list_for_each_entry(seed_dev, &fs_devices->seed_list, seed_list)
|
2020-11-06 16:06:33 +08:00
|
|
|
__btrfs_free_extra_devids(seed_dev, &latest_dev);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2021-08-24 13:05:19 +08:00
|
|
|
fs_devices->latest_dev = latest_dev;
|
2012-02-21 09:53:43 +08:00
|
|
|
|
2008-05-14 01:46:40 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
}
|
2008-05-14 04:03:06 +08:00
|
|
|
|
2016-07-22 06:04:53 +08:00
|
|
|
static void btrfs_close_bdev(struct btrfs_device *device)
|
|
|
|
{
|
2017-06-19 22:55:35 +08:00
|
|
|
if (!device->bdev)
|
|
|
|
return;
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2016-07-22 06:04:53 +08:00
|
|
|
sync_blockdev(device->bdev);
|
|
|
|
invalidate_bdev(device->bdev);
|
|
|
|
}
|
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
fput(device->bdev_file);
|
2016-07-22 06:04:53 +08:00
|
|
|
}
|
|
|
|
|
2018-06-29 13:26:05 +08:00
|
|
|
static void btrfs_close_one_device(struct btrfs_device *device)
|
2016-06-14 18:55:25 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = device->fs_devices;
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
|
2016-06-14 18:55:25 +08:00
|
|
|
device->devid != BTRFS_DEV_REPLACE_DEVID) {
|
|
|
|
list_del_init(&device->dev_alloc_list);
|
|
|
|
fs_devices->rw_devices--;
|
|
|
|
}
|
|
|
|
|
btrfs: reset replace target device to allocation state on close
This crash was observed with a failed assertion on device close:
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
Call Trace:
flush_space+0x197/0x2f0 [btrfs]
btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
process_one_work+0x262/0x5e0
worker_thread+0x4c/0x320
? process_one_work+0x5e0/0x5e0
kthread+0x144/0x170
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
irq event stamp: 19334989
hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
---[ end trace 45939e308e0dd3c7 ]---
BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
BTRFS info (device vdd): forced readonly
BTRFS warning (device vdd): failed setting block group ro: -30
BTRFS info (device vdd): suspending dev_replace for unmount
assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3431!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
Call Trace:
btrfs_close_one_device.cold+0x11/0x55 [btrfs]
close_fs_devices+0x44/0xb0 [btrfs]
btrfs_close_devices+0x48/0x160 [btrfs]
generic_shutdown_super+0x69/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x2c/0xa0
cleanup_mnt+0x144/0x1b0
task_work_run+0x59/0xa0
exit_to_user_mode_loop+0xe7/0xf0
exit_to_user_mode_prepare+0xaf/0xf0
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x4a/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
This happens when close_ctree is called while a dev_replace hasn't
completed. In close_ctree, we suspend the dev_replace, but keep the
replace target around so that we can resume the dev_replace procedure
when we mount the root again. This is the call trace:
close_ctree():
btrfs_dev_replace_suspend_for_unmount();
btrfs_close_devices():
btrfs_close_fs_devices():
btrfs_close_one_device():
ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
&device->dev_state));
However, since the replace target sticks around, there is a device
with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
assertion in btrfs_close_one_device.
To fix this, if we come across the replace target device when
closing, we should properly reset it back to allocation state. This
fix also ensures that if a non-target device has a corrupted state and
has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
catch the error.
Reported-by: David Sterba <dsterba@suse.com>
Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-21 01:50:40 +08:00
|
|
|
if (device->devid == BTRFS_DEV_REPLACE_DEVID)
|
|
|
|
clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
|
|
|
|
|
btrfs: clear MISSING device status bit in btrfs_close_one_device
Reported bug: https://github.com/kdave/btrfs-progs/issues/389
There's a problem with scrub reporting aborted status but returning
error code 0, on a filesystem with missing and readded device.
Roughly these steps:
- mkfs -d raid1 dev1 dev2
- fill with data
- unmount
- make dev1 disappear
- mount -o degraded
- copy more data
- make dev1 appear again
Running scrub afterwards reports that the command was aborted, but the
system log message says the exit code was 0.
It seems that the cause of the error is decrementing
fs_devices->missing_devices but not clearing device->dev_state. Every
time we umount filesystem, it would call close_ctree, And it would
eventually involve btrfs_close_one_device to close the device, but it
only decrements fs_devices->missing_devices but does not clear the
device BTRFS_DEV_STATE_MISSING bit. Worse, this bug will cause Integer
Overflow, because every time umount, fs_devices->missing_devices will
decrease. If fs_devices->missing_devices value hit 0, it would overflow.
With added debugging:
loop1: detected capacity change from 0 to 20971520
BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311)
loop2: detected capacity change from 0 to 20971520
BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313)
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): using free space tree
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): using free space tree
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 0
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): using free space tree
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 18446744073709551615
BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 18446744073709551615
If fs_devices->missing_devices is 0, next time it would be 18446744073709551615
After apply this patch, the fs_devices->missing_devices seems to be
right:
$ truncate -s 10g test1
$ truncate -s 10g test2
$ losetup /dev/loop1 test1
$ losetup /dev/loop2 test2
$ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
$ losetup -d /dev/loop2
$ mount -o degraded /dev/loop1 /mnt/1
$ umount /mnt/1
$ mount -o degraded /dev/loop1 /mnt/1
$ umount /mnt/1
$ mount -o degraded /dev/loop1 /mnt/1
$ umount /mnt/1
$ dmesg
loop1: detected capacity change from 0 to 20971520
loop2: detected capacity change from 0 to 20971520
BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863)
BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863)
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): disk space caching is enabled
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1
BTRFS info (device loop1): checking UUID tree
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): disk space caching is enabled
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): disk space caching is enabled
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-05 01:15:33 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
|
|
|
|
clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
|
2016-06-14 18:55:25 +08:00
|
|
|
fs_devices->missing_devices--;
|
btrfs: clear MISSING device status bit in btrfs_close_one_device
Reported bug: https://github.com/kdave/btrfs-progs/issues/389
There's a problem with scrub reporting aborted status but returning
error code 0, on a filesystem with missing and readded device.
Roughly these steps:
- mkfs -d raid1 dev1 dev2
- fill with data
- unmount
- make dev1 disappear
- mount -o degraded
- copy more data
- make dev1 appear again
Running scrub afterwards reports that the command was aborted, but the
system log message says the exit code was 0.
It seems that the cause of the error is decrementing
fs_devices->missing_devices but not clearing device->dev_state. Every
time we umount filesystem, it would call close_ctree, And it would
eventually involve btrfs_close_one_device to close the device, but it
only decrements fs_devices->missing_devices but does not clear the
device BTRFS_DEV_STATE_MISSING bit. Worse, this bug will cause Integer
Overflow, because every time umount, fs_devices->missing_devices will
decrease. If fs_devices->missing_devices value hit 0, it would overflow.
With added debugging:
loop1: detected capacity change from 0 to 20971520
BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311)
loop2: detected capacity change from 0 to 20971520
BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313)
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): using free space tree
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): using free space tree
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 0
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): using free space tree
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 18446744073709551615
BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 18446744073709551615
If fs_devices->missing_devices is 0, next time it would be 18446744073709551615
After apply this patch, the fs_devices->missing_devices seems to be
right:
$ truncate -s 10g test1
$ truncate -s 10g test2
$ losetup /dev/loop1 test1
$ losetup /dev/loop2 test2
$ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
$ losetup -d /dev/loop2
$ mount -o degraded /dev/loop1 /mnt/1
$ umount /mnt/1
$ mount -o degraded /dev/loop1 /mnt/1
$ umount /mnt/1
$ mount -o degraded /dev/loop1 /mnt/1
$ umount /mnt/1
$ dmesg
loop1: detected capacity change from 0 to 20971520
loop2: detected capacity change from 0 to 20971520
BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863)
BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863)
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): disk space caching is enabled
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1
BTRFS info (device loop1): checking UUID tree
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): disk space caching is enabled
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1
BTRFS info (device loop1): flagging fs with big metadata feature
BTRFS info (device loop1): allowing degraded mounts
BTRFS info (device loop1): disk space caching is enabled
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0
BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-05 01:15:33 +08:00
|
|
|
}
|
2016-06-14 18:55:25 +08:00
|
|
|
|
2018-06-29 13:26:05 +08:00
|
|
|
btrfs_close_bdev(device);
|
2019-12-04 21:36:39 +08:00
|
|
|
if (device->bdev) {
|
2019-11-26 16:40:05 +08:00
|
|
|
fs_devices->open_devices--;
|
2019-12-04 21:36:39 +08:00
|
|
|
device->bdev = NULL;
|
2024-10-21 22:02:15 +08:00
|
|
|
device->bdev_file = NULL;
|
2016-06-14 18:55:25 +08:00
|
|
|
}
|
2019-12-04 21:36:39 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
2020-11-10 19:26:07 +08:00
|
|
|
btrfs_destroy_dev_zone_info(device);
|
2016-06-14 18:55:25 +08:00
|
|
|
|
2019-12-04 21:36:39 +08:00
|
|
|
device->fs_info = NULL;
|
|
|
|
atomic_set(&device->dev_stats_ccnt, 0);
|
|
|
|
extent_io_tree_release(&device->alloc_state);
|
2018-06-29 13:26:05 +08:00
|
|
|
|
btrfs: fix mount failure due to past and transient device flush error
When we get an error flushing one device, during a super block commit, we
record the error in the device structure, in the field 'last_flush_error'.
This is used to later check if we should error out the super block commit,
depending on whether the number of flush errors is greater than or equals
to the maximum tolerated device failures for a raid profile.
However if we get a transient device flush error, unmount the filesystem
and later try to mount it, we can fail the mount because we treat that
past error as critical and consider the device is missing. Even if it's
very likely that the error will happen again, as it's probably due to a
hardware related problem, there may be cases where the error might not
happen again. One example is during testing, and a test case like the
new generic/648 from fstests always triggers this. The test cases
generic/019 and generic/475 also trigger this scenario, but very
sporadically.
When this happens we get an error like this:
$ mount /dev/sdc /mnt
mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
$ dmesg
(...)
[12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
[12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
[12918.890853] BTRFS error (device sdc): open_ctree failed
The failure happens because when btrfs_check_rw_degradable() is called at
mount time, or at remount from RO to RW time, is sees a non zero value in
a device's ->last_flush_error attribute, and therefore considers that the
device is 'missing'.
Fix this by setting a device's ->last_flush_error to zero when we close a
device, making sure the error is not seen on the next mount attempt. We
only need to track flush errors during the current mount, so that we never
commit a super block if such errors happened.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-09 02:05:44 +08:00
|
|
|
/*
|
|
|
|
* Reset the flush error record. We might have a transient flush error
|
|
|
|
* in this mount, and if so we aborted the current transaction and set
|
|
|
|
* the fs to an error state, guaranteeing no super blocks can be further
|
|
|
|
* committed. However that error might be transient and if we unmount the
|
|
|
|
* filesystem and mount it again, we should allow the mount to succeed
|
|
|
|
* (btrfs_check_rw_degradable() should not fail) - if after mounting the
|
|
|
|
* filesystem again we still get flush errors, then we will again abort
|
|
|
|
* any transaction and set the error state, guaranteeing no commits of
|
|
|
|
* unsafe super blocks.
|
|
|
|
*/
|
|
|
|
device->last_flush_error = 0;
|
|
|
|
|
2019-12-04 21:36:39 +08:00
|
|
|
/* Verify the device is back in a pristine state */
|
2023-04-04 22:55:11 +08:00
|
|
|
WARN_ON(test_bit(BTRFS_DEV_STATE_FLUSH_SENT, &device->dev_state));
|
|
|
|
WARN_ON(test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state));
|
|
|
|
WARN_ON(!list_empty(&device->dev_alloc_list));
|
|
|
|
WARN_ON(!list_empty(&device->post_commit_list));
|
2016-06-14 18:55:25 +08:00
|
|
|
}
|
|
|
|
|
2020-07-15 18:48:48 +08:00
|
|
|
static void close_fs_devices(struct btrfs_fs_devices *fs_devices)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
2015-05-13 07:31:37 +08:00
|
|
|
struct btrfs_device *device, *tmp;
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2020-08-10 23:42:28 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (--fs_devices->opened > 0)
|
2020-07-15 18:48:48 +08:00
|
|
|
return;
|
2008-03-25 03:02:07 +08:00
|
|
|
|
2020-08-10 23:42:28 +08:00
|
|
|
list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list)
|
2018-06-29 13:26:05 +08:00
|
|
|
btrfs_close_one_device(device);
|
2011-04-20 18:07:30 +08:00
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
WARN_ON(fs_devices->open_devices);
|
|
|
|
WARN_ON(fs_devices->rw_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->opened = 0;
|
2019-11-13 18:27:27 +08:00
|
|
|
fs_devices->seeding = false;
|
2020-07-15 18:48:49 +08:00
|
|
|
fs_devices->fs_info = NULL;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
|
|
|
|
2020-07-15 18:48:48 +08:00
|
|
|
void btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2020-07-16 15:25:33 +08:00
|
|
|
LIST_HEAD(list);
|
|
|
|
struct btrfs_fs_devices *tmp;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
mutex_lock(&uuid_mutex);
|
2020-07-15 18:48:48 +08:00
|
|
|
close_fs_devices(fs_devices);
|
2023-01-20 21:47:16 +08:00
|
|
|
if (!fs_devices->opened) {
|
2020-07-16 15:25:33 +08:00
|
|
|
list_splice_init(&fs_devices->seed_list, &list);
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2023-01-20 21:47:16 +08:00
|
|
|
/*
|
|
|
|
* If the struct btrfs_fs_devices is not assembled with any
|
|
|
|
* other device, it can be re-initialized during the next mount
|
|
|
|
* without the needing device-scan step. Therefore, it can be
|
|
|
|
* fully freed.
|
|
|
|
*/
|
|
|
|
if (fs_devices->num_devices == 1) {
|
|
|
|
list_del(&fs_devices->fs_list);
|
|
|
|
free_fs_devices(fs_devices);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2020-07-16 15:25:33 +08:00
|
|
|
list_for_each_entry_safe(fs_devices, tmp, &list, seed_list) {
|
2018-04-12 10:29:27 +08:00
|
|
|
close_fs_devices(fs_devices);
|
2020-07-16 15:25:33 +08:00
|
|
|
list_del(&fs_devices->seed_list);
|
2008-12-12 23:03:26 +08:00
|
|
|
free_fs_devices(fs_devices);
|
|
|
|
}
|
2020-08-10 23:42:28 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2018-04-12 10:29:28 +08:00
|
|
|
static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
|
2023-06-08 19:02:55 +08:00
|
|
|
blk_mode_t flags, void *holder)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
2014-07-24 11:37:15 +08:00
|
|
|
struct btrfs_device *latest_dev = NULL;
|
2020-09-30 21:09:52 +08:00
|
|
|
struct btrfs_device *tmp_device;
|
2024-03-19 10:58:18 +08:00
|
|
|
int ret = 0;
|
2008-03-25 03:02:07 +08:00
|
|
|
|
2020-09-30 21:09:52 +08:00
|
|
|
list_for_each_entry_safe(device, tmp_device, &fs_devices->devices,
|
|
|
|
dev_list) {
|
2024-03-19 10:58:18 +08:00
|
|
|
int ret2;
|
2008-05-14 04:03:06 +08:00
|
|
|
|
2024-03-19 10:58:18 +08:00
|
|
|
ret2 = btrfs_open_one_device(fs_devices, device, flags, holder);
|
|
|
|
if (ret2 == 0 &&
|
2020-09-30 21:09:52 +08:00
|
|
|
(!latest_dev || device->generation > latest_dev->generation)) {
|
2017-11-09 23:45:25 +08:00
|
|
|
latest_dev = device;
|
2024-03-19 10:58:18 +08:00
|
|
|
} else if (ret2 == -ENODATA) {
|
2020-09-30 21:09:52 +08:00
|
|
|
fs_devices->num_devices--;
|
|
|
|
list_del(&device->dev_list);
|
|
|
|
btrfs_free_device(device);
|
|
|
|
}
|
2024-03-19 10:58:18 +08:00
|
|
|
if (ret == 0 && ret2 != 0)
|
|
|
|
ret = ret2;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2024-03-19 10:58:18 +08:00
|
|
|
|
|
|
|
if (fs_devices->open_devices == 0) {
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2020-04-28 23:22:25 +08:00
|
|
|
return -EINVAL;
|
2024-03-19 10:58:18 +08:00
|
|
|
}
|
2020-04-28 23:22:25 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->opened = 1;
|
2021-08-24 13:05:19 +08:00
|
|
|
fs_devices->latest_dev = latest_dev;
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->total_rw_bytes = 0;
|
2020-02-25 11:56:08 +08:00
|
|
|
fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_REGULAR;
|
2020-10-28 21:14:46 +08:00
|
|
|
fs_devices->read_policy = BTRFS_READ_POLICY_PID;
|
2020-04-28 23:22:25 +08:00
|
|
|
|
|
|
|
return 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2021-04-09 02:28:34 +08:00
|
|
|
static int devid_cmp(void *priv, const struct list_head *a,
|
|
|
|
const struct list_head *b)
|
2018-01-23 06:49:36 +08:00
|
|
|
{
|
2021-07-26 20:15:26 +08:00
|
|
|
const struct btrfs_device *dev1, *dev2;
|
2018-01-23 06:49:36 +08:00
|
|
|
|
|
|
|
dev1 = list_entry(a, struct btrfs_device, dev_list);
|
|
|
|
dev2 = list_entry(b, struct btrfs_device, dev_list);
|
|
|
|
|
|
|
|
if (dev1->devid < dev2->devid)
|
|
|
|
return -1;
|
|
|
|
else if (dev1->devid > dev2->devid)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
|
2023-06-08 19:02:55 +08:00
|
|
|
blk_mode_t flags, void *holder)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2018-06-19 23:09:47 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
btrfs: open device without device_list_mutex
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-18 03:12:27 +08:00
|
|
|
/*
|
|
|
|
* The device_list_mutex cannot be taken here in case opening the
|
2021-05-25 14:12:56 +08:00
|
|
|
* underlying device takes further locks like open_mutex.
|
btrfs: open device without device_list_mutex
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-18 03:12:27 +08:00
|
|
|
*
|
|
|
|
* We also don't need the lock here as this is called during mount and
|
|
|
|
* exclusion is provided by uuid_mutex
|
|
|
|
*/
|
2018-06-19 23:09:47 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (fs_devices->opened) {
|
2008-12-12 23:03:26 +08:00
|
|
|
fs_devices->opened++;
|
|
|
|
ret = 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
} else {
|
2018-01-23 06:49:36 +08:00
|
|
|
list_sort(NULL, &fs_devices->devices, devid_cmp);
|
2018-04-12 10:29:28 +08:00
|
|
|
ret = open_fs_devices(fs_devices, flags, holder);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2018-04-12 10:29:34 +08:00
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-02-13 23:24:32 +08:00
|
|
|
void btrfs_release_disk_super(struct btrfs_super_block *super)
|
2016-02-13 10:01:29 +08:00
|
|
|
{
|
2020-02-13 23:24:32 +08:00
|
|
|
struct page *page = virt_to_page(super);
|
|
|
|
|
2016-02-13 10:01:29 +08:00
|
|
|
put_page(page);
|
|
|
|
}
|
|
|
|
|
2020-04-15 20:53:46 +08:00
|
|
|
static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 19:26:14 +08:00
|
|
|
u64 bytenr, u64 bytenr_orig)
|
2016-02-13 10:01:29 +08:00
|
|
|
{
|
2020-04-15 20:53:46 +08:00
|
|
|
struct btrfs_super_block *disk_super;
|
|
|
|
struct page *page;
|
2016-02-13 10:01:29 +08:00
|
|
|
void *p;
|
|
|
|
pgoff_t index;
|
|
|
|
|
|
|
|
/* make sure our super fits in the device */
|
2021-10-18 18:11:12 +08:00
|
|
|
if (bytenr + PAGE_SIZE >= bdev_nr_bytes(bdev))
|
2020-04-15 20:53:46 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
2016-02-13 10:01:29 +08:00
|
|
|
|
|
|
|
/* make sure our super fits in the page */
|
2020-04-15 20:53:46 +08:00
|
|
|
if (sizeof(*disk_super) > PAGE_SIZE)
|
|
|
|
return ERR_PTR(-EINVAL);
|
2016-02-13 10:01:29 +08:00
|
|
|
|
|
|
|
/* make sure our super doesn't straddle pages on disk */
|
|
|
|
index = bytenr >> PAGE_SHIFT;
|
2020-04-15 20:53:46 +08:00
|
|
|
if ((bytenr + sizeof(*disk_super) - 1) >> PAGE_SHIFT != index)
|
|
|
|
return ERR_PTR(-EINVAL);
|
2016-02-13 10:01:29 +08:00
|
|
|
|
|
|
|
/* pull in the page with our super */
|
2024-04-11 22:53:37 +08:00
|
|
|
page = read_cache_page_gfp(bdev->bd_mapping, index, GFP_KERNEL);
|
2016-02-13 10:01:29 +08:00
|
|
|
|
2020-04-15 20:53:46 +08:00
|
|
|
if (IS_ERR(page))
|
|
|
|
return ERR_CAST(page);
|
2016-02-13 10:01:29 +08:00
|
|
|
|
2020-04-15 20:53:46 +08:00
|
|
|
p = page_address(page);
|
2016-02-13 10:01:29 +08:00
|
|
|
|
|
|
|
/* align our pointer to the offset of the super block */
|
2020-04-15 20:53:46 +08:00
|
|
|
disk_super = p + offset_in_page(bytenr);
|
2016-02-13 10:01:29 +08:00
|
|
|
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 19:26:14 +08:00
|
|
|
if (btrfs_super_bytenr(disk_super) != bytenr_orig ||
|
2020-04-15 20:53:46 +08:00
|
|
|
btrfs_super_magic(disk_super) != BTRFS_MAGIC) {
|
2020-02-13 23:24:32 +08:00
|
|
|
btrfs_release_disk_super(p);
|
2020-04-15 20:53:46 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
2016-02-13 10:01:29 +08:00
|
|
|
}
|
|
|
|
|
2020-04-15 20:53:46 +08:00
|
|
|
if (disk_super->label[0] && disk_super->label[BTRFS_LABEL_SIZE - 1])
|
|
|
|
disk_super->label[BTRFS_LABEL_SIZE - 1] = 0;
|
2016-02-13 10:01:29 +08:00
|
|
|
|
2020-04-15 20:53:46 +08:00
|
|
|
return disk_super;
|
2016-02-13 10:01:29 +08:00
|
|
|
}
|
|
|
|
|
2022-01-12 13:06:00 +08:00
|
|
|
int btrfs_forget_devices(dev_t devt)
|
2019-01-04 13:31:54 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
mutex_lock(&uuid_mutex);
|
2022-01-12 13:06:00 +08:00
|
|
|
ret = btrfs_free_stale_devices(devt, NULL);
|
2019-01-04 13:31:54 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: do not skip re-registration for the mounted device
There are reports that since version 6.7 update-grub fails to find the
device of the root on systems without initrd and on a single device.
This looks like the device name changed in the output of
/proc/self/mountinfo:
6.5-rc5 working
18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...
6.7 not working:
17 1 0:15 / / rw,noatime - btrfs /dev/root ...
and "update-grub" shows this error:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)
This looks like it's related to the device name, but grub-probe
recognizes the "/dev/root" path and tries to find the underlying device.
However there's a special case for some filesystems, for btrfs in
particular.
The generic root device detection heuristic is not done and it all
relies on reading the device infos by a btrfs specific ioctl. This ioctl
returns the device name as it was saved at the time of device scan (in
this case it's /dev/root).
The change in 6.7 for temp_fsid to allow several single device
filesystem to exist with the same fsid (and transparently generate a new
UUID at mount time) was to skip caching/registering such devices.
This also skipped mounted device. One step of scanning is to check if
the device name hasn't changed, and if yes then update the cached value.
This broke the grub-probe as it always read the device /dev/root and
couldn't find it in the system. A temporary workaround is to create a
symlink but this does not survive reboot.
The right fix is to allow updating the device path of a mounted
filesystem even if this is a single device one.
In the fix, check if the device's major:minor number matches with the
cached device. If they do, then we can allow the scan to happen so that
device_list_add() can take care of updating the device path. The file
descriptor remains unchanged.
This does not affect the temp_fsid feature, the UUID of the mounted
filesystem remains the same and the matching is based on device major:minor
which is unique per mounted filesystem.
This covers the path when the device (that exists for all mounted
devices) name changes, updating /dev/root to /dev/sdx. Any other single
device with filesystem and is not mounted is still skipped.
Note that if a system is booted and initial mount is done on the
/dev/root device, this will be the cached name of the device. Only after
the command "btrfs device scan" it will change as it triggers the
rename.
The fix was verified by users whose systems were affected.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
Fixes: bc27d6f0aa0e ("btrfs: scan but don't register device on single device filesystem")
CC: stable@vger.kernel.org # 6.7+
Tested-by: Alex Romosan <aromosan@gmail.com>
Tested-by: CHECK_1234543212345@protonmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 09:13:56 +08:00
|
|
|
static bool btrfs_skip_registration(struct btrfs_super_block *disk_super,
|
|
|
|
const char *path, dev_t devt,
|
|
|
|
bool mount_arg_dev)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do not skip device registration for mounted devices with matching
|
|
|
|
* maj:min but different paths. Booting without initrd relies on
|
|
|
|
* /dev/root initially, later replaced with the actual root device.
|
|
|
|
* A successful scan ensures grub2-probe selects the correct device.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
|
|
|
|
struct btrfs_device *device;
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
if (!fs_devices->opened) {
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list) {
|
|
|
|
if (device->bdev && (device->bdev->bd_dev == devt) &&
|
|
|
|
strcmp(device->name->str, path) != 0) {
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
/* Do not skip registration. */
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!mount_arg_dev && btrfs_super_num_devices(disk_super) == 1 &&
|
|
|
|
!(btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2013-02-16 02:31:02 +08:00
|
|
|
/*
|
|
|
|
* Look for a btrfs signature on a device. This may be called out of the mount path
|
|
|
|
* and we are not allowed to call set_blocksize during the scan. The superblock
|
2023-09-09 00:31:55 +08:00
|
|
|
* is read via pagecache.
|
|
|
|
*
|
|
|
|
* With @mount_arg_dev it's a scan during mount time that will always register
|
|
|
|
* the device or return an error. Multi-device and seeding devices are registered
|
|
|
|
* in both cases.
|
2013-02-16 02:31:02 +08:00
|
|
|
*/
|
2023-09-09 00:31:55 +08:00
|
|
|
struct btrfs_device *btrfs_scan_one_device(const char *path, blk_mode_t flags,
|
|
|
|
bool mount_arg_dev)
|
2008-03-25 03:02:07 +08:00
|
|
|
{
|
|
|
|
struct btrfs_super_block *disk_super;
|
2018-05-29 12:28:37 +08:00
|
|
|
bool new_device_added = false;
|
2018-07-12 14:23:16 +08:00
|
|
|
struct btrfs_device *device = NULL;
|
2024-01-23 21:26:36 +08:00
|
|
|
struct file *bdev_file;
|
2024-05-16 11:10:23 +08:00
|
|
|
u64 bytenr;
|
btrfs: do not skip re-registration for the mounted device
There are reports that since version 6.7 update-grub fails to find the
device of the root on systems without initrd and on a single device.
This looks like the device name changed in the output of
/proc/self/mountinfo:
6.5-rc5 working
18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...
6.7 not working:
17 1 0:15 / / rw,noatime - btrfs /dev/root ...
and "update-grub" shows this error:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)
This looks like it's related to the device name, but grub-probe
recognizes the "/dev/root" path and tries to find the underlying device.
However there's a special case for some filesystems, for btrfs in
particular.
The generic root device detection heuristic is not done and it all
relies on reading the device infos by a btrfs specific ioctl. This ioctl
returns the device name as it was saved at the time of device scan (in
this case it's /dev/root).
The change in 6.7 for temp_fsid to allow several single device
filesystem to exist with the same fsid (and transparently generate a new
UUID at mount time) was to skip caching/registering such devices.
This also skipped mounted device. One step of scanning is to check if
the device name hasn't changed, and if yes then update the cached value.
This broke the grub-probe as it always read the device /dev/root and
couldn't find it in the system. A temporary workaround is to create a
symlink but this does not survive reboot.
The right fix is to allow updating the device path of a mounted
filesystem even if this is a single device one.
In the fix, check if the device's major:minor number matches with the
cached device. If they do, then we can allow the scan to happen so that
device_list_add() can take care of updating the device path. The file
descriptor remains unchanged.
This does not affect the temp_fsid feature, the UUID of the mounted
filesystem remains the same and the matching is based on device major:minor
which is unique per mounted filesystem.
This covers the path when the device (that exists for all mounted
devices) name changes, updating /dev/root to /dev/sdx. Any other single
device with filesystem and is not mounted is still skipped.
Note that if a system is booted and initial mount is done on the
/dev/root device, this will be the cached name of the device. Only after
the command "btrfs device scan" it will change as it triggers the
rename.
The fix was verified by users whose systems were affected.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
Fixes: bc27d6f0aa0e ("btrfs: scan but don't register device on single device filesystem")
CC: stable@vger.kernel.org # 6.7+
Tested-by: Alex Romosan <aromosan@gmail.com>
Tested-by: CHECK_1234543212345@protonmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 09:13:56 +08:00
|
|
|
dev_t devt;
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 19:26:14 +08:00
|
|
|
int ret;
|
2008-03-25 03:02:07 +08:00
|
|
|
|
2018-06-19 22:37:36 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
|
|
|
|
btrfs: scan device in non-exclusive mode
This fixes mkfs/mount/check failures due to race with systemd-udevd
scan.
During the device scan initiated by systemd-udevd, other user space
EXCL operations such as mkfs, mount, or check may get blocked and result
in a "Device or resource busy" error. This is because the device
scan process opens the device with the EXCL flag in the kernel.
Two reports were received:
- btrfs/179 test case, where the fsck command failed with the -EBUSY
error
- LTP pwritev03 test case, where mkfs.vfs failed with
the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem
on the device.
In both cases, fsck and mkfs (respectively) were racing with a
systemd-udevd device scan, and systemd-udevd won, resulting in the
-EBUSY error for fsck and mkfs.
Reproducing the problem has been difficult because there is a very
small window during which these userspace threads can race to
acquire the exclusive device open. Even on the system where the problem
was observed, the problem occurrences were anywhere between 10 to 400
iterations and chances of reproducing decreases with debug printk()s.
However, an exclusive device open is unnecessary for the scan process,
as there are no write operations on the device during scan. Furthermore,
during the mount process, the superblock is re-read in the below
function call chain:
btrfs_mount_root
btrfs_open_devices
open_fs_devices
btrfs_open_one_device
btrfs_get_bdev_and_sb
So, to fix this issue, removes the FMODE_EXCL flag from the scan
operation, and add a comment.
The case where mkfs may still write to the device and a scan is running,
the btrfs signature is not written at that time so scan will not
recognize such device.
Reported-by: Sherry Yang <sherry.yang@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-23 15:56:48 +08:00
|
|
|
/*
|
2023-06-08 19:02:42 +08:00
|
|
|
* Avoid an exclusive open here, as the systemd-udev may initiate the
|
|
|
|
* device scan which may race with the user's mount or mkfs command,
|
|
|
|
* resulting in failure.
|
|
|
|
* Since the device scan is solely for reading purposes, there is no
|
|
|
|
* need for an exclusive open. Additionally, the devices are read again
|
btrfs: scan device in non-exclusive mode
This fixes mkfs/mount/check failures due to race with systemd-udevd
scan.
During the device scan initiated by systemd-udevd, other user space
EXCL operations such as mkfs, mount, or check may get blocked and result
in a "Device or resource busy" error. This is because the device
scan process opens the device with the EXCL flag in the kernel.
Two reports were received:
- btrfs/179 test case, where the fsck command failed with the -EBUSY
error
- LTP pwritev03 test case, where mkfs.vfs failed with
the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem
on the device.
In both cases, fsck and mkfs (respectively) were racing with a
systemd-udevd device scan, and systemd-udevd won, resulting in the
-EBUSY error for fsck and mkfs.
Reproducing the problem has been difficult because there is a very
small window during which these userspace threads can race to
acquire the exclusive device open. Even on the system where the problem
was observed, the problem occurrences were anywhere between 10 to 400
iterations and chances of reproducing decreases with debug printk()s.
However, an exclusive device open is unnecessary for the scan process,
as there are no write operations on the device during scan. Furthermore,
during the mount process, the superblock is re-read in the below
function call chain:
btrfs_mount_root
btrfs_open_devices
open_fs_devices
btrfs_open_one_device
btrfs_get_bdev_and_sb
So, to fix this issue, removes the FMODE_EXCL flag from the scan
operation, and add a comment.
The case where mkfs may still write to the device and a scan is running,
the btrfs signature is not written at that time so scan will not
recognize such device.
Reported-by: Sherry Yang <sherry.yang@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-23 15:56:48 +08:00
|
|
|
* during the mount process. It is ok to get some inconsistent
|
|
|
|
* values temporarily, as the device paths of the fsid are the only
|
|
|
|
* required information for assembling the volume.
|
|
|
|
*/
|
2024-01-23 21:26:36 +08:00
|
|
|
bdev_file = bdev_file_open_by_path(path, flags, NULL, NULL);
|
|
|
|
if (IS_ERR(bdev_file))
|
|
|
|
return ERR_CAST(bdev_file);
|
2013-02-16 02:31:02 +08:00
|
|
|
|
2024-05-16 11:10:23 +08:00
|
|
|
/*
|
|
|
|
* We would like to check all the super blocks, but doing so would
|
|
|
|
* allow a mount to succeed after a mkfs from a different filesystem.
|
|
|
|
* Currently, recovery from a bad primary btrfs superblock is done
|
|
|
|
* using the userspace command 'btrfs check --super'.
|
|
|
|
*/
|
2024-01-23 21:26:36 +08:00
|
|
|
ret = btrfs_sb_log_location_bdev(file_bdev(bdev_file), 0, READ, &bytenr);
|
2021-12-15 18:38:43 +08:00
|
|
|
if (ret) {
|
|
|
|
device = ERR_PTR(ret);
|
|
|
|
goto error_bdev_put;
|
|
|
|
}
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 19:26:14 +08:00
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
disk_super = btrfs_read_disk_super(file_bdev(bdev_file), bytenr,
|
2024-05-16 11:10:23 +08:00
|
|
|
btrfs_sb_offset(0));
|
2020-04-15 20:53:46 +08:00
|
|
|
if (IS_ERR(disk_super)) {
|
|
|
|
device = ERR_CAST(disk_super);
|
2013-02-16 02:31:02 +08:00
|
|
|
goto error_bdev_put;
|
2017-12-15 15:40:16 +08:00
|
|
|
}
|
2013-02-16 02:31:02 +08:00
|
|
|
|
btrfs: do not skip re-registration for the mounted device
There are reports that since version 6.7 update-grub fails to find the
device of the root on systems without initrd and on a single device.
This looks like the device name changed in the output of
/proc/self/mountinfo:
6.5-rc5 working
18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...
6.7 not working:
17 1 0:15 / / rw,noatime - btrfs /dev/root ...
and "update-grub" shows this error:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)
This looks like it's related to the device name, but grub-probe
recognizes the "/dev/root" path and tries to find the underlying device.
However there's a special case for some filesystems, for btrfs in
particular.
The generic root device detection heuristic is not done and it all
relies on reading the device infos by a btrfs specific ioctl. This ioctl
returns the device name as it was saved at the time of device scan (in
this case it's /dev/root).
The change in 6.7 for temp_fsid to allow several single device
filesystem to exist with the same fsid (and transparently generate a new
UUID at mount time) was to skip caching/registering such devices.
This also skipped mounted device. One step of scanning is to check if
the device name hasn't changed, and if yes then update the cached value.
This broke the grub-probe as it always read the device /dev/root and
couldn't find it in the system. A temporary workaround is to create a
symlink but this does not survive reboot.
The right fix is to allow updating the device path of a mounted
filesystem even if this is a single device one.
In the fix, check if the device's major:minor number matches with the
cached device. If they do, then we can allow the scan to happen so that
device_list_add() can take care of updating the device path. The file
descriptor remains unchanged.
This does not affect the temp_fsid feature, the UUID of the mounted
filesystem remains the same and the matching is based on device major:minor
which is unique per mounted filesystem.
This covers the path when the device (that exists for all mounted
devices) name changes, updating /dev/root to /dev/sdx. Any other single
device with filesystem and is not mounted is still skipped.
Note that if a system is booted and initial mount is done on the
/dev/root device, this will be the cached name of the device. Only after
the command "btrfs device scan" it will change as it triggers the
rename.
The fix was verified by users whose systems were affected.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
Fixes: bc27d6f0aa0e ("btrfs: scan but don't register device on single device filesystem")
CC: stable@vger.kernel.org # 6.7+
Tested-by: Alex Romosan <aromosan@gmail.com>
Tested-by: CHECK_1234543212345@protonmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 09:13:56 +08:00
|
|
|
devt = file_bdev(bdev_file)->bd_dev;
|
|
|
|
if (btrfs_skip_registration(disk_super, path, devt, mount_arg_dev)) {
|
|
|
|
pr_debug("BTRFS: skip registering single non-seed device %s (%d:%d)\n",
|
|
|
|
path, MAJOR(devt), MINOR(devt));
|
2023-09-09 00:31:55 +08:00
|
|
|
|
btrfs: do not skip re-registration for the mounted device
There are reports that since version 6.7 update-grub fails to find the
device of the root on systems without initrd and on a single device.
This looks like the device name changed in the output of
/proc/self/mountinfo:
6.5-rc5 working
18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...
6.7 not working:
17 1 0:15 / / rw,noatime - btrfs /dev/root ...
and "update-grub" shows this error:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)
This looks like it's related to the device name, but grub-probe
recognizes the "/dev/root" path and tries to find the underlying device.
However there's a special case for some filesystems, for btrfs in
particular.
The generic root device detection heuristic is not done and it all
relies on reading the device infos by a btrfs specific ioctl. This ioctl
returns the device name as it was saved at the time of device scan (in
this case it's /dev/root).
The change in 6.7 for temp_fsid to allow several single device
filesystem to exist with the same fsid (and transparently generate a new
UUID at mount time) was to skip caching/registering such devices.
This also skipped mounted device. One step of scanning is to check if
the device name hasn't changed, and if yes then update the cached value.
This broke the grub-probe as it always read the device /dev/root and
couldn't find it in the system. A temporary workaround is to create a
symlink but this does not survive reboot.
The right fix is to allow updating the device path of a mounted
filesystem even if this is a single device one.
In the fix, check if the device's major:minor number matches with the
cached device. If they do, then we can allow the scan to happen so that
device_list_add() can take care of updating the device path. The file
descriptor remains unchanged.
This does not affect the temp_fsid feature, the UUID of the mounted
filesystem remains the same and the matching is based on device major:minor
which is unique per mounted filesystem.
This covers the path when the device (that exists for all mounted
devices) name changes, updating /dev/root to /dev/sdx. Any other single
device with filesystem and is not mounted is still skipped.
Note that if a system is booted and initial mount is done on the
/dev/root device, this will be the cached name of the device. Only after
the command "btrfs device scan" it will change as it triggers the
rename.
The fix was verified by users whose systems were affected.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
Fixes: bc27d6f0aa0e ("btrfs: scan but don't register device on single device filesystem")
CC: stable@vger.kernel.org # 6.7+
Tested-by: Alex Romosan <aromosan@gmail.com>
Tested-by: CHECK_1234543212345@protonmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 09:13:56 +08:00
|
|
|
btrfs_free_stale_devices(devt, NULL);
|
2023-09-09 00:31:55 +08:00
|
|
|
|
|
|
|
device = NULL;
|
|
|
|
goto free_disk_super;
|
|
|
|
}
|
|
|
|
|
2018-05-29 12:28:37 +08:00
|
|
|
device = device_list_add(path, disk_super, &new_device_added);
|
2022-01-12 13:06:01 +08:00
|
|
|
if (!IS_ERR(device) && new_device_added)
|
|
|
|
btrfs_free_stale_devices(device->devt, device);
|
2013-02-16 02:31:02 +08:00
|
|
|
|
2023-09-09 00:31:55 +08:00
|
|
|
free_disk_super:
|
2020-02-13 23:24:32 +08:00
|
|
|
btrfs_release_disk_super(disk_super);
|
2013-02-16 02:31:02 +08:00
|
|
|
|
|
|
|
error_bdev_put:
|
2024-01-23 21:26:36 +08:00
|
|
|
fput(bdev_file);
|
2018-04-12 10:29:24 +08:00
|
|
|
|
2018-07-12 14:23:16 +08:00
|
|
|
return device;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2019-03-27 20:24:12 +08:00
|
|
|
/*
|
|
|
|
* Try to find a chunk that intersects [start, start + len] range and when one
|
|
|
|
* such is found, record the end of it in *start
|
|
|
|
*/
|
|
|
|
static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
|
|
|
|
u64 len)
|
2013-06-28 01:22:46 +08:00
|
|
|
{
|
2019-03-27 20:24:12 +08:00
|
|
|
u64 physical_start, physical_end;
|
2013-06-28 01:22:46 +08:00
|
|
|
|
2019-03-27 20:24:12 +08:00
|
|
|
lockdep_assert_held(&device->fs_info->chunk_mutex);
|
2013-06-28 01:22:46 +08:00
|
|
|
|
2023-06-30 23:03:49 +08:00
|
|
|
if (find_first_extent_bit(&device->alloc_state, *start,
|
|
|
|
&physical_start, &physical_end,
|
|
|
|
CHUNK_ALLOCATED, NULL)) {
|
2015-05-14 17:46:03 +08:00
|
|
|
|
2019-03-27 20:24:12 +08:00
|
|
|
if (in_range(physical_start, *start, len) ||
|
|
|
|
in_range(*start, physical_start,
|
2024-02-29 18:37:04 +08:00
|
|
|
physical_end + 1 - physical_start)) {
|
2019-03-27 20:24:12 +08:00
|
|
|
*start = physical_end + 1;
|
|
|
|
return true;
|
2013-06-28 01:22:46 +08:00
|
|
|
}
|
|
|
|
}
|
2019-03-27 20:24:12 +08:00
|
|
|
return false;
|
2013-06-28 01:22:46 +08:00
|
|
|
}
|
|
|
|
|
2023-07-26 23:57:09 +08:00
|
|
|
static u64 dev_extent_search_start(struct btrfs_device *device)
|
2020-02-25 11:56:09 +08:00
|
|
|
{
|
|
|
|
switch (device->fs_devices->chunk_alloc_policy) {
|
|
|
|
case BTRFS_CHUNK_ALLOC_REGULAR:
|
2023-07-26 23:57:09 +08:00
|
|
|
return BTRFS_DEVICE_RANGE_RESERVED;
|
2021-02-04 18:21:48 +08:00
|
|
|
case BTRFS_CHUNK_ALLOC_ZONED:
|
|
|
|
/*
|
|
|
|
* We don't care about the starting region like regular
|
|
|
|
* allocator, because we anyway use/reserve the first two zones
|
|
|
|
* for superblock logging.
|
|
|
|
*/
|
2023-07-26 23:57:09 +08:00
|
|
|
return 0;
|
2020-02-25 11:56:09 +08:00
|
|
|
default:
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-02-04 18:21:48 +08:00
|
|
|
static bool dev_extent_hole_check_zoned(struct btrfs_device *device,
|
|
|
|
u64 *hole_start, u64 *hole_size,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
u64 zone_size = device->zone_info->zone_size;
|
|
|
|
u64 pos;
|
|
|
|
int ret;
|
|
|
|
bool changed = false;
|
|
|
|
|
|
|
|
ASSERT(IS_ALIGNED(*hole_start, zone_size));
|
|
|
|
|
|
|
|
while (*hole_size > 0) {
|
|
|
|
pos = btrfs_find_allocatable_zones(device, *hole_start,
|
|
|
|
*hole_start + *hole_size,
|
|
|
|
num_bytes);
|
|
|
|
if (pos != *hole_start) {
|
|
|
|
*hole_size = *hole_start + *hole_size - pos;
|
|
|
|
*hole_start = pos;
|
|
|
|
changed = true;
|
|
|
|
if (*hole_size < num_bytes)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_ensure_empty_zones(device, pos, num_bytes);
|
|
|
|
|
|
|
|
/* Range is ensured to be empty */
|
|
|
|
if (!ret)
|
|
|
|
return changed;
|
|
|
|
|
|
|
|
/* Given hole range was invalid (outside of device) */
|
|
|
|
if (ret == -ERANGE) {
|
|
|
|
*hole_start += *hole_size;
|
2021-05-10 21:39:38 +08:00
|
|
|
*hole_size = 0;
|
2021-03-03 17:45:28 +08:00
|
|
|
return true;
|
2021-02-04 18:21:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
*hole_start += zone_size;
|
|
|
|
*hole_size -= zone_size;
|
|
|
|
changed = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return changed;
|
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* Check if specified hole is suitable for allocation.
|
|
|
|
*
|
2020-02-25 11:56:09 +08:00
|
|
|
* @device: the device which we have the hole
|
|
|
|
* @hole_start: starting position of the hole
|
|
|
|
* @hole_size: the size of the hole
|
|
|
|
* @num_bytes: the size of the free space that we need
|
|
|
|
*
|
2021-02-04 18:21:48 +08:00
|
|
|
* This function may modify @hole_start and @hole_size to reflect the suitable
|
2020-02-25 11:56:09 +08:00
|
|
|
* position for allocation. Returns 1 if hole position is updated, 0 otherwise.
|
|
|
|
*/
|
|
|
|
static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start,
|
|
|
|
u64 *hole_size, u64 num_bytes)
|
|
|
|
{
|
|
|
|
bool changed = false;
|
|
|
|
u64 hole_end = *hole_start + *hole_size;
|
|
|
|
|
2021-02-04 18:21:48 +08:00
|
|
|
for (;;) {
|
|
|
|
/*
|
|
|
|
* Check before we set max_hole_start, otherwise we could end up
|
|
|
|
* sending back this offset anyway.
|
|
|
|
*/
|
|
|
|
if (contains_pending_extent(device, hole_start, *hole_size)) {
|
|
|
|
if (hole_end >= *hole_start)
|
|
|
|
*hole_size = hole_end - *hole_start;
|
|
|
|
else
|
|
|
|
*hole_size = 0;
|
|
|
|
changed = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (device->fs_devices->chunk_alloc_policy) {
|
|
|
|
case BTRFS_CHUNK_ALLOC_REGULAR:
|
|
|
|
/* No extra check */
|
|
|
|
break;
|
|
|
|
case BTRFS_CHUNK_ALLOC_ZONED:
|
|
|
|
if (dev_extent_hole_check_zoned(device, hole_start,
|
|
|
|
hole_size, num_bytes)) {
|
|
|
|
changed = true;
|
|
|
|
/*
|
|
|
|
* The changed hole can contain pending extent.
|
|
|
|
* Loop again to check that.
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
BUG();
|
|
|
|
}
|
2020-02-25 11:56:09 +08:00
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return changed;
|
|
|
|
}
|
2013-06-28 01:22:46 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
/*
|
2022-10-27 20:21:42 +08:00
|
|
|
* Find free space in the specified device.
|
|
|
|
*
|
2015-06-15 21:41:17 +08:00
|
|
|
* @device: the device which we search the free space in
|
|
|
|
* @num_bytes: the size of the free space that we need
|
|
|
|
* @search_start: the position from which to begin the search
|
|
|
|
* @start: store the start of the free space.
|
|
|
|
* @len: the size of the free space. that we find, or the size
|
|
|
|
* of the max free space if we don't find suitable free space
|
2011-01-05 18:07:26 +08:00
|
|
|
*
|
2022-10-27 20:21:42 +08:00
|
|
|
* This does a pretty simple search, the expectation is that it is called very
|
|
|
|
* infrequently and that a given device has a small number of extents.
|
2011-01-05 18:07:26 +08:00
|
|
|
*
|
|
|
|
* @start is used to store the start of the free space if we find. But if we
|
|
|
|
* don't find suitable free space, it will be used to store the start position
|
|
|
|
* of the max free space.
|
|
|
|
*
|
|
|
|
* @len is used to store the size of the free space that we find.
|
|
|
|
* But if we don't find suitable free space, it is used to store the size of
|
|
|
|
* the max free space.
|
2019-07-19 14:51:42 +08:00
|
|
|
*
|
|
|
|
* NOTE: This function will search *commit* root of device tree, and does extra
|
|
|
|
* check to ensure dev extents are not double allocated.
|
|
|
|
* This makes the function safe to allocate dev extents but may not report
|
|
|
|
* correct usable device space, as device extent freed in current transaction
|
2021-05-21 23:42:23 +08:00
|
|
|
* is not reported as available.
|
2008-03-25 03:01:56 +08:00
|
|
|
*/
|
2023-07-26 23:57:09 +08:00
|
|
|
static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
|
|
|
|
u64 *start, u64 *len)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_key key;
|
2011-01-05 18:07:26 +08:00
|
|
|
struct btrfs_dev_extent *dev_extent;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_path *path;
|
2023-07-26 23:57:09 +08:00
|
|
|
u64 search_start;
|
2011-01-05 18:07:26 +08:00
|
|
|
u64 hole_size;
|
|
|
|
u64 max_hole_start;
|
2023-09-06 00:15:23 +08:00
|
|
|
u64 max_hole_size = 0;
|
2011-01-05 18:07:26 +08:00
|
|
|
u64 extent_end;
|
2008-03-25 03:01:56 +08:00
|
|
|
u64 search_end = device->total_bytes;
|
|
|
|
int ret;
|
2011-01-05 18:07:26 +08:00
|
|
|
int slot;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct extent_buffer *l;
|
Btrfs: fix fitrim discarding device area reserved for boot loader's use
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).
Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.
A reproducer test case for xfstests follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
# Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
# reserved for a boot loader to use (GRUB for example) and btrfs should never
# use them - neither for allocating metadata/data nor should trim/discard them.
# The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
$XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
$XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
# Now mount the filesystem and perform a fitrim against it.
_scratch_mount
_require_batched_discard $SCRATCH_MNT
$FSTRIM_PROG $SCRATCH_MNT
# Now unmount the filesystem and verify the content of the ranges was not
# modified (no trim/discard happened on them).
_scratch_unmount
echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
status=0
exit
Reported-by: Vincent Petry <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-01-07 06:42:35 +08:00
|
|
|
|
2023-07-26 23:57:09 +08:00
|
|
|
search_start = dev_extent_search_start(device);
|
2023-09-06 00:15:23 +08:00
|
|
|
max_hole_start = search_start;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2021-02-04 18:21:48 +08:00
|
|
|
WARN_ON(device->zone_info &&
|
|
|
|
!IS_ALIGNED(num_bytes, device->zone_info->zone_size));
|
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
path = btrfs_alloc_path();
|
2023-09-06 00:15:23 +08:00
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2015-02-16 18:52:17 +08:00
|
|
|
again:
|
2017-12-04 12:54:55 +08:00
|
|
|
if (search_start >= search_end ||
|
|
|
|
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
|
2011-01-05 18:07:26 +08:00
|
|
|
ret = -ENOSPC;
|
2013-06-28 01:22:46 +08:00
|
|
|
goto out;
|
2011-01-05 18:07:26 +08:00
|
|
|
}
|
|
|
|
|
2015-11-27 23:31:35 +08:00
|
|
|
path->reada = READA_FORWARD;
|
2013-06-28 01:22:46 +08:00
|
|
|
path->search_commit_root = 1;
|
|
|
|
path->skip_locking = 1;
|
2011-01-05 18:07:26 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
key.objectid = device->devid;
|
|
|
|
key.offset = search_start;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
2011-01-05 18:07:26 +08:00
|
|
|
|
2021-07-29 16:22:16 +08:00
|
|
|
ret = btrfs_search_backwards(root, &key, path);
|
2008-03-25 03:01:56 +08:00
|
|
|
if (ret < 0)
|
2011-01-05 18:07:26 +08:00
|
|
|
goto out;
|
|
|
|
|
2023-01-19 05:35:13 +08:00
|
|
|
while (search_start < search_end) {
|
2008-03-25 03:01:56 +08:00
|
|
|
l = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
if (slot >= btrfs_header_nritems(l)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret == 0)
|
|
|
|
continue;
|
|
|
|
if (ret < 0)
|
2011-01-05 18:07:26 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
btrfs_item_key_to_cpu(l, &key, slot);
|
|
|
|
|
|
|
|
if (key.objectid < device->devid)
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
if (key.objectid > device->devid)
|
2011-01-05 18:07:26 +08:00
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2014-06-05 00:41:45 +08:00
|
|
|
if (key.type != BTRFS_DEV_EXTENT_KEY)
|
2011-01-05 18:07:26 +08:00
|
|
|
goto next;
|
2009-07-25 04:41:41 +08:00
|
|
|
|
2023-01-19 05:35:13 +08:00
|
|
|
if (key.offset > search_end)
|
|
|
|
break;
|
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
if (key.offset > search_start) {
|
|
|
|
hole_size = key.offset - search_start;
|
2020-02-25 11:56:09 +08:00
|
|
|
dev_extent_hole_check(device, &search_start, &hole_size,
|
|
|
|
num_bytes);
|
2013-06-28 01:22:46 +08:00
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
if (hole_size > max_hole_size) {
|
|
|
|
max_hole_start = search_start;
|
|
|
|
max_hole_size = hole_size;
|
|
|
|
}
|
2009-07-25 04:41:41 +08:00
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
/*
|
|
|
|
* If this free space is greater than which we need,
|
|
|
|
* it must be the max free space that we have found
|
|
|
|
* until now, so max_hole_start must point to the start
|
|
|
|
* of this free space and the length of this free space
|
|
|
|
* is stored in max_hole_size. Thus, we return
|
|
|
|
* max_hole_start and max_hole_size and go back to the
|
|
|
|
* caller.
|
|
|
|
*/
|
|
|
|
if (hole_size >= num_bytes) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
|
2011-01-05 18:07:26 +08:00
|
|
|
extent_end = key.offset + btrfs_dev_extent_length(l,
|
|
|
|
dev_extent);
|
|
|
|
if (extent_end > search_start)
|
|
|
|
search_start = extent_end;
|
2008-03-25 03:01:56 +08:00
|
|
|
next:
|
|
|
|
path->slots[0]++;
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
|
Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-02 10:39:03 +08:00
|
|
|
/*
|
|
|
|
* At this point, search_start should be the end of
|
|
|
|
* allocated dev extents, and when shrinking the device,
|
|
|
|
* search_end may be smaller than search_start.
|
|
|
|
*/
|
2015-02-16 18:52:17 +08:00
|
|
|
if (search_end > search_start) {
|
Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-02 10:39:03 +08:00
|
|
|
hole_size = search_end - search_start;
|
2020-02-25 11:56:09 +08:00
|
|
|
if (dev_extent_hole_check(device, &search_start, &hole_size,
|
|
|
|
num_bytes)) {
|
2015-02-16 18:52:17 +08:00
|
|
|
btrfs_release_path(path);
|
|
|
|
goto again;
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2015-02-16 18:52:17 +08:00
|
|
|
if (hole_size > max_hole_size) {
|
|
|
|
max_hole_start = search_start;
|
|
|
|
max_hole_size = hole_size;
|
|
|
|
}
|
2013-06-28 01:22:46 +08:00
|
|
|
}
|
|
|
|
|
2011-01-05 18:07:26 +08:00
|
|
|
/* See above. */
|
2015-02-16 18:52:17 +08:00
|
|
|
if (max_hole_size < num_bytes)
|
2011-01-05 18:07:26 +08:00
|
|
|
ret = -ENOSPC;
|
|
|
|
else
|
|
|
|
ret = 0;
|
|
|
|
|
2023-01-19 05:35:13 +08:00
|
|
|
ASSERT(max_hole_start + max_hole_size <= search_end);
|
2011-01-05 18:07:26 +08:00
|
|
|
out:
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_free_path(path);
|
2011-01-05 18:07:26 +08:00
|
|
|
*start = max_hole_start;
|
2011-01-05 18:07:28 +08:00
|
|
|
if (len)
|
2011-01-05 18:07:26 +08:00
|
|
|
*len = max_hole_size;
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-12-02 22:54:17 +08:00
|
|
|
static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
|
2008-04-26 04:53:30 +08:00
|
|
|
struct btrfs_device *device,
|
2014-09-03 21:35:41 +08:00
|
|
|
u64 start, u64 *dev_extent_len)
|
2008-04-26 04:53:30 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2008-04-26 04:53:30 +08:00
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
2008-05-07 23:43:44 +08:00
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct extent_buffer *leaf = NULL;
|
|
|
|
struct btrfs_dev_extent *extent = NULL;
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = device->devid;
|
|
|
|
key.offset = start;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
2011-11-11 09:45:04 +08:00
|
|
|
again:
|
2008-04-26 04:53:30 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
2008-05-07 23:43:44 +08:00
|
|
|
if (ret > 0) {
|
|
|
|
ret = btrfs_previous_item(root, path, key.objectid,
|
|
|
|
BTRFS_DEV_EXTENT_KEY);
|
2011-05-19 15:03:42 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
2008-05-07 23:43:44 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
|
|
|
extent = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_dev_extent);
|
|
|
|
BUG_ON(found_key.offset > start || found_key.offset +
|
|
|
|
btrfs_dev_extent_length(leaf, extent) < start);
|
2011-11-11 09:45:04 +08:00
|
|
|
key = found_key;
|
|
|
|
btrfs_release_path(path);
|
|
|
|
goto again;
|
2008-05-07 23:43:44 +08:00
|
|
|
} else if (ret == 0) {
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
extent = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_dev_extent);
|
2012-03-12 23:03:00 +08:00
|
|
|
} else {
|
|
|
|
goto out;
|
2008-05-07 23:43:44 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2014-09-03 21:35:41 +08:00
|
|
|
*dev_extent_len = btrfs_dev_extent_length(leaf, extent);
|
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
ret = btrfs_del_item(trans, root, path);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (ret == 0)
|
2015-09-24 22:46:10 +08:00
|
|
|
set_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags);
|
2011-05-19 15:03:42 +08:00
|
|
|
out:
|
2008-04-26 04:53:30 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
static u64 find_next_chunk(struct btrfs_fs_info *fs_info)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2013-06-28 01:22:46 +08:00
|
|
|
struct rb_node *n;
|
|
|
|
u64 ret = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
read_lock(&fs_info->mapping_tree_lock);
|
|
|
|
n = rb_last(&fs_info->mapping_tree.rb_root);
|
2013-06-28 01:22:46 +08:00
|
|
|
if (n) {
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
|
|
|
|
map = rb_entry(n, struct btrfs_chunk_map, rb_node);
|
|
|
|
ret = map->start + map->chunk_len;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
read_unlock(&fs_info->mapping_tree_lock);
|
2013-06-28 01:22:46 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-08-12 19:33:01 +08:00
|
|
|
static noinline int find_next_devid(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 *devid_ret)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
|
2013-08-12 19:33:01 +08:00
|
|
|
ret = btrfs_search_slot(NULL, fs_info->chunk_root, &key, path, 0, 0);
|
2008-03-25 03:01:56 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
|
2019-08-27 15:40:44 +08:00
|
|
|
if (ret == 0) {
|
|
|
|
/* Corruption */
|
|
|
|
btrfs_err(fs_info, "corrupted chunk tree devid -1 matched");
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto error;
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-08-12 19:33:01 +08:00
|
|
|
ret = btrfs_previous_item(fs_info->chunk_root, path,
|
|
|
|
BTRFS_DEV_ITEMS_OBJECTID,
|
2008-03-25 03:01:56 +08:00
|
|
|
BTRFS_DEV_ITEM_KEY);
|
|
|
|
if (ret) {
|
2013-08-12 19:33:01 +08:00
|
|
|
*devid_ret = 1;
|
2008-03-25 03:01:56 +08:00
|
|
|
} else {
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &found_key,
|
|
|
|
path->slots[0]);
|
2013-08-12 19:33:01 +08:00
|
|
|
*devid_ret = found_key.offset + 1;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
error:
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_free_path(path);
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the device information is stored in the chunk root
|
|
|
|
* the btrfs_device struct should be fully filled in
|
|
|
|
*/
|
2017-11-06 16:36:15 +08:00
|
|
|
static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
|
2013-04-26 04:41:01 +08:00
|
|
|
struct btrfs_device *device)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_dev_item *dev_item;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
unsigned long ptr;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
2008-11-18 10:11:30 +08:00
|
|
|
key.offset = device->devid;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_reserve_chunk_metadata(trans, true);
|
2018-07-21 00:37:47 +08:00
|
|
|
ret = btrfs_insert_empty_item(trans, trans->fs_info->chunk_root, path,
|
|
|
|
&key, sizeof(*dev_item));
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_trans_release_chunk_metadata(trans);
|
2008-03-25 03:01:56 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
|
|
|
|
|
|
|
|
btrfs_set_device_id(leaf, dev_item, device->devid);
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_set_device_generation(leaf, dev_item, 0);
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_set_device_type(leaf, dev_item, device->type);
|
|
|
|
btrfs_set_device_io_align(leaf, dev_item, device->io_align);
|
|
|
|
btrfs_set_device_io_width(leaf, dev_item, device->io_width);
|
|
|
|
btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_set_device_total_bytes(leaf, dev_item,
|
|
|
|
btrfs_device_get_disk_total_bytes(device));
|
|
|
|
btrfs_set_device_bytes_used(leaf, dev_item,
|
|
|
|
btrfs_device_get_bytes_used(device));
|
2008-04-16 03:41:47 +08:00
|
|
|
btrfs_set_device_group(leaf, dev_item, 0);
|
|
|
|
btrfs_set_device_seek_speed(leaf, dev_item, 0);
|
|
|
|
btrfs_set_device_bandwidth(leaf, dev_item, 0);
|
2008-12-09 05:40:21 +08:00
|
|
|
btrfs_set_device_start_offset(leaf, dev_item, 0);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-08-20 19:20:11 +08:00
|
|
|
ptr = btrfs_device_uuid(dev_item);
|
2008-04-16 03:41:47 +08:00
|
|
|
write_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
|
2013-08-20 19:20:12 +08:00
|
|
|
ptr = btrfs_device_fsid(dev_item);
|
2018-10-30 22:43:24 +08:00
|
|
|
write_extent_buffer(leaf, trans->fs_info->fs_devices->metadata_uuid,
|
|
|
|
ptr, BTRFS_FSID_SIZE);
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_mark_buffer_dirty(trans, leaf);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2014-04-16 17:02:32 +08:00
|
|
|
/*
|
|
|
|
* Function to update ctime/mtime for a given device path.
|
|
|
|
* Mainly used for ctime/mtime based probe like libblkid.
|
2021-10-15 01:11:01 +08:00
|
|
|
*
|
|
|
|
* We don't care about errors here, this is just to be kind to userspace.
|
2014-04-16 17:02:32 +08:00
|
|
|
*/
|
2021-10-15 01:11:01 +08:00
|
|
|
static void update_dev_time(const char *device_path)
|
2014-04-16 17:02:32 +08:00
|
|
|
{
|
2021-10-15 01:11:01 +08:00
|
|
|
struct path path;
|
|
|
|
int ret;
|
2014-04-16 17:02:32 +08:00
|
|
|
|
2021-10-15 01:11:01 +08:00
|
|
|
ret = kern_path(device_path, LOOKUP_FOLLOW, &path);
|
|
|
|
if (ret)
|
2014-04-16 17:02:32 +08:00
|
|
|
return;
|
2021-07-28 05:01:16 +08:00
|
|
|
|
2023-08-08 03:38:39 +08:00
|
|
|
inode_update_time(d_inode(path.dentry), S_MTIME | S_CTIME | S_VERSION);
|
2021-10-15 01:11:01 +08:00
|
|
|
path_put(&path);
|
2014-04-16 17:02:32 +08:00
|
|
|
}
|
|
|
|
|
2022-03-08 13:36:38 +08:00
|
|
|
static int btrfs_rm_dev_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_device *device)
|
2008-05-07 23:43:44 +08:00
|
|
|
{
|
2019-03-20 23:31:53 +08:00
|
|
|
struct btrfs_root *root = device->fs_info->chunk_root;
|
2008-05-07 23:43:44 +08:00
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
|
|
|
key.offset = device->devid;
|
|
|
|
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_reserve_chunk_metadata(trans, false);
|
2008-05-07 23:43:44 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_trans_release_chunk_metadata(trans);
|
2017-10-23 14:58:46 +08:00
|
|
|
if (ret) {
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
2008-05-07 23:43:44 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-02-15 23:00:26 +08:00
|
|
|
/*
|
|
|
|
* Verify that @num_devices satisfies the RAID profile constraints in the whole
|
|
|
|
* filesystem. It's up to the caller to adjust that number regarding eg. device
|
|
|
|
* replace.
|
|
|
|
*/
|
|
|
|
static int btrfs_check_raid_min_devices(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 num_devices)
|
2008-05-07 23:43:44 +08:00
|
|
|
{
|
|
|
|
u64 all_avail;
|
2013-01-29 18:13:12 +08:00
|
|
|
unsigned seq;
|
2016-02-15 23:28:14 +08:00
|
|
|
int i;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2013-01-29 18:13:12 +08:00
|
|
|
do {
|
2016-02-13 10:01:34 +08:00
|
|
|
seq = read_seqbegin(&fs_info->profiles_lock);
|
2013-01-29 18:13:12 +08:00
|
|
|
|
2016-02-13 10:01:34 +08:00
|
|
|
all_avail = fs_info->avail_data_alloc_bits |
|
|
|
|
fs_info->avail_system_alloc_bits |
|
|
|
|
fs_info->avail_metadata_alloc_bits;
|
|
|
|
} while (read_seqretry(&fs_info->profiles_lock, seq));
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2016-02-15 23:28:14 +08:00
|
|
|
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
|
2018-04-25 19:01:43 +08:00
|
|
|
if (!(all_avail & btrfs_raid_array[i].bg_flag))
|
2016-02-15 23:28:14 +08:00
|
|
|
continue;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2021-07-28 07:03:05 +08:00
|
|
|
if (num_devices < btrfs_raid_array[i].devs_min)
|
|
|
|
return btrfs_raid_array[i].mindev_error;
|
2013-01-30 07:40:14 +08:00
|
|
|
}
|
|
|
|
|
2016-02-13 10:01:34 +08:00
|
|
|
return 0;
|
2016-02-13 10:01:33 +08:00
|
|
|
}
|
|
|
|
|
2017-08-23 14:46:04 +08:00
|
|
|
static struct btrfs_device * btrfs_find_next_active_device(
|
|
|
|
struct btrfs_fs_devices *fs_devs, struct btrfs_device *device)
|
2008-05-07 23:43:44 +08:00
|
|
|
{
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_device *next_device;
|
2016-05-03 17:44:43 +08:00
|
|
|
|
|
|
|
list_for_each_entry(next_device, &fs_devs->devices, dev_list) {
|
|
|
|
if (next_device != device &&
|
2017-12-04 12:54:54 +08:00
|
|
|
!test_bit(BTRFS_DEV_STATE_MISSING, &next_device->dev_state)
|
|
|
|
&& next_device->bdev)
|
2016-05-03 17:44:43 +08:00
|
|
|
return next_device;
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2021-08-24 13:05:19 +08:00
|
|
|
* Helper function to check if the given device is part of s_bdev / latest_dev
|
2016-05-03 17:44:43 +08:00
|
|
|
* and replace it with the provided or the next active device, in the context
|
|
|
|
* where this function called, there should be always be another device (or
|
|
|
|
* this_dev) which is active.
|
|
|
|
*/
|
2019-10-02 01:57:35 +08:00
|
|
|
void __cold btrfs_assign_next_active_device(struct btrfs_device *device,
|
2020-09-05 01:34:34 +08:00
|
|
|
struct btrfs_device *next_device)
|
2016-05-03 17:44:43 +08:00
|
|
|
{
|
2018-07-21 00:37:50 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
2016-05-03 17:44:43 +08:00
|
|
|
|
2020-09-05 01:34:34 +08:00
|
|
|
if (!next_device)
|
2016-05-03 17:44:43 +08:00
|
|
|
next_device = btrfs_find_next_active_device(fs_info->fs_devices,
|
2020-09-05 01:34:34 +08:00
|
|
|
device);
|
2016-05-03 17:44:43 +08:00
|
|
|
ASSERT(next_device);
|
|
|
|
|
|
|
|
if (fs_info->sb->s_bdev &&
|
|
|
|
(fs_info->sb->s_bdev == device->bdev))
|
|
|
|
fs_info->sb->s_bdev = next_device->bdev;
|
|
|
|
|
2021-08-24 13:05:19 +08:00
|
|
|
if (fs_info->fs_devices->latest_dev->bdev == device->bdev)
|
|
|
|
fs_info->fs_devices->latest_dev = next_device;
|
2016-05-03 17:44:43 +08:00
|
|
|
}
|
|
|
|
|
2018-08-10 13:53:21 +08:00
|
|
|
/*
|
|
|
|
* Return btrfs_fs_devices::num_devices excluding the device that's being
|
|
|
|
* currently replaced.
|
|
|
|
*/
|
|
|
|
static u64 btrfs_num_devices(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
u64 num_devices = fs_info->fs_devices->num_devices;
|
|
|
|
|
2018-09-07 22:11:23 +08:00
|
|
|
down_read(&fs_info->dev_replace.rwsem);
|
2018-08-10 13:53:21 +08:00
|
|
|
if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
|
|
|
|
ASSERT(num_devices > 1);
|
|
|
|
num_devices--;
|
|
|
|
}
|
2018-09-07 22:11:23 +08:00
|
|
|
up_read(&fs_info->dev_replace.rwsem);
|
2018-08-10 13:53:21 +08:00
|
|
|
|
|
|
|
return num_devices;
|
|
|
|
}
|
|
|
|
|
2022-11-22 01:47:48 +08:00
|
|
|
static void btrfs_scratch_superblock(struct btrfs_fs_info *fs_info,
|
|
|
|
struct block_device *bdev, int copy_num)
|
|
|
|
{
|
|
|
|
struct btrfs_super_block *disk_super;
|
2022-11-22 01:47:49 +08:00
|
|
|
const size_t len = sizeof(disk_super->magic);
|
|
|
|
const u64 bytenr = btrfs_sb_offset(copy_num);
|
2022-11-22 01:47:48 +08:00
|
|
|
int ret;
|
|
|
|
|
2022-11-22 01:47:49 +08:00
|
|
|
disk_super = btrfs_read_disk_super(bdev, bytenr, bytenr);
|
2022-11-22 01:47:48 +08:00
|
|
|
if (IS_ERR(disk_super))
|
|
|
|
return;
|
|
|
|
|
2022-11-22 01:47:49 +08:00
|
|
|
memset(&disk_super->magic, 0, len);
|
|
|
|
folio_mark_dirty(virt_to_folio(disk_super));
|
|
|
|
btrfs_release_disk_super(disk_super);
|
|
|
|
|
|
|
|
ret = sync_blockdev_range(bdev, bytenr, bytenr + len - 1);
|
2022-11-22 01:47:48 +08:00
|
|
|
if (ret)
|
|
|
|
btrfs_warn(fs_info, "error clearing superblock number %d (%d)",
|
|
|
|
copy_num, ret);
|
|
|
|
}
|
|
|
|
|
2024-02-22 16:51:33 +08:00
|
|
|
void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info, struct btrfs_device *device)
|
2020-02-13 23:24:31 +08:00
|
|
|
{
|
|
|
|
int copy_num;
|
2024-02-22 16:51:33 +08:00
|
|
|
struct block_device *bdev = device->bdev;
|
2020-02-13 23:24:31 +08:00
|
|
|
|
|
|
|
if (!bdev)
|
|
|
|
return;
|
|
|
|
|
|
|
|
for (copy_num = 0; copy_num < BTRFS_SUPER_MIRROR_MAX; copy_num++) {
|
2022-11-22 01:47:48 +08:00
|
|
|
if (bdev_is_zoned(bdev))
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 19:26:14 +08:00
|
|
|
btrfs_reset_sb_log_zones(bdev, copy_num);
|
2022-11-22 01:47:48 +08:00
|
|
|
else
|
|
|
|
btrfs_scratch_superblock(fs_info, bdev, copy_num);
|
2020-02-13 23:24:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Notify udev that device has changed */
|
|
|
|
btrfs_kobject_uevent(bdev, KOBJ_CHANGE);
|
|
|
|
|
|
|
|
/* Update ctime/mtime for device path for libblkid */
|
2024-02-22 16:51:33 +08:00
|
|
|
update_dev_time(device->name->str);
|
2020-02-13 23:24:31 +08:00
|
|
|
}
|
|
|
|
|
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.
However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb
Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 04:12:44 +08:00
|
|
|
int btrfs_rm_device(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_dev_lookup_args *args,
|
2024-01-23 21:26:36 +08:00
|
|
|
struct file **bdev_file)
|
2016-02-13 10:01:33 +08:00
|
|
|
{
|
2022-03-08 13:36:38 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2016-02-13 10:01:33 +08:00
|
|
|
struct btrfs_device *device;
|
2011-04-20 18:09:16 +08:00
|
|
|
struct btrfs_fs_devices *cur_devices;
|
2018-04-12 10:29:30 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
u64 num_devices;
|
2008-05-07 23:43:44 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2021-12-16 04:40:00 +08:00
|
|
|
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
|
|
|
|
btrfs_err(fs_info, "device remove not supported on extent tree v2 yet");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
btrfs: do not take the uuid_mutex in btrfs_rm_device
We got the following lockdep splat while running fstests (specifically
btrfs/003 and btrfs/020 in a row) with the new rc. This was uncovered
by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which
converted loop to using workqueues, which comes with lockdep
annotations that don't exist with kworkers. The lockdep splat is as
follows:
WARNING: possible circular locking dependency detected
5.14.0-rc2-custom+ #34 Not tainted
------------------------------------------------------
losetup/156417 is trying to acquire lock:
ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600
but task is already holding lock:
ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0xba/0x7c0
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x28/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x163/0x3a0
path_openat+0x74d/0xa40
do_filp_open+0x9c/0x140
do_sys_openat2+0xb1/0x170
__x64_sys_openat+0x54/0x90
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #4 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0xba/0x7c0
blkdev_get_by_dev.part.0+0xd1/0x3c0
blkdev_get_by_path+0xc0/0xd0
btrfs_scan_one_device+0x52/0x1f0 [btrfs]
btrfs_control_ioctl+0xac/0x170 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (uuid_mutex){+.+.}-{3:3}:
__mutex_lock+0xba/0x7c0
btrfs_rm_device+0x48/0x6a0 [btrfs]
btrfs_ioctl+0x2d1c/0x3110 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#11){.+.+}-{0:0}:
lo_write_bvec+0x112/0x290 [loop]
loop_process_work+0x25f/0xcb0 [loop]
process_one_work+0x28f/0x5d0
worker_thread+0x55/0x3c0
kthread+0x140/0x170
ret_from_fork+0x22/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x266/0x5d0
worker_thread+0x55/0x3c0
kthread+0x140/0x170
ret_from_fork+0x22/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x1130/0x1dc0
lock_acquire+0xf5/0x320
flush_workqueue+0xae/0x600
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x650 [loop]
lo_ioctl+0x29d/0x780 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/156417:
#0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
stack backtrace:
CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0x10a/0x120
__lock_acquire+0x1130/0x1dc0
lock_acquire+0xf5/0x320
? flush_workqueue+0x84/0x600
flush_workqueue+0xae/0x600
? flush_workqueue+0x84/0x600
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x650 [loop]
lo_ioctl+0x29d/0x780 [loop]
? __lock_acquire+0x3a0/0x1dc0
? update_dl_rq_load_avg+0x152/0x360
? lock_is_held_type+0xa5/0x120
? find_held_lock.constprop.0+0x2b/0x80
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f645884de6b
Usually the uuid_mutex exists to protect the fs_devices that map
together all of the devices that match a specific uuid. In rm_device
we're messing with the uuid of a device, so it makes sense to protect
that here.
However in doing that it pulls in a whole host of lockdep dependencies,
as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
we end up with the dependency chain under the uuid_mutex being added
under the normal sb write dependency chain, which causes problems with
loop devices.
We don't need the uuid mutex here however. If we call
btrfs_scan_one_device() before we scratch the super block we will find
the fs_devices and not find the device itself and return EBUSY because
the fs_devices is open. If we call it after the scratch happens it will
not appear to be a valid btrfs file system.
We do not need to worry about other fs_devices modifying operations here
because we're protected by the exclusive operations locking.
So drop the uuid_mutex here in order to fix the lockdep splat.
A more detailed explanation from the discussion:
We are worried about rm and scan racing with each other, before this
change we'll zero the device out under the UUID mutex so when scan does
run it'll make sure that it can go through the whole device scan thing
without rm messing with us.
We aren't worried if the scratch happens first, because the result is we
don't think this is a btrfs device and we bail out.
The only case we are concerned with is we scratch _after_ scan is able
to read the superblock and gets a seemingly valid super block, so lets
consider this case.
Scan will call device_list_add() with the device we're removing. We'll
call find_fsid_with_metadata_uuid() and get our fs_devices for this
UUID. At this point we lock the fs_devices->device_list_mutex. This is
what protects us in this case, but we have two cases here.
1. We aren't to the device removal part of the RM. We found our device,
and device name matches our path, we go down and we set total_devices
to our super number of devices, which doesn't affect anything because
we haven't done the remove yet.
2. We are past the device removal part, which is protected by the
device_list_mutex. Scan doesn't find the device, it goes down and
does the
if (fs_devices->opened)
return -EBUSY;
check and we bail out.
Nothing about this situation is ideal, but the lockdep splat is real,
and the fix is safe, tho admittedly a bit scary looking.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more from the discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-28 05:01:14 +08:00
|
|
|
/*
|
|
|
|
* The device list in fs_devices is accessed without locks (neither
|
|
|
|
* uuid_mutex nor device_list_mutex) as it won't change on a mounted
|
|
|
|
* filesystem and another device rm cannot run.
|
|
|
|
*/
|
2018-08-10 13:53:21 +08:00
|
|
|
num_devices = btrfs_num_devices(fs_info);
|
2012-11-06 20:15:27 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
|
2016-02-13 10:01:33 +08:00
|
|
|
if (ret)
|
2022-03-08 13:36:38 +08:00
|
|
|
return ret;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.
However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb
Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 04:12:44 +08:00
|
|
|
device = btrfs_find_device(fs_info->fs_devices, args);
|
|
|
|
if (!device) {
|
|
|
|
if (args->missing)
|
2018-09-03 17:46:14 +08:00
|
|
|
ret = BTRFS_ERROR_DEV_MISSING_NOT_FOUND;
|
|
|
|
else
|
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.
However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb
Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 04:12:44 +08:00
|
|
|
ret = -ENOENT;
|
2022-03-08 13:36:38 +08:00
|
|
|
return ret;
|
2018-09-03 17:46:14 +08:00
|
|
|
}
|
2008-05-14 01:46:40 +08:00
|
|
|
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
if (btrfs_pinned_by_swapfile(fs_info, device)) {
|
|
|
|
btrfs_warn_in_rcu(fs_info,
|
|
|
|
"cannot remove device %s (devid %llu) due to active swapfile",
|
2022-11-13 09:32:07 +08:00
|
|
|
btrfs_dev_name(device), device->devid);
|
2022-03-08 13:36:38 +08:00
|
|
|
return -ETXTBSY;
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
}
|
|
|
|
|
2022-03-08 13:36:38 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
|
|
|
|
return BTRFS_ERROR_DEV_TGT_REPLACE;
|
2012-11-06 01:29:28 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
|
2022-03-08 13:36:38 +08:00
|
|
|
fs_info->fs_devices->rw_devices == 1)
|
|
|
|
return BTRFS_ERROR_DEV_ONLY_WRITABLE;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
list_del_init(&device->dev_alloc_list);
|
2014-09-03 21:35:47 +08:00
|
|
|
device->fs_devices->rw_devices--;
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2008-05-07 23:43:44 +08:00
|
|
|
|
|
|
|
ret = btrfs_shrink_device(device, 0);
|
|
|
|
if (ret)
|
2011-02-16 02:14:25 +08:00
|
|
|
goto error_undo;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2022-03-08 13:36:38 +08:00
|
|
|
trans = btrfs_start_transaction(fs_info->chunk_root, 0);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
2011-02-16 02:14:25 +08:00
|
|
|
goto error_undo;
|
2022-03-08 13:36:38 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_rm_dev_item(trans, device);
|
|
|
|
if (ret) {
|
|
|
|
/* Any error in dev item removal is critical */
|
|
|
|
btrfs_crit(fs_info,
|
|
|
|
"failed to remove device item for devid %llu: %d",
|
|
|
|
device->devid, ret);
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
return ret;
|
|
|
|
}
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2017-12-04 12:54:53 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
|
2019-03-20 23:32:55 +08:00
|
|
|
btrfs_scrub_cancel_dev(device);
|
2009-06-11 03:17:02 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* the device list mutex makes sure that we don't change
|
|
|
|
* the device list while someone else is writing out all
|
Btrfs: fix race between removing a dev and writing sbs
This change fixes an issue when removing a device and writing
all super blocks run simultaneously. Here's the steps necessary
for the issue to happen:
1) disk-io.c:write_all_supers() gets a number of N devices from the
super_copy, so it will not panic if it fails to write super blocks
for N - 1 devices;
2) Then it tries to acquire the device_list_mutex, but blocks because
volumes.c:btrfs_rm_device() got it first;
3) btrfs_rm_device() removes the device from the list, then unlocks the
mutex and after the unlock it updates the number of devices in
super_copy to N - 1.
4) write_all_supers() finally acquires the mutex, iterates over all the
devices in the list and gets N - 1 errors, that is, it failed to write
super blocks to all the devices;
5) Because write_all_supers() thinks there are a total of N devices, it
considers N - 1 errors to be ok, and therefore won't panic.
So this change just makes sure that write_all_supers() reads the number
of devices from super_copy after it acquires the device_list_mutex.
Conversely, it changes btrfs_rm_device() to update the number of devices
in super_copy before it releases the device list mutex.
The code path to add a new device (volumes.c:btrfs_init_new_device),
already has the right behaviour: it updates the number of devices in
super_copy while holding the device_list_mutex.
The only code path that doesn't lock the device list mutex
before updating the number of devices in the super copy is
disk-io.c:next_root_backup(), called by open_ctree() during
mount time where concurrency issues can't happen.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 22:41:36 +08:00
|
|
|
* the device supers. Whoever is writing all supers, should
|
|
|
|
* lock the device list mutex before getting the number of
|
|
|
|
* devices in the super block (super_copy). Conversely,
|
|
|
|
* whoever updates the number of devices in the super block
|
|
|
|
* (super_copy) should hold the device list mutex.
|
2009-06-11 03:17:02 +08:00
|
|
|
*/
|
2011-04-20 18:09:16 +08:00
|
|
|
|
2018-04-12 10:29:31 +08:00
|
|
|
/*
|
|
|
|
* In normal cases the cur_devices == fs_devices. But in case
|
|
|
|
* of deleting a seed device, the cur_devices should point to
|
2021-08-18 12:15:48 +08:00
|
|
|
* its own fs_devices listed under the fs_devices->seed_list.
|
2018-04-12 10:29:31 +08:00
|
|
|
*/
|
2011-04-20 18:09:16 +08:00
|
|
|
cur_devices = device->fs_devices;
|
2018-04-12 10:29:30 +08:00
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2011-04-20 18:09:16 +08:00
|
|
|
list_del_rcu(&device->dev_list);
|
2009-06-11 03:17:02 +08:00
|
|
|
|
2018-04-12 10:29:31 +08:00
|
|
|
cur_devices->num_devices--;
|
|
|
|
cur_devices->total_devices--;
|
2018-07-03 17:07:23 +08:00
|
|
|
/* Update total_devices of the parent fs_devices if it's seed */
|
|
|
|
if (cur_devices != fs_devices)
|
|
|
|
fs_devices->total_devices--;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
|
2018-04-12 10:29:31 +08:00
|
|
|
cur_devices->missing_devices--;
|
2010-12-14 03:56:23 +08:00
|
|
|
|
2018-07-21 00:37:50 +08:00
|
|
|
btrfs_assign_next_active_device(device, NULL);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
if (device->bdev_file) {
|
2018-04-12 10:29:31 +08:00
|
|
|
cur_devices->open_devices--;
|
2014-07-08 01:34:49 +08:00
|
|
|
/* remove sysfs entry */
|
2020-09-05 01:34:27 +08:00
|
|
|
btrfs_sysfs_remove_device(device);
|
2014-07-08 01:34:49 +08:00
|
|
|
}
|
2014-06-03 11:36:00 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
num_devices = btrfs_super_num_devices(fs_info->super_copy) - 1;
|
|
|
|
btrfs_set_super_num_devices(fs_info->super_copy, num_devices);
|
2018-04-12 10:29:30 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2016-09-20 20:50:21 +08:00
|
|
|
/*
|
2021-07-28 05:01:17 +08:00
|
|
|
* At this point, the device is zero sized and detached from the
|
|
|
|
* devices list. All that's left is to zero out the old supers and
|
|
|
|
* free the device.
|
|
|
|
*
|
|
|
|
* We cannot call btrfs_close_bdev() here because we're holding the sb
|
2024-01-23 21:26:36 +08:00
|
|
|
* write lock, and fput() on the block device will pull in the
|
|
|
|
* ->open_mutex on the block device and it's dependencies. Instead
|
|
|
|
* just flush the device and let the caller do the final bdev_release.
|
2016-09-20 20:50:21 +08:00
|
|
|
*/
|
2021-07-28 05:01:17 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2024-02-22 16:51:33 +08:00
|
|
|
btrfs_scratch_superblocks(fs_info, device);
|
2021-07-28 05:01:17 +08:00
|
|
|
if (device->bdev) {
|
|
|
|
sync_blockdev(device->bdev);
|
|
|
|
invalidate_bdev(device->bdev);
|
|
|
|
}
|
|
|
|
}
|
2016-09-20 20:50:21 +08:00
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
*bdev_file = device->bdev_file;
|
2019-03-27 20:24:11 +08:00
|
|
|
synchronize_rcu();
|
|
|
|
btrfs_free_device(device);
|
2016-09-20 20:50:21 +08:00
|
|
|
|
2021-10-06 04:12:41 +08:00
|
|
|
/*
|
|
|
|
* This can happen if cur_devices is the private seed devices list. We
|
|
|
|
* cannot call close_fs_devices() here because it expects the uuid_mutex
|
|
|
|
* to be held, but in fact we don't need that for the private
|
|
|
|
* seed_devices, we can simply decrement cur_devices->opened and then
|
|
|
|
* remove it from our list and free the fs_devices.
|
|
|
|
*/
|
2021-10-06 04:12:39 +08:00
|
|
|
if (cur_devices->num_devices == 0) {
|
2020-07-16 15:25:33 +08:00
|
|
|
list_del_init(&cur_devices->seed_list);
|
2021-10-06 04:12:41 +08:00
|
|
|
ASSERT(cur_devices->opened == 1);
|
|
|
|
cur_devices->opened--;
|
2011-04-20 18:09:16 +08:00
|
|
|
free_fs_devices(cur_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2022-03-08 13:36:38 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
return ret;
|
2016-02-13 10:01:36 +08:00
|
|
|
|
2011-02-16 02:14:25 +08:00
|
|
|
error_undo:
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2011-02-16 02:14:25 +08:00
|
|
|
list_add(&device->dev_alloc_list,
|
2018-04-12 10:29:30 +08:00
|
|
|
&fs_devices->alloc_list);
|
2014-09-03 21:35:47 +08:00
|
|
|
device->fs_devices->rw_devices++;
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2011-02-16 02:14:25 +08:00
|
|
|
}
|
2022-03-08 13:36:38 +08:00
|
|
|
return ret;
|
2008-05-07 23:43:44 +08:00
|
|
|
}
|
|
|
|
|
2018-07-21 00:37:48 +08:00
|
|
|
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev)
|
2012-11-06 00:33:06 +08:00
|
|
|
{
|
2014-08-13 14:24:19 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
|
2018-07-21 00:37:48 +08:00
|
|
|
lockdep_assert_held(&srcdev->fs_info->fs_devices->device_list_mutex);
|
2013-10-03 01:41:01 +08:00
|
|
|
|
2014-08-20 10:56:56 +08:00
|
|
|
/*
|
|
|
|
* in case of fs with no seed, srcdev->fs_devices will point
|
|
|
|
* to fs_devices of fs_info. However when the dev being replaced is
|
|
|
|
* a seed dev it will point to the seed's local fs_devices. In short
|
|
|
|
* srcdev will have its correct fs_devices in both the cases.
|
|
|
|
*/
|
|
|
|
fs_devices = srcdev->fs_devices;
|
2014-08-13 14:24:19 +08:00
|
|
|
|
2012-11-06 00:33:06 +08:00
|
|
|
list_del_rcu(&srcdev->dev_list);
|
2017-06-19 20:14:22 +08:00
|
|
|
list_del(&srcdev->dev_alloc_list);
|
2014-08-13 14:24:19 +08:00
|
|
|
fs_devices->num_devices--;
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state))
|
2014-08-13 14:24:19 +08:00
|
|
|
fs_devices->missing_devices--;
|
2012-11-06 00:33:06 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state))
|
2014-09-03 21:35:44 +08:00
|
|
|
fs_devices->rw_devices--;
|
2013-10-03 01:41:01 +08:00
|
|
|
|
2014-09-03 21:35:44 +08:00
|
|
|
if (srcdev->bdev)
|
2014-08-13 14:24:19 +08:00
|
|
|
fs_devices->open_devices--;
|
2014-10-30 16:52:31 +08:00
|
|
|
}
|
|
|
|
|
2019-03-20 23:34:54 +08:00
|
|
|
void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev)
|
2014-10-30 16:52:31 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = srcdev->fs_devices;
|
2012-11-06 00:33:06 +08:00
|
|
|
|
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
When closing and freeing the source device we could end up doing our
final blkdev_put() on the bdev, which will grab the bd_mutex. As such
we want to be holding as few locks as possible, so move this call
outside of the dev_replace->lock_finishing_cancel_unmount lock. Since
we're modifying the fs_devices we need to make sure we're holding the
uuid_mutex here, so take that as well.
There's a report from syzbot probably hitting one of the cases where
the bd_mutex and device_list_mutex are taken in the wrong order, however
it's not with device replace, like this patch fixes. As there's no
reproducer available so far, we can't verify the fix.
https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/
dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419
WARNING: possible circular locking dependency detected
5.9.0-rc5-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.0/6878 is trying to acquire lock:
ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804
but task is already holding lock:
ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255
btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109
__btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916
find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline]
find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127
btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
__extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
wb_do_writeback fs/fs-writeback.c:2039 [inline]
wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
-> #3 (sb_internal#2){.+.+}-{0:0}:
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x234/0x470 fs/super.c:1672
sb_start_intwrite include/linux/fs.h:1690 [inline]
start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624
find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline]
find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127
btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
__extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
wb_do_writeback fs/fs-writeback.c:2039 [inline]
wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
-> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
__flush_work+0x60e/0xac0 kernel/workqueue.c:3041
wb_shutdown+0x180/0x220 mm/backing-dev.c:355
bdi_unregister+0x174/0x590 mm/backing-dev.c:872
del_gendisk+0x820/0xa10 block/genhd.c:933
loop_remove drivers/block/loop.c:2192 [inline]
loop_control_ioctl drivers/block/loop.c:2291 [inline]
loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (loop_ctl_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
lo_open+0x19/0xd0 drivers/block/loop.c:1893
__blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507
blkdev_get fs/block_dev.c:1639 [inline]
blkdev_open+0x227/0x300 fs/block_dev.c:1753
do_dentry_open+0x4b9/0x11b0 fs/open.c:817
do_open fs/namei.c:3251 [inline]
path_openat+0x1b9a/0x2730 fs/namei.c:3368
do_filp_open+0x17e/0x3c0 fs/namei.c:3395
do_sys_openat2+0x16d/0x420 fs/open.c:1168
do_sys_open fs/open.c:1184 [inline]
__do_sys_open fs/open.c:1192 [inline]
__se_sys_open fs/open.c:1188 [inline]
__x64_sys_open+0x119/0x1c0 fs/open.c:1188
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&bdev->bd_mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
blkdev_put+0x30/0x520 fs/block_dev.c:1804
btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
generic_shutdown_super+0x144/0x370 fs/super.c:464
kill_anon_super+0x36/0x60 fs/super.c:1108
btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
deactivate_locked_super+0x94/0x160 fs/super.c:335
deactivate_super+0xad/0xd0 fs/super.c:366
cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_devs->device_list_mutex);
lock(sb_internal#2);
lock(&fs_devs->device_list_mutex);
lock(&bdev->bd_mutex);
*** DEADLOCK ***
3 locks held by syz-executor.0/6878:
#0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365
#1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178
#2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
stack backtrace:
CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
blkdev_put+0x30/0x520 fs/block_dev.c:1804
btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
generic_shutdown_super+0x144/0x370 fs/super.c:464
kill_anon_super+0x36/0x60 fs/super.c:1108
btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
deactivate_locked_super+0x94/0x160 fs/super.c:335
deactivate_super+0xad/0xd0 fs/super.c:366
cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x460027
RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027
RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0
RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b
R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460
R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add syzbot reference ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-20 23:18:27 +08:00
|
|
|
mutex_lock(&uuid_mutex);
|
|
|
|
|
2016-07-22 06:04:53 +08:00
|
|
|
btrfs_close_bdev(srcdev);
|
2019-03-27 20:24:11 +08:00
|
|
|
synchronize_rcu();
|
|
|
|
btrfs_free_device(srcdev);
|
2014-08-13 14:24:22 +08:00
|
|
|
|
|
|
|
/* if this is no devs we rather delete the fs_devices */
|
|
|
|
if (!fs_devices->num_devices) {
|
2017-10-17 06:53:50 +08:00
|
|
|
/*
|
|
|
|
* On a mounted FS, num_devices can't be zero unless it's a
|
|
|
|
* seed. In case of a seed device being replaced, the replace
|
|
|
|
* target added to the sprout FS, so there will be no more
|
|
|
|
* device left under the seed FS.
|
|
|
|
*/
|
|
|
|
ASSERT(fs_devices->seeding);
|
|
|
|
|
2020-07-16 15:25:33 +08:00
|
|
|
list_del_init(&fs_devices->seed_list);
|
2018-04-12 10:29:27 +08:00
|
|
|
close_fs_devices(fs_devices);
|
2014-08-13 14:24:23 +08:00
|
|
|
free_fs_devices(fs_devices);
|
2014-08-13 14:24:22 +08:00
|
|
|
}
|
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
When closing and freeing the source device we could end up doing our
final blkdev_put() on the bdev, which will grab the bd_mutex. As such
we want to be holding as few locks as possible, so move this call
outside of the dev_replace->lock_finishing_cancel_unmount lock. Since
we're modifying the fs_devices we need to make sure we're holding the
uuid_mutex here, so take that as well.
There's a report from syzbot probably hitting one of the cases where
the bd_mutex and device_list_mutex are taken in the wrong order, however
it's not with device replace, like this patch fixes. As there's no
reproducer available so far, we can't verify the fix.
https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/
dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419
WARNING: possible circular locking dependency detected
5.9.0-rc5-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.0/6878 is trying to acquire lock:
ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804
but task is already holding lock:
ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255
btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109
__btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916
find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline]
find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127
btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
__extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
wb_do_writeback fs/fs-writeback.c:2039 [inline]
wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
-> #3 (sb_internal#2){.+.+}-{0:0}:
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x234/0x470 fs/super.c:1672
sb_start_intwrite include/linux/fs.h:1690 [inline]
start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624
find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline]
find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127
btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
__extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
do_writepages+0xec/0x290 mm/page-writeback.c:2352
__writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
wb_do_writeback fs/fs-writeback.c:2039 [inline]
wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
-> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
__flush_work+0x60e/0xac0 kernel/workqueue.c:3041
wb_shutdown+0x180/0x220 mm/backing-dev.c:355
bdi_unregister+0x174/0x590 mm/backing-dev.c:872
del_gendisk+0x820/0xa10 block/genhd.c:933
loop_remove drivers/block/loop.c:2192 [inline]
loop_control_ioctl drivers/block/loop.c:2291 [inline]
loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (loop_ctl_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
lo_open+0x19/0xd0 drivers/block/loop.c:1893
__blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507
blkdev_get fs/block_dev.c:1639 [inline]
blkdev_open+0x227/0x300 fs/block_dev.c:1753
do_dentry_open+0x4b9/0x11b0 fs/open.c:817
do_open fs/namei.c:3251 [inline]
path_openat+0x1b9a/0x2730 fs/namei.c:3368
do_filp_open+0x17e/0x3c0 fs/namei.c:3395
do_sys_openat2+0x16d/0x420 fs/open.c:1168
do_sys_open fs/open.c:1184 [inline]
__do_sys_open fs/open.c:1192 [inline]
__se_sys_open fs/open.c:1188 [inline]
__x64_sys_open+0x119/0x1c0 fs/open.c:1188
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&bdev->bd_mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
blkdev_put+0x30/0x520 fs/block_dev.c:1804
btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
generic_shutdown_super+0x144/0x370 fs/super.c:464
kill_anon_super+0x36/0x60 fs/super.c:1108
btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
deactivate_locked_super+0x94/0x160 fs/super.c:335
deactivate_super+0xad/0xd0 fs/super.c:366
cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_devs->device_list_mutex);
lock(sb_internal#2);
lock(&fs_devs->device_list_mutex);
lock(&bdev->bd_mutex);
*** DEADLOCK ***
3 locks held by syz-executor.0/6878:
#0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365
#1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178
#2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
stack backtrace:
CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
blkdev_put+0x30/0x520 fs/block_dev.c:1804
btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
generic_shutdown_super+0x144/0x370 fs/super.c:464
kill_anon_super+0x36/0x60 fs/super.c:1108
btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
deactivate_locked_super+0x94/0x160 fs/super.c:335
deactivate_super+0xad/0xd0 fs/super.c:366
cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x460027
RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027
RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0
RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b
R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460
R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add syzbot reference ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-20 23:18:27 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
2012-11-06 00:33:06 +08:00
|
|
|
}
|
|
|
|
|
2018-07-21 00:37:51 +08:00
|
|
|
void btrfs_destroy_dev_replace_tgtdev(struct btrfs_device *tgtdev)
|
2012-11-06 00:33:06 +08:00
|
|
|
{
|
2018-07-21 00:37:51 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = tgtdev->fs_info->fs_devices;
|
2018-04-12 10:29:38 +08:00
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2015-03-10 06:38:42 +08:00
|
|
|
|
2020-09-05 01:34:27 +08:00
|
|
|
btrfs_sysfs_remove_device(tgtdev);
|
2015-03-10 06:38:42 +08:00
|
|
|
|
btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device will be taken
out of fs device list, scratch + update_dev_time and freed. However
we could do the scratch + update_dev_time and free part after the
device has been taken out of device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks.
Reported issue:
[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861] (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907] (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940] [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945] [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948] [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951] [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955] [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959] [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962] [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974] [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977] [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979] [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982] [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985] [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001] [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007] [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010] [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233] sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234] Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234] CPU0 CPU1
[ 5375.719235] ---- ----
[ 5375.719236] lock(&fs_devs->device_list_mutex);
[ 5375.719238] lock(namespace_sem);
[ 5375.719239] lock(&fs_devs->device_list_mutex);
[ 5375.719241] lock(sb_writers);
[ 5375.719241]
[ 5375.719241] *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-18 16:51:23 +08:00
|
|
|
if (tgtdev->bdev)
|
2018-04-12 10:29:38 +08:00
|
|
|
fs_devices->open_devices--;
|
btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device will be taken
out of fs device list, scratch + update_dev_time and freed. However
we could do the scratch + update_dev_time and free part after the
device has been taken out of device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks.
Reported issue:
[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861] (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907] (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940] [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945] [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948] [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951] [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955] [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959] [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962] [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974] [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977] [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979] [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982] [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985] [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001] [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007] [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010] [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233] sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234] Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234] CPU0 CPU1
[ 5375.719235] ---- ----
[ 5375.719236] lock(&fs_devs->device_list_mutex);
[ 5375.719238] lock(namespace_sem);
[ 5375.719239] lock(&fs_devs->device_list_mutex);
[ 5375.719241] lock(sb_writers);
[ 5375.719241]
[ 5375.719241] *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-18 16:51:23 +08:00
|
|
|
|
2018-04-12 10:29:38 +08:00
|
|
|
fs_devices->num_devices--;
|
2012-11-06 00:33:06 +08:00
|
|
|
|
2018-07-21 00:37:50 +08:00
|
|
|
btrfs_assign_next_active_device(tgtdev, NULL);
|
2012-11-06 00:33:06 +08:00
|
|
|
|
|
|
|
list_del_rcu(&tgtdev->dev_list);
|
|
|
|
|
2018-04-12 10:29:38 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device will be taken
out of fs device list, scratch + update_dev_time and freed. However
we could do the scratch + update_dev_time and free part after the
device has been taken out of device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks.
Reported issue:
[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861] (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907] (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940] [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945] [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948] [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951] [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955] [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959] [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962] [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974] [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977] [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979] [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982] [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985] [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001] [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007] [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010] [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233] sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234] Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234] CPU0 CPU1
[ 5375.719235] ---- ----
[ 5375.719236] lock(&fs_devs->device_list_mutex);
[ 5375.719238] lock(namespace_sem);
[ 5375.719239] lock(&fs_devs->device_list_mutex);
[ 5375.719241] lock(sb_writers);
[ 5375.719241]
[ 5375.719241] *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-18 16:51:23 +08:00
|
|
|
|
2024-02-22 16:51:33 +08:00
|
|
|
btrfs_scratch_superblocks(tgtdev->fs_info, tgtdev);
|
2016-07-22 06:04:53 +08:00
|
|
|
|
|
|
|
btrfs_close_bdev(tgtdev);
|
2019-03-27 20:24:11 +08:00
|
|
|
synchronize_rcu();
|
|
|
|
btrfs_free_device(tgtdev);
|
2012-11-06 00:33:06 +08:00
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* Populate args from device at path.
|
2021-10-06 04:12:43 +08:00
|
|
|
*
|
|
|
|
* @fs_info: the filesystem
|
|
|
|
* @args: the args to populate
|
|
|
|
* @path: the path to the device
|
|
|
|
*
|
|
|
|
* This will read the super block of the device at @path and populate @args with
|
|
|
|
* the devid, fsid, and uuid. This is meant to be used for ioctls that need to
|
|
|
|
* lookup a device to operate on, but need to do it before we take any locks.
|
|
|
|
* This properly handles the special case of "missing" that a user may pass in,
|
|
|
|
* and does some basic sanity checks. The caller must make sure that @path is
|
|
|
|
* properly NUL terminated before calling in, and must call
|
|
|
|
* btrfs_put_dev_args_from_path() in order to free up the temporary fsid and
|
|
|
|
* uuid buffers.
|
|
|
|
*
|
|
|
|
* Return: 0 for success, -errno for failure
|
|
|
|
*/
|
|
|
|
int btrfs_get_dev_args_from_path(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_dev_lookup_args *args,
|
|
|
|
const char *path)
|
2012-11-05 21:42:30 +08:00
|
|
|
{
|
|
|
|
struct btrfs_super_block *disk_super;
|
2024-01-23 21:26:36 +08:00
|
|
|
struct file *bdev_file;
|
2021-10-06 04:12:43 +08:00
|
|
|
int ret;
|
2012-11-05 21:42:30 +08:00
|
|
|
|
2021-10-06 04:12:43 +08:00
|
|
|
if (!path || !path[0])
|
|
|
|
return -EINVAL;
|
|
|
|
if (!strcmp(path, "missing")) {
|
|
|
|
args->missing = true;
|
|
|
|
return 0;
|
|
|
|
}
|
2020-02-13 23:24:32 +08:00
|
|
|
|
2021-10-06 04:12:43 +08:00
|
|
|
args->uuid = kzalloc(BTRFS_UUID_SIZE, GFP_KERNEL);
|
|
|
|
args->fsid = kzalloc(BTRFS_FSID_SIZE, GFP_KERNEL);
|
|
|
|
if (!args->uuid || !args->fsid) {
|
|
|
|
btrfs_put_dev_args_from_path(args);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2020-02-13 23:24:32 +08:00
|
|
|
|
2023-06-08 19:02:55 +08:00
|
|
|
ret = btrfs_get_bdev_and_sb(path, BLK_OPEN_READ, NULL, 0,
|
2024-01-23 21:26:36 +08:00
|
|
|
&bdev_file, &disk_super);
|
2022-08-15 23:16:06 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_put_dev_args_from_path(args);
|
2021-10-06 04:12:43 +08:00
|
|
|
return ret;
|
2022-08-15 23:16:06 +08:00
|
|
|
}
|
|
|
|
|
2021-10-06 04:12:43 +08:00
|
|
|
args->devid = btrfs_stack_device_id(&disk_super->dev_item);
|
|
|
|
memcpy(args->uuid, disk_super->dev_item.uuid, BTRFS_UUID_SIZE);
|
2018-10-30 22:43:23 +08:00
|
|
|
if (btrfs_fs_incompat(fs_info, METADATA_UUID))
|
2021-10-06 04:12:43 +08:00
|
|
|
memcpy(args->fsid, disk_super->metadata_uuid, BTRFS_FSID_SIZE);
|
2018-10-30 22:43:23 +08:00
|
|
|
else
|
2021-10-06 04:12:43 +08:00
|
|
|
memcpy(args->fsid, disk_super->fsid, BTRFS_FSID_SIZE);
|
2020-02-13 23:24:32 +08:00
|
|
|
btrfs_release_disk_super(disk_super);
|
2024-01-23 21:26:36 +08:00
|
|
|
fput(bdev_file);
|
2021-10-06 04:12:43 +08:00
|
|
|
return 0;
|
2012-11-05 21:42:30 +08:00
|
|
|
}
|
|
|
|
|
2016-02-15 23:39:55 +08:00
|
|
|
/*
|
2021-10-06 04:12:43 +08:00
|
|
|
* Only use this jointly with btrfs_get_dev_args_from_path() because we will
|
|
|
|
* allocate our ->uuid and ->fsid pointers, everybody else uses local variables
|
|
|
|
* that don't need to be freed.
|
2016-02-15 23:39:55 +08:00
|
|
|
*/
|
2021-10-06 04:12:43 +08:00
|
|
|
void btrfs_put_dev_args_from_path(struct btrfs_dev_lookup_args *args)
|
|
|
|
{
|
|
|
|
kfree(args->uuid);
|
|
|
|
kfree(args->fsid);
|
|
|
|
args->uuid = NULL;
|
|
|
|
args->fsid = NULL;
|
|
|
|
}
|
|
|
|
|
2018-09-03 17:46:14 +08:00
|
|
|
struct btrfs_device *btrfs_find_device_by_devspec(
|
2019-01-17 23:32:29 +08:00
|
|
|
struct btrfs_fs_info *fs_info, u64 devid,
|
|
|
|
const char *device_path)
|
2016-02-13 10:01:35 +08:00
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
BTRFS_DEV_LOOKUP_ARGS(args);
|
2018-09-03 17:46:14 +08:00
|
|
|
struct btrfs_device *device;
|
2021-10-06 04:12:43 +08:00
|
|
|
int ret;
|
2016-02-13 10:01:35 +08:00
|
|
|
|
2016-02-15 23:39:55 +08:00
|
|
|
if (devid) {
|
2021-10-06 04:12:42 +08:00
|
|
|
args.devid = devid;
|
|
|
|
device = btrfs_find_device(fs_info->fs_devices, &args);
|
2018-09-03 17:46:14 +08:00
|
|
|
if (!device)
|
|
|
|
return ERR_PTR(-ENOENT);
|
2019-01-17 23:32:29 +08:00
|
|
|
return device;
|
|
|
|
}
|
|
|
|
|
2021-10-06 04:12:43 +08:00
|
|
|
ret = btrfs_get_dev_args_from_path(fs_info, &args, device_path);
|
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
device = btrfs_find_device(fs_info->fs_devices, &args);
|
|
|
|
btrfs_put_dev_args_from_path(&args);
|
|
|
|
if (!device)
|
2019-01-17 23:32:29 +08:00
|
|
|
return ERR_PTR(-ENOENT);
|
2021-10-06 04:12:43 +08:00
|
|
|
return device;
|
2016-02-13 10:01:35 +08:00
|
|
|
}
|
|
|
|
|
2021-11-09 17:51:58 +08:00
|
|
|
static struct btrfs_fs_devices *btrfs_init_sprout(struct btrfs_fs_info *fs_info)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_fs_devices *old_devices;
|
2008-12-12 23:03:26 +08:00
|
|
|
struct btrfs_fs_devices *seed_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2018-03-16 09:21:22 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
2008-12-12 23:03:26 +08:00
|
|
|
if (!fs_devices->seeding)
|
2021-11-09 17:51:58 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2020-08-12 22:04:36 +08:00
|
|
|
/*
|
|
|
|
* Private copy of the seed devices, anchored at
|
|
|
|
* fs_info->fs_devices->seed_list
|
|
|
|
*/
|
2023-08-23 22:52:13 +08:00
|
|
|
seed_devices = alloc_fs_devices(NULL);
|
2013-08-12 19:33:03 +08:00
|
|
|
if (IS_ERR(seed_devices))
|
2021-11-09 17:51:58 +08:00
|
|
|
return seed_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2020-08-12 22:04:36 +08:00
|
|
|
/*
|
|
|
|
* It's necessary to retain a copy of the original seed fs_devices in
|
|
|
|
* fs_uuids so that filesystems which have been seeded can successfully
|
|
|
|
* reference the seed device from open_seed_devices. This also supports
|
|
|
|
* multiple fs seed.
|
|
|
|
*/
|
2008-12-12 23:03:26 +08:00
|
|
|
old_devices = clone_fs_devices(fs_devices);
|
|
|
|
if (IS_ERR(old_devices)) {
|
|
|
|
kfree(seed_devices);
|
2021-11-09 17:51:58 +08:00
|
|
|
return old_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2018-04-12 10:29:25 +08:00
|
|
|
list_add(&old_devices->fs_list, &fs_uuids);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
memcpy(seed_devices, fs_devices, sizeof(*seed_devices));
|
|
|
|
seed_devices->opened = 1;
|
|
|
|
INIT_LIST_HEAD(&seed_devices->devices);
|
|
|
|
INIT_LIST_HEAD(&seed_devices->alloc_list);
|
2009-06-11 03:17:02 +08:00
|
|
|
mutex_init(&seed_devices->device_list_mutex);
|
2011-04-20 18:07:30 +08:00
|
|
|
|
2021-11-09 17:51:58 +08:00
|
|
|
return seed_devices;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Splice seed devices into the sprout fs_devices.
|
|
|
|
* Generate a new fsid for the sprouted read-write filesystem.
|
|
|
|
*/
|
|
|
|
static void btrfs_setup_sprout(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_fs_devices *seed_devices)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
|
|
|
struct btrfs_super_block *disk_super = fs_info->super_copy;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
u64 super_flags;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We are updating the fsid, the thread leading to device_list_add()
|
|
|
|
* could race, so uuid_mutex is needed.
|
|
|
|
*/
|
|
|
|
lockdep_assert_held(&uuid_mutex);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The threads listed below may traverse dev_list but can do that without
|
|
|
|
* device_list_mutex:
|
|
|
|
* - All device ops and balance - as we are in btrfs_exclop_start.
|
|
|
|
* - Various dev_list readers - are using RCU.
|
|
|
|
* - btrfs_ioctl_fitrim() - is using RCU.
|
|
|
|
*
|
|
|
|
* For-read threads as below are using device_list_mutex:
|
|
|
|
* - Readonly scrub btrfs_scrub_dev()
|
|
|
|
* - Readonly scrub btrfs_scrub_progress()
|
|
|
|
* - btrfs_get_dev_stats()
|
|
|
|
*/
|
|
|
|
lockdep_assert_held(&fs_devices->device_list_mutex);
|
|
|
|
|
2011-04-20 18:09:16 +08:00
|
|
|
list_splice_init_rcu(&fs_devices->devices, &seed_devices->devices,
|
|
|
|
synchronize_rcu);
|
2014-09-03 21:35:41 +08:00
|
|
|
list_for_each_entry(device, &seed_devices->devices, dev_list)
|
|
|
|
device->fs_devices = seed_devices;
|
2011-04-20 18:07:30 +08:00
|
|
|
|
2019-11-13 18:27:27 +08:00
|
|
|
fs_devices->seeding = false;
|
2008-11-18 10:11:30 +08:00
|
|
|
fs_devices->num_devices = 0;
|
|
|
|
fs_devices->open_devices = 0;
|
2014-07-03 18:22:12 +08:00
|
|
|
fs_devices->missing_devices = 0;
|
2019-11-13 18:27:28 +08:00
|
|
|
fs_devices->rotating = false;
|
2020-07-16 15:25:33 +08:00
|
|
|
list_add(&seed_devices->seed_list, &fs_devices->seed_list);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
generate_random_uuid(fs_devices->fsid);
|
2018-10-30 22:43:23 +08:00
|
|
|
memcpy(fs_devices->metadata_uuid, fs_devices->fsid, BTRFS_FSID_SIZE);
|
2008-11-18 10:11:30 +08:00
|
|
|
memcpy(disk_super->fsid, fs_devices->fsid, BTRFS_FSID_SIZE);
|
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
number of devices before acquiring the device list mutex.
This could lead to inconsistent results because the update of
the device list and the number of devices counter (amongst other
counters related to the device list) are updated in volumes.c
while holding the device list mutex - except for 2 places, one
was volumes.c:btrfs_prepare_sprout() and the other was
volumes.c:device_list_add().
For example, if we have 2 devices, with IDs 1 and 2 and then add
a new device, with ID 3, and while adding the device is in progress
an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
devices of 2 and a max dev id of 3. This would be incorrect.
Also, this ioctl handler was reading the fsid while it can be
updated concurrently. This can happen when while a new device is
being added and the current filesystem is in seeding mode.
Example:
$ mkfs.btrfs -f /dev/sdb1
$ mkfs.btrfs -f /dev/sdb2
$ btrfstune -S 1 /dev/sdb1
$ mount /dev/sdb1 /mnt/test
$ btrfs device add /dev/sdb2 /mnt/test
If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
could read an fsid that was never valid (some bits part of the old
fsid and others part of the new fsid). Also, it could read a number
of devices that doesn't match the number of devices in the list and
the max device id, as explained before.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-13 03:56:58 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
super_flags = btrfs_super_flags(disk_super) &
|
|
|
|
~BTRFS_SUPER_FLAG_SEEDING;
|
|
|
|
btrfs_set_super_flags(disk_super, super_flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2016-05-20 09:18:45 +08:00
|
|
|
* Store the expected generation for seed devices in device items.
|
2008-11-18 10:11:30 +08:00
|
|
|
*/
|
2019-03-20 23:36:39 +08:00
|
|
|
static int btrfs_finish_sprout(struct btrfs_trans_handle *trans)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
BTRFS_DEV_LOOKUP_ARGS(args);
|
2019-03-20 23:36:39 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_dev_item *dev_item;
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_key key;
|
2017-07-29 17:50:09 +08:00
|
|
|
u8 fs_uuid[BTRFS_FSID_SIZE];
|
2008-11-18 10:11:30 +08:00
|
|
|
u8 dev_uuid[BTRFS_UUID_SIZE];
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.offset = 0;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
|
|
|
|
|
|
|
while (1) {
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_reserve_chunk_metadata(trans, false);
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_trans_release_chunk_metadata(trans);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
next_slot:
|
|
|
|
if (path->slots[0] >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret > 0)
|
|
|
|
break;
|
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-11-18 10:11:30 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (key.objectid != BTRFS_DEV_ITEMS_OBJECTID ||
|
|
|
|
key.type != BTRFS_DEV_ITEM_KEY)
|
|
|
|
break;
|
|
|
|
|
|
|
|
dev_item = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_dev_item);
|
2021-10-06 04:12:42 +08:00
|
|
|
args.devid = btrfs_device_id(leaf, dev_item);
|
2013-08-20 19:20:11 +08:00
|
|
|
read_extent_buffer(leaf, dev_uuid, btrfs_device_uuid(dev_item),
|
2008-11-18 10:11:30 +08:00
|
|
|
BTRFS_UUID_SIZE);
|
2013-08-20 19:20:12 +08:00
|
|
|
read_extent_buffer(leaf, fs_uuid, btrfs_device_fsid(dev_item),
|
2017-07-29 17:50:09 +08:00
|
|
|
BTRFS_FSID_SIZE);
|
2021-10-06 04:12:42 +08:00
|
|
|
args.uuid = dev_uuid;
|
|
|
|
args.fsid = fs_uuid;
|
|
|
|
device = btrfs_find_device(fs_info->fs_devices, &args);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(!device); /* Logic error */
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
if (device->fs_devices->seeding) {
|
|
|
|
btrfs_set_device_generation(leaf, dev_item,
|
|
|
|
device->generation);
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_mark_buffer_dirty(trans, leaf);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
path->slots[0]++;
|
|
|
|
goto next_slot;
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
error:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-02-15 00:55:53 +08:00
|
|
|
int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
|
2008-04-29 03:29:42 +08:00
|
|
|
{
|
2016-06-22 08:16:08 +08:00
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2008-04-29 03:29:42 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_device *device;
|
2024-01-23 21:26:36 +08:00
|
|
|
struct file *bdev_file;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct super_block *sb = fs_info->sb;
|
2018-07-03 13:14:50 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2023-03-24 10:08:38 +08:00
|
|
|
struct btrfs_fs_devices *seed_devices = NULL;
|
2018-07-27 08:04:55 +08:00
|
|
|
u64 orig_super_total_bytes;
|
|
|
|
u64 orig_super_num_devices;
|
2008-04-29 03:29:42 +08:00
|
|
|
int ret = 0;
|
2021-09-21 12:33:23 +08:00
|
|
|
bool seeding_dev = false;
|
2020-07-22 16:09:23 +08:00
|
|
|
bool locked = false;
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
if (sb_rdonly(sb) && !fs_devices->seeding)
|
2012-05-10 18:10:38 +08:00
|
|
|
return -EROFS;
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
bdev_file = bdev_file_open_by_path(device_path, BLK_OPEN_WRITE,
|
2023-09-27 17:34:26 +08:00
|
|
|
fs_info->bdev_holder, NULL);
|
2024-01-23 21:26:36 +08:00
|
|
|
if (IS_ERR(bdev_file))
|
|
|
|
return PTR_ERR(bdev_file);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
if (!btrfs_check_device_zone_type(fs_info, file_bdev(bdev_file))) {
|
2020-11-10 19:26:08 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
if (fs_devices->seeding) {
|
2021-09-21 12:33:23 +08:00
|
|
|
seeding_dev = true;
|
2008-11-18 10:11:30 +08:00
|
|
|
down_write(&sb->s_umount);
|
|
|
|
mutex_lock(&uuid_mutex);
|
2020-07-22 16:09:23 +08:00
|
|
|
locked = true;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2024-01-23 21:26:36 +08:00
|
|
|
sync_blockdev(file_bdev(bdev_file));
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2020-07-22 16:09:22 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
|
2024-01-23 21:26:36 +08:00
|
|
|
if (device->bdev == file_bdev(bdev_file)) {
|
2008-04-29 03:29:42 +08:00
|
|
|
ret = -EEXIST;
|
2020-07-22 16:09:22 +08:00
|
|
|
rcu_read_unlock();
|
2008-11-18 10:11:30 +08:00
|
|
|
goto error;
|
2008-04-29 03:29:42 +08:00
|
|
|
}
|
|
|
|
}
|
2020-07-22 16:09:22 +08:00
|
|
|
rcu_read_unlock();
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2022-11-07 23:07:17 +08:00
|
|
|
device = btrfs_alloc_device(fs_info, NULL, NULL, device_path);
|
2013-08-23 18:20:17 +08:00
|
|
|
if (IS_ERR(device)) {
|
2008-04-29 03:29:42 +08:00
|
|
|
/* we can safely leave the fs_devices entry around */
|
2013-08-23 18:20:17 +08:00
|
|
|
ret = PTR_ERR(device);
|
2008-11-18 10:11:30 +08:00
|
|
|
goto error;
|
2008-04-29 03:29:42 +08:00
|
|
|
}
|
|
|
|
|
2020-11-10 19:26:07 +08:00
|
|
|
device->fs_info = fs_info;
|
2024-01-23 21:26:36 +08:00
|
|
|
device->bdev_file = bdev_file;
|
|
|
|
device->bdev = file_bdev(bdev_file);
|
2022-01-12 13:06:01 +08:00
|
|
|
ret = lookup_bdev(device_path, &device->devt);
|
|
|
|
if (ret)
|
|
|
|
goto error_free_device;
|
2020-11-10 19:26:07 +08:00
|
|
|
|
2021-11-11 13:14:38 +08:00
|
|
|
ret = btrfs_get_dev_zone_info(device, false);
|
2020-11-10 19:26:07 +08:00
|
|
|
if (ret)
|
|
|
|
goto error_free_device;
|
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
2011-01-20 14:19:37 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
2020-11-10 19:26:07 +08:00
|
|
|
goto error_free_zone;
|
2011-01-20 14:19:37 +08:00
|
|
|
}
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
|
2008-11-18 10:11:30 +08:00
|
|
|
device->generation = trans->transid;
|
2016-06-23 06:54:23 +08:00
|
|
|
device->io_width = fs_info->sectorsize;
|
|
|
|
device->io_align = fs_info->sectorsize;
|
|
|
|
device->sector_size = fs_info->sectorsize;
|
2021-10-18 18:11:12 +08:00
|
|
|
device->total_bytes =
|
2023-09-27 17:34:26 +08:00
|
|
|
round_down(bdev_nr_bytes(device->bdev), fs_info->sectorsize);
|
2009-06-04 21:23:50 +08:00
|
|
|
device->disk_total_bytes = device->total_bytes;
|
2014-09-03 21:35:33 +08:00
|
|
|
device->commit_total_bytes = device->total_bytes;
|
2017-12-04 12:54:53 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
|
2017-12-04 12:54:55 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
|
2013-10-11 21:20:42 +08:00
|
|
|
device->dev_stats_valid = 1;
|
2024-04-18 12:34:31 +08:00
|
|
|
set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (seeding_dev) {
|
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path to wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 18:10:47 +08:00
|
|
|
btrfs_clear_sb_rdonly(sb);
|
2021-11-09 17:51:58 +08:00
|
|
|
|
|
|
|
/* GFP_KERNEL allocation must not be under device_list_mutex */
|
|
|
|
seed_devices = btrfs_init_sprout(fs_info);
|
|
|
|
if (IS_ERR(seed_devices)) {
|
|
|
|
ret = PTR_ERR(seed_devices);
|
2017-09-28 14:51:10 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto error_trans;
|
|
|
|
}
|
2021-11-09 17:51:58 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
|
|
|
if (seeding_dev) {
|
|
|
|
btrfs_setup_sprout(fs_info, seed_devices);
|
btrfs: update latest_dev when we create a sprout device
When we add a device to the seed filesystem (sprouting) it is a new
filesystem (and fsid) on the device added. Update the latest_dev so
that /proc/self/mounts shows the correct device.
Example:
$ btrfstune -S1 /dev/vg/seed
$ mount /dev/vg/seed /btrfs
mount: /btrfs: WARNING: device write-protected, mounted read-only.
$ cat /proc/self/mounts | grep btrfs
/dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
$ btrfs dev add -f /dev/vg/new /btrfs
Before:
$ cat /proc/self/mounts | grep btrfs
/dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
After:
$ cat /proc/self/mounts | grep btrfs
/dev/mapper/vg-new /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-24 13:05:21 +08:00
|
|
|
btrfs_assign_next_active_device(fs_info->fs_devices->latest_dev,
|
|
|
|
device);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
device->fs_devices = fs_devices;
|
2009-06-11 03:17:02 +08:00
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2018-07-03 13:14:50 +08:00
|
|
|
list_add_rcu(&device->dev_list, &fs_devices->devices);
|
|
|
|
list_add(&device->dev_alloc_list, &fs_devices->alloc_list);
|
|
|
|
fs_devices->num_devices++;
|
|
|
|
fs_devices->open_devices++;
|
|
|
|
fs_devices->rw_devices++;
|
|
|
|
fs_devices->total_devices++;
|
|
|
|
fs_devices->total_rw_bytes += device->total_bytes;
|
2008-09-06 04:43:54 +08:00
|
|
|
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
|
2011-09-27 05:12:22 +08:00
|
|
|
|
2023-09-27 17:34:26 +08:00
|
|
|
if (!bdev_nonrot(device->bdev))
|
2019-11-13 18:27:28 +08:00
|
|
|
fs_devices->rotating = true;
|
2009-06-10 21:51:32 +08:00
|
|
|
|
2018-07-27 08:04:55 +08:00
|
|
|
orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_set_super_total_bytes(fs_info->super_copy,
|
2018-07-27 08:04:55 +08:00
|
|
|
round_down(orig_super_total_bytes + device->total_bytes,
|
|
|
|
fs_info->sectorsize));
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2018-07-27 08:04:55 +08:00
|
|
|
orig_super_num_devices = btrfs_super_num_devices(fs_info->super_copy);
|
|
|
|
btrfs_set_super_num_devices(fs_info->super_copy,
|
|
|
|
orig_super_num_devices + 1);
|
2014-06-03 11:36:01 +08:00
|
|
|
|
2014-09-03 21:35:41 +08:00
|
|
|
/*
|
|
|
|
* we've got more storage, clear any full flags on the space
|
|
|
|
* infos
|
|
|
|
*/
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_clear_space_info_full(fs_info);
|
2014-09-03 21:35:41 +08:00
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
btrfs: sysfs: init devices outside of the chunk_mutex
While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
following lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #4 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_link+0x63/0xa0
sysfs_do_create_link_sd+0x5e/0xd0
btrfs_sysfs_add_devices_dir+0x81/0x130
btrfs_init_new_device+0x67f/0x1250
btrfs_ioctl+0x1ef/0x2e20
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_insert_delayed_items+0x90/0x4f0
btrfs_commit_inode_delayed_items+0x93/0x140
btrfs_log_inode+0x5de/0x2020
btrfs_log_inode_parent+0x429/0xc90
btrfs_log_new_name+0x95/0x9b
btrfs_rename2+0xbb9/0x1800
vfs_rename+0x64f/0x9f0
do_renameat2+0x320/0x4e0
__x64_sys_rename+0x1f/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because we are holding the chunk_mutex at the time of
adding in a new device. However we only need to hold the
device_list_mutex, as we're going to iterate over the fs_devices
devices. Move the sysfs init stuff outside of the chunk_mutex to get
rid of this lockdep splat.
CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions
CC: stable@vger.kernel.org # 4.4.x
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-01 20:09:01 +08:00
|
|
|
|
|
|
|
/* Add sysfs device entry */
|
2020-09-05 01:34:26 +08:00
|
|
|
btrfs_sysfs_add_device(device);
|
btrfs: sysfs: init devices outside of the chunk_mutex
While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
following lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #4 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_link+0x63/0xa0
sysfs_do_create_link_sd+0x5e/0xd0
btrfs_sysfs_add_devices_dir+0x81/0x130
btrfs_init_new_device+0x67f/0x1250
btrfs_ioctl+0x1ef/0x2e20
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_insert_delayed_items+0x90/0x4f0
btrfs_commit_inode_delayed_items+0x93/0x140
btrfs_log_inode+0x5de/0x2020
btrfs_log_inode_parent+0x429/0xc90
btrfs_log_new_name+0x95/0x9b
btrfs_rename2+0xbb9/0x1800
vfs_rename+0x64f/0x9f0
do_renameat2+0x320/0x4e0
__x64_sys_rename+0x1f/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because we are holding the chunk_mutex at the time of
adding in a new device. However we only need to hold the
device_list_mutex, as we're going to iterate over the fs_devices
devices. Move the sysfs init stuff outside of the chunk_mutex to get
rid of this lockdep splat.
CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions
CC: stable@vger.kernel.org # 4.4.x
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-01 20:09:01 +08:00
|
|
|
|
2018-07-03 13:14:50 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (seeding_dev) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2019-03-20 23:29:13 +08:00
|
|
|
ret = init_first_rw_device(trans);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2017-09-28 14:51:10 +08:00
|
|
|
goto error_sysfs;
|
2012-09-18 21:52:32 +08:00
|
|
|
}
|
2014-09-03 21:35:41 +08:00
|
|
|
}
|
|
|
|
|
2018-07-21 00:37:47 +08:00
|
|
|
ret = btrfs_add_dev_item(trans, device);
|
2014-09-03 21:35:41 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2017-09-28 14:51:10 +08:00
|
|
|
goto error_sysfs;
|
2014-09-03 21:35:41 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (seeding_dev) {
|
2019-03-20 23:36:39 +08:00
|
|
|
ret = btrfs_finish_sprout(trans);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2017-09-28 14:51:10 +08:00
|
|
|
goto error_sysfs;
|
2012-09-18 21:52:32 +08:00
|
|
|
}
|
2014-06-03 11:36:03 +08:00
|
|
|
|
2020-08-12 21:18:51 +08:00
|
|
|
/*
|
|
|
|
* fs_devices now represents the newly sprouted filesystem and
|
2021-11-09 17:51:58 +08:00
|
|
|
* its fsid has been changed by btrfs_sprout_splice().
|
2020-08-12 21:18:51 +08:00
|
|
|
*/
|
|
|
|
btrfs_sysfs_update_sprout_fsid(fs_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-09-10 09:39:03 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (seeding_dev) {
|
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
up_write(&sb->s_umount);
|
2020-07-22 16:09:23 +08:00
|
|
|
locked = false;
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) /* transaction commit */
|
|
|
|
return ret;
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_relocate_sys_chunks(fs_info);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0)
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_handle_fs_error(fs_info, ret,
|
2016-09-20 22:05:00 +08:00
|
|
|
"Failed to relocate sys chunks after device initialization. This can be fixed using the \"btrfs balance\" command.");
|
Btrfs: fix deadlock caused by the nested chunk allocation
Steps to reproduce:
# mkfs.btrfs -m raid1 <disk1> <disk2>
# btrfstune -S 1 <disk1>
# mount <disk1> <mnt>
# btrfs device add <disk3> <disk4> <mnt>
# mount -o remount,rw <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
Deadlock happened.
It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.
By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.
So, I fix this problem by commiting the transaction at the end of the first
device insertion.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-16 19:26:46 +08:00
|
|
|
trans = btrfs_attach_transaction(root);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
if (PTR_ERR(trans) == -ENOENT)
|
|
|
|
return 0;
|
2017-09-28 14:51:11 +08:00
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
trans = NULL;
|
|
|
|
goto error_sysfs;
|
Btrfs: fix deadlock caused by the nested chunk allocation
Steps to reproduce:
# mkfs.btrfs -m raid1 <disk1> <disk2>
# btrfstune -S 1 <disk1>
# mount <disk1> <mnt>
# btrfs device add <disk3> <disk4> <mnt>
# mount -o remount,rw <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
Deadlock happened.
It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.
By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.
So, I fix this problem by commiting the transaction at the end of the first
device insertion.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-16 19:26:46 +08:00
|
|
|
}
|
2016-09-10 09:39:03 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2020-05-05 02:58:26 +08:00
|
|
|
/*
|
|
|
|
* Now that we have written a new super block to this device, check all
|
|
|
|
* other fs_devices list if device_path alienates any other scanned
|
|
|
|
* device.
|
|
|
|
* We can ignore the return value as it typically returns -EINVAL and
|
|
|
|
* only succeeds if the device was an alien.
|
|
|
|
*/
|
2022-01-12 13:06:01 +08:00
|
|
|
btrfs_forget_devices(device->devt);
|
2020-05-05 02:58:26 +08:00
|
|
|
|
|
|
|
/* Update ctime/mtime for blkid or udev */
|
2021-10-15 01:11:01 +08:00
|
|
|
update_dev_time(device_path);
|
2020-05-05 02:58:26 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
return ret;
|
2012-03-12 23:03:00 +08:00
|
|
|
|
2017-09-28 14:51:10 +08:00
|
|
|
error_sysfs:
|
2020-09-05 01:34:27 +08:00
|
|
|
btrfs_sysfs_remove_device(device);
|
2018-07-27 08:04:55 +08:00
|
|
|
mutex_lock(&fs_info->fs_devices->device_list_mutex);
|
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
list_del_rcu(&device->dev_list);
|
|
|
|
list_del(&device->dev_alloc_list);
|
|
|
|
fs_info->fs_devices->num_devices--;
|
|
|
|
fs_info->fs_devices->open_devices--;
|
|
|
|
fs_info->fs_devices->rw_devices--;
|
|
|
|
fs_info->fs_devices->total_devices--;
|
|
|
|
fs_info->fs_devices->total_rw_bytes -= device->total_bytes;
|
|
|
|
atomic64_sub(device->total_bytes, &fs_info->free_chunk_space);
|
|
|
|
btrfs_set_super_total_bytes(fs_info->super_copy,
|
|
|
|
orig_super_total_bytes);
|
|
|
|
btrfs_set_super_num_devices(fs_info->super_copy,
|
|
|
|
orig_super_num_devices);
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2012-03-12 23:03:00 +08:00
|
|
|
error_trans:
|
2017-09-28 14:51:09 +08:00
|
|
|
if (seeding_dev)
|
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path to wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-14 18:10:47 +08:00
|
|
|
btrfs_set_sb_rdonly(sb);
|
2017-09-28 14:51:11 +08:00
|
|
|
if (trans)
|
|
|
|
btrfs_end_transaction(trans);
|
2020-11-10 19:26:07 +08:00
|
|
|
error_free_zone:
|
|
|
|
btrfs_destroy_dev_zone_info(device);
|
2017-10-31 02:29:46 +08:00
|
|
|
error_free_device:
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(device);
|
2008-11-18 10:11:30 +08:00
|
|
|
error:
|
2024-01-23 21:26:36 +08:00
|
|
|
fput(bdev_file);
|
2020-07-22 16:09:23 +08:00
|
|
|
if (locked) {
|
2008-11-18 10:11:30 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
up_write(&sb->s_umount);
|
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
return ret;
|
2008-04-29 03:29:42 +08:00
|
|
|
}
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_device *device)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_root *root = device->fs_info->chunk_root;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_dev_item *dev_item;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.type = BTRFS_DEV_ITEM_KEY;
|
|
|
|
key.offset = device->devid;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
|
|
|
|
|
|
|
|
btrfs_set_device_id(leaf, dev_item, device->devid);
|
|
|
|
btrfs_set_device_type(leaf, dev_item, device->type);
|
|
|
|
btrfs_set_device_io_align(leaf, dev_item, device->io_align);
|
|
|
|
btrfs_set_device_io_width(leaf, dev_item, device->io_width);
|
|
|
|
btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_set_device_total_bytes(leaf, dev_item,
|
|
|
|
btrfs_device_get_disk_total_bytes(device));
|
|
|
|
btrfs_set_device_bytes_used(leaf, dev_item,
|
|
|
|
btrfs_device_get_bytes_used(device));
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_mark_buffer_dirty(trans, leaf);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-09-03 21:35:41 +08:00
|
|
|
int btrfs_grow_device(struct btrfs_trans_handle *trans,
|
2008-04-26 04:53:30 +08:00
|
|
|
struct btrfs_device *device, u64 new_size)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2014-09-03 21:35:41 +08:00
|
|
|
u64 old_total;
|
|
|
|
u64 diff;
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
int ret;
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
|
2008-11-18 10:11:30 +08:00
|
|
|
return -EACCES;
|
2014-09-03 21:35:41 +08:00
|
|
|
|
2017-06-16 19:39:20 +08:00
|
|
|
new_size = round_down(new_size, fs_info->sectorsize);
|
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:41 +08:00
|
|
|
old_total = btrfs_super_total_bytes(super_copy);
|
2017-07-21 16:28:24 +08:00
|
|
|
diff = round_down(new_size - device->total_bytes, fs_info->sectorsize);
|
2014-09-03 21:35:41 +08:00
|
|
|
|
2012-11-06 01:29:28 +08:00
|
|
|
if (new_size <= device->total_bytes ||
|
2017-12-04 12:54:55 +08:00
|
|
|
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2008-11-18 10:11:30 +08:00
|
|
|
return -EINVAL;
|
2014-09-03 21:35:41 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-06-16 19:39:20 +08:00
|
|
|
btrfs_set_super_total_bytes(super_copy,
|
|
|
|
round_down(old_total + diff, fs_info->sectorsize));
|
2008-11-18 10:11:30 +08:00
|
|
|
device->fs_devices->total_rw_bytes += diff;
|
2023-09-28 01:47:00 +08:00
|
|
|
atomic64_add(diff, &fs_info->free_chunk_space);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_device_set_total_bytes(device, new_size);
|
|
|
|
btrfs_device_set_disk_total_bytes(device, new_size);
|
2016-06-23 06:54:56 +08:00
|
|
|
btrfs_clear_space_info_full(device->fs_info);
|
2019-03-25 20:31:22 +08:00
|
|
|
if (list_empty(&device->post_commit_list))
|
|
|
|
list_add_tail(&device->post_commit_list,
|
|
|
|
&trans->transaction->dev_update_list);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2009-03-11 00:39:20 +08:00
|
|
|
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_reserve_chunk_metadata(trans, false);
|
|
|
|
ret = btrfs_update_device(trans, device);
|
|
|
|
btrfs_trans_release_chunk_metadata(trans);
|
|
|
|
|
|
|
|
return ret;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
2018-07-21 00:37:52 +08:00
|
|
|
static int btrfs_free_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
|
2008-04-26 04:53:30 +08:00
|
|
|
{
|
2018-07-21 00:37:52 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2008-04-26 04:53:30 +08:00
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2017-07-27 19:37:29 +08:00
|
|
|
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
2008-04-26 04:53:30 +08:00
|
|
|
key.offset = chunk_offset;
|
|
|
|
key.type = BTRFS_CHUNK_ITEM_KEY;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
else if (ret > 0) { /* Logic error or corruption */
|
2024-05-22 23:52:38 +08:00
|
|
|
btrfs_err(fs_info, "failed to lookup chunk %llu when freeing",
|
|
|
|
chunk_offset);
|
|
|
|
btrfs_abort_transaction(trans, -ENOENT);
|
|
|
|
ret = -EUCLEAN;
|
2012-03-12 23:03:00 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
2024-05-22 23:52:38 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_err(fs_info, "failed to delete chunk %llu item", chunk_offset);
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
out:
|
2008-04-26 04:53:30 +08:00
|
|
|
btrfs_free_path(path);
|
2011-05-19 12:37:44 +08:00
|
|
|
return ret;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
2017-07-27 19:37:29 +08:00
|
|
|
static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
|
2008-04-26 04:53:30 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2008-04-26 04:53:30 +08:00
|
|
|
struct btrfs_disk_key *disk_key;
|
|
|
|
struct btrfs_chunk *chunk;
|
|
|
|
u8 *ptr;
|
|
|
|
int ret = 0;
|
|
|
|
u32 num_stripes;
|
|
|
|
u32 array_size;
|
|
|
|
u32 len = 0;
|
|
|
|
u32 cur;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
lockdep_assert_held(&fs_info->chunk_mutex);
|
2008-04-26 04:53:30 +08:00
|
|
|
array_size = btrfs_super_sys_array_size(super_copy);
|
|
|
|
|
|
|
|
ptr = super_copy->sys_chunk_array;
|
|
|
|
cur = 0;
|
|
|
|
|
|
|
|
while (cur < array_size) {
|
|
|
|
disk_key = (struct btrfs_disk_key *)ptr;
|
|
|
|
btrfs_disk_key_to_cpu(&key, disk_key);
|
|
|
|
|
|
|
|
len = sizeof(*disk_key);
|
|
|
|
|
|
|
|
if (key.type == BTRFS_CHUNK_ITEM_KEY) {
|
|
|
|
chunk = (struct btrfs_chunk *)(ptr + len);
|
|
|
|
num_stripes = btrfs_stack_chunk_num_stripes(chunk);
|
|
|
|
len += btrfs_chunk_item_size(num_stripes);
|
|
|
|
} else {
|
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
2017-07-27 19:37:29 +08:00
|
|
|
if (key.objectid == BTRFS_FIRST_CHUNK_TREE_OBJECTID &&
|
2008-04-26 04:53:30 +08:00
|
|
|
key.offset == chunk_offset) {
|
|
|
|
memmove(ptr, ptr + len, array_size - (cur + len));
|
|
|
|
array_size -= len;
|
|
|
|
btrfs_set_super_sys_array_size(super_copy, array_size);
|
|
|
|
} else {
|
|
|
|
ptr += len;
|
|
|
|
cur += len;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *btrfs_find_chunk_map_nolock(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 length)
|
|
|
|
{
|
|
|
|
struct rb_node *node = fs_info->mapping_tree.rb_root.rb_node;
|
|
|
|
struct rb_node *prev = NULL;
|
|
|
|
struct rb_node *orig_prev;
|
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
struct btrfs_chunk_map *prev_map = NULL;
|
|
|
|
|
|
|
|
while (node) {
|
|
|
|
map = rb_entry(node, struct btrfs_chunk_map, rb_node);
|
|
|
|
prev = node;
|
|
|
|
prev_map = map;
|
|
|
|
|
|
|
|
if (logical < map->start) {
|
|
|
|
node = node->rb_left;
|
|
|
|
} else if (logical >= map->start + map->chunk_len) {
|
|
|
|
node = node->rb_right;
|
|
|
|
} else {
|
|
|
|
refcount_inc(&map->refs);
|
|
|
|
return map;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!prev)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
orig_prev = prev;
|
|
|
|
while (prev && logical >= prev_map->start + prev_map->chunk_len) {
|
|
|
|
prev = rb_next(prev);
|
|
|
|
prev_map = rb_entry(prev, struct btrfs_chunk_map, rb_node);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!prev) {
|
|
|
|
prev = orig_prev;
|
|
|
|
prev_map = rb_entry(prev, struct btrfs_chunk_map, rb_node);
|
|
|
|
while (prev && logical < prev_map->start) {
|
|
|
|
prev = rb_prev(prev);
|
|
|
|
prev_map = rb_entry(prev, struct btrfs_chunk_map, rb_node);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (prev) {
|
|
|
|
u64 end = logical + length;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Caller can pass a U64_MAX length when it wants to get any
|
|
|
|
* chunk starting at an offset of 'logical' or higher, so deal
|
|
|
|
* with underflow by resetting the end offset to U64_MAX.
|
|
|
|
*/
|
|
|
|
if (end < logical)
|
|
|
|
end = U64_MAX;
|
|
|
|
|
|
|
|
if (end > prev_map->start &&
|
|
|
|
logical < prev_map->start + prev_map->chunk_len) {
|
|
|
|
refcount_inc(&prev_map->refs);
|
|
|
|
return prev_map;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct btrfs_chunk_map *btrfs_find_chunk_map(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 length)
|
|
|
|
{
|
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
|
|
|
|
read_lock(&fs_info->mapping_tree_lock);
|
|
|
|
map = btrfs_find_chunk_map_nolock(fs_info, logical, length);
|
|
|
|
read_unlock(&fs_info->mapping_tree_lock);
|
|
|
|
|
|
|
|
return map;
|
|
|
|
}
|
|
|
|
|
2018-05-17 07:34:31 +08:00
|
|
|
/*
|
2023-09-08 07:09:25 +08:00
|
|
|
* Find the mapping containing the given logical extent.
|
|
|
|
*
|
2018-05-17 07:34:31 +08:00
|
|
|
* @logical: Logical block offset in bytes.
|
|
|
|
* @length: Length of extent in bytes.
|
|
|
|
*
|
|
|
|
* Return: Chunk mapping or ERR_PTR.
|
|
|
|
*/
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 length)
|
2017-03-15 04:33:55 +08:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2017-03-15 04:33:55 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_find_chunk_map(fs_info, logical, length);
|
2017-03-15 04:33:55 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (unlikely(!map)) {
|
2023-11-21 21:38:33 +08:00
|
|
|
btrfs_crit(fs_info,
|
|
|
|
"unable to find chunk map for logical %llu length %llu",
|
2017-03-15 04:33:55 +08:00
|
|
|
logical, length);
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (unlikely(map->start > logical || map->start + map->chunk_len <= logical)) {
|
2017-03-15 04:33:55 +08:00
|
|
|
btrfs_crit(fs_info,
|
2023-11-21 21:38:33 +08:00
|
|
|
"found a bad chunk map, wanted %llu-%llu, found %llu-%llu",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
logical, logical + length, map->start,
|
|
|
|
map->start + map->chunk_len);
|
|
|
|
btrfs_free_chunk_map(map);
|
2017-03-15 04:33:55 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
/* Callers are responsible for dropping the reference. */
|
|
|
|
return map;
|
2017-03-15 04:33:55 +08:00
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
static int remove_chunk_item(struct btrfs_trans_handle *trans,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map, u64 chunk_offset)
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Removing chunk items and updating the device items in the chunks btree
|
|
|
|
* requires holding the chunk_mutex.
|
|
|
|
* See the comment at btrfs_chunk_alloc() for the details.
|
|
|
|
*/
|
|
|
|
lockdep_assert_held(&trans->fs_info->chunk_mutex);
|
|
|
|
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = btrfs_update_device(trans, map->stripes[i].dev);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return btrfs_free_chunk(trans, chunk_offset);
|
|
|
|
}
|
|
|
|
|
2018-07-21 00:37:53 +08:00
|
|
|
int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
|
2008-04-26 04:53:30 +08:00
|
|
|
{
|
2018-07-21 00:37:53 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2014-09-03 21:35:41 +08:00
|
|
|
u64 dev_extent_len = 0;
|
2014-09-18 23:20:02 +08:00
|
|
|
int i, ret = 0;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2008-04-26 04:53:30 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
|
|
|
|
if (IS_ERR(map)) {
|
2014-09-18 23:20:02 +08:00
|
|
|
/*
|
|
|
|
* This is a logic error, but we don't want to just rely on the
|
2016-03-05 03:23:12 +08:00
|
|
|
* user having built with ASSERT enabled, so if ASSERT doesn't
|
2014-09-18 23:20:02 +08:00
|
|
|
* do anything we still error out.
|
|
|
|
*/
|
|
|
|
ASSERT(0);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
return PTR_ERR(map);
|
2014-09-18 23:20:02 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2016-05-20 11:34:23 +08:00
|
|
|
/*
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
* First delete the device extent items from the devices btree.
|
|
|
|
* We take the device_list_mutex to avoid racing with the finishing phase
|
|
|
|
* of a device replace operation. See the comment below before acquiring
|
|
|
|
* fs_info->chunk_mutex. Note that here we do not acquire the chunk_mutex
|
|
|
|
* because that can result in a deadlock when deleting the device extent
|
|
|
|
* items from the devices btree - COWing an extent buffer from the btree
|
|
|
|
* may result in allocating a new metadata chunk, which would attempt to
|
|
|
|
* lock again fs_info->chunk_mutex.
|
2016-05-20 11:34:23 +08:00
|
|
|
*/
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2008-04-26 04:53:30 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
2014-09-18 23:20:02 +08:00
|
|
|
struct btrfs_device *device = map->stripes[i].dev;
|
2014-09-03 21:35:41 +08:00
|
|
|
ret = btrfs_free_dev_extent(trans, device,
|
|
|
|
map->stripes[i].physical,
|
|
|
|
&dev_extent_len);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret) {
|
2016-05-20 11:34:23 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2014-09-03 21:35:41 +08:00
|
|
|
if (device->bytes_used > 0) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:41 +08:00
|
|
|
btrfs_device_set_bytes_used(device,
|
|
|
|
device->bytes_used - dev_extent_len);
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_clear_space_info_full(fs_info);
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:41 +08:00
|
|
|
}
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2008-05-07 23:43:44 +08:00
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
/*
|
|
|
|
* We acquire fs_info->chunk_mutex for 2 reasons:
|
|
|
|
*
|
|
|
|
* 1) Just like with the first phase of the chunk allocation, we must
|
|
|
|
* reserve system space, do all chunk btree updates and deletions, and
|
|
|
|
* update the system chunk array in the superblock while holding this
|
|
|
|
* mutex. This is for similar reasons as explained on the comment at
|
|
|
|
* the top of btrfs_chunk_alloc();
|
|
|
|
*
|
|
|
|
* 2) Prevent races with the final phase of a device replace operation
|
|
|
|
* that replaces the device object associated with the map's stripes,
|
|
|
|
* because the device object's id can change at any time during that
|
|
|
|
* final phase of the device replace operation
|
|
|
|
* (dev-replace.c:btrfs_dev_replace_finishing()), so we could grab the
|
|
|
|
* replaced device and then see it with an ID of
|
|
|
|
* BTRFS_DEV_REPLACE_DEVID, which would cause a failure when updating
|
|
|
|
* the device item, which does not exists on the chunk btree.
|
|
|
|
* The finishing phase of device replace acquires both the
|
|
|
|
* device_list_mutex and the chunk_mutex, in that order, so we are
|
|
|
|
* safe by just acquiring the chunk_mutex.
|
|
|
|
*/
|
|
|
|
trans->removing_chunk = true;
|
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
|
|
|
|
check_system_chunk(trans, map->type);
|
|
|
|
|
|
|
|
ret = remove_chunk_item(trans, map, chunk_offset);
|
|
|
|
/*
|
|
|
|
* Normally we should not get -ENOSPC since we reserved space before
|
|
|
|
* through the call to check_system_chunk().
|
|
|
|
*
|
|
|
|
* Despite our system space_info having enough free space, we may not
|
|
|
|
* be able to allocate extents from its block groups, because all have
|
|
|
|
* an incompatible profile, which will force us to allocate a new system
|
|
|
|
* block group with the right profile, or right after we called
|
|
|
|
* check_system_space() above, a scrub turned the only system block group
|
|
|
|
* with enough free space into RO mode.
|
|
|
|
* This is explained with more detail at do_chunk_alloc().
|
|
|
|
*
|
|
|
|
* So if we get -ENOSPC, allocate a new system chunk and retry once.
|
|
|
|
*/
|
|
|
|
if (ret == -ENOSPC) {
|
|
|
|
const u64 sys_flags = btrfs_system_alloc_profile(fs_info);
|
|
|
|
struct btrfs_block_group *sys_bg;
|
|
|
|
|
2021-08-18 18:41:19 +08:00
|
|
|
sys_bg = btrfs_create_chunk(trans, sys_flags);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (IS_ERR(sys_bg)) {
|
|
|
|
ret = PTR_ERR(sys_bg);
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_chunk_alloc_add_chunk_item(trans, sys_bg);
|
2018-10-26 19:43:19 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out;
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2016-05-20 11:34:23 +08:00
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
ret = remove_chunk_item(trans, map, chunk_offset);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
} else if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
trace_btrfs_chunk_free(fs_info, map, chunk_offset, map->chunk_len);
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
|
2017-07-27 19:37:29 +08:00
|
|
|
ret = btrfs_del_sys_chunk(fs_info, chunk_offset);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
trans->removing_chunk = false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We are done with chunk btree updates and deletions, so release the
|
|
|
|
* system space we previously reserved (with check_system_chunk()).
|
|
|
|
*/
|
|
|
|
btrfs_trans_release_chunk_metadata(trans);
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
ret = btrfs_remove_block_group(trans, map);
|
2014-09-18 23:20:02 +08:00
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2014-09-18 23:20:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2014-09-18 23:20:02 +08:00
|
|
|
out:
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (trans->removing_chunk) {
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
trans->removing_chunk = false;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
/* once for us */
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2014-09-18 23:20:02 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2021-04-19 15:41:02 +08:00
|
|
|
int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
|
2014-09-18 23:20:02 +08:00
|
|
|
{
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2016-10-11 04:43:31 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2019-12-14 08:22:14 +08:00
|
|
|
struct btrfs_block_group *block_group;
|
2021-04-19 15:41:00 +08:00
|
|
|
u64 length;
|
2014-09-18 23:20:02 +08:00
|
|
|
int ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2021-12-16 04:39:59 +08:00
|
|
|
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"relocate: not supported on extent tree v2 yet");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
/*
|
|
|
|
* Prevent races with automatic removal of unused block groups.
|
|
|
|
* After we relocate and before we remove the chunk with offset
|
|
|
|
* chunk_offset, automatic removal of the block group can kick in,
|
|
|
|
* resulting in a failure when calling btrfs_remove_chunk() below.
|
|
|
|
*
|
|
|
|
* Make sure to acquire this mutex before doing a tree search (dev
|
|
|
|
* or chunk trees) to find chunks. Otherwise the cleaner kthread might
|
|
|
|
* call btrfs_remove_chunk() (through btrfs_delete_unused_bgs()) after
|
|
|
|
* we release the path used to search the chunk/dev tree and before
|
|
|
|
* the current task acquires this mutex and calls us.
|
|
|
|
*/
|
2021-04-19 15:41:01 +08:00
|
|
|
lockdep_assert_held(&fs_info->reclaim_bgs_lock);
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
|
2014-09-18 23:20:02 +08:00
|
|
|
/* step one, relocate all the extents inside this chunk */
|
2016-06-23 06:54:24 +08:00
|
|
|
btrfs_scrub_pause(fs_info);
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_relocate_block_group(fs_info, chunk_offset);
|
2016-06-23 06:54:24 +08:00
|
|
|
btrfs_scrub_continue(fs_info);
|
btrfs: fix deadlock when aborting transaction during relocation with scrub
Before relocating a block group we pause scrub, then do the relocation and
then unpause scrub. The relocation process requires starting and committing
a transaction, and if we have a failure in the critical section of the
transaction commit path (transaction state >= TRANS_STATE_COMMIT_START),
we will deadlock if there is a paused scrub.
That results in stack traces like the following:
[42.479] BTRFS info (device sdc): relocating block group 53876686848 flags metadata|raid6
[42.936] BTRFS warning (device sdc): Skipping commit of aborted transaction.
[42.936] ------------[ cut here ]------------
[42.936] BTRFS: Transaction aborted (error -28)
[42.936] WARNING: CPU: 11 PID: 346822 at fs/btrfs/transaction.c:1977 btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Modules linked in: dm_flakey dm_mod loop btrfs (...)
[42.936] CPU: 11 PID: 346822 Comm: btrfs Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[42.936] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[42.936] RIP: 0010:btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Code: ff ff 45 8b (...)
[42.936] RSP: 0018:ffffb58649633b48 EFLAGS: 00010282
[42.936] RAX: 0000000000000000 RBX: ffff8be6ef4d5bd8 RCX: 0000000000000000
[42.936] RDX: 0000000000000002 RSI: ffffffffb35e7782 RDI: 00000000ffffffff
[42.936] RBP: ffff8be6ef4d5c98 R08: 0000000000000000 R09: ffffb586496339e8
[42.936] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8be6d38c7c00
[42.936] R13: 00000000ffffffe4 R14: ffff8be6c268c000 R15: ffff8be6ef4d5cf0
[42.936] FS: 00007f381a82b340(0000) GS:ffff8beddfcc0000(0000) knlGS:0000000000000000
[42.936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42.936] CR2: 00007f1e35fb7638 CR3: 0000000117680006 CR4: 0000000000370ee0
[42.936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42.936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42.936] Call Trace:
[42.936] <TASK>
[42.936] ? start_transaction+0xcb/0x610 [btrfs]
[42.936] prepare_to_relocate+0x111/0x1a0 [btrfs]
[42.936] relocate_block_group+0x57/0x5d0 [btrfs]
[42.936] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[42.936] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[42.936] ? __pfx_autoremove_wake_function+0x10/0x10
[42.936] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[42.936] btrfs_balance+0x8ff/0x11d0 [btrfs]
[42.936] ? __kmem_cache_alloc_node+0x14a/0x410
[42.936] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[42.937] ? mod_objcg_state+0xd2/0x360
[42.937] ? refill_obj_stock+0xb0/0x160
[42.937] ? seq_release+0x25/0x30
[42.937] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[42.937] ? percpu_counter_add_batch+0x2e/0xa0
[42.937] ? __x64_sys_ioctl+0x88/0xc0
[42.937] __x64_sys_ioctl+0x88/0xc0
[42.937] do_syscall_64+0x38/0x90
[42.937] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[42.937] RIP: 0033:0x7f381a6ffe9b
[42.937] Code: 00 48 89 44 24 (...)
[42.937] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[42.937] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[42.937] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[42.937] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[42.937] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[42.937] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[42.937] </TASK>
[42.937] ---[ end trace 0000000000000000 ]---
[42.937] BTRFS: error (device sdc: state A) in cleanup_transaction:1977: errno=-28 No space left
[59.196] INFO: task btrfs:346772 blocked for more than 120 seconds.
[59.196] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.196] task:btrfs state:D stack:0 pid:346772 ppid:1 flags:0x00004002
[59.196] Call Trace:
[59.196] <TASK>
[59.196] __schedule+0x392/0xa70
[59.196] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.196] schedule+0x5d/0xd0
[59.196] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.197] ? __pfx_autoremove_wake_function+0x10/0x10
[59.197] scrub_pause_off+0x21/0x50 [btrfs]
[59.197] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.197] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] scrub_stripe+0x20d/0x740 [btrfs]
[59.198] scrub_chunk+0xc4/0x130 [btrfs]
[59.198] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.199] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.199] ? _copy_from_user+0x7b/0x80
[59.199] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.199] ? refill_stock+0x33/0x50
[59.199] ? should_failslab+0xa/0x20
[59.199] ? kmem_cache_alloc_node+0x151/0x460
[59.199] ? alloc_io_context+0x1b/0x80
[59.199] ? preempt_count_add+0x70/0xa0
[59.199] ? __x64_sys_ioctl+0x88/0xc0
[59.199] __x64_sys_ioctl+0x88/0xc0
[59.199] do_syscall_64+0x38/0x90
[59.199] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.199] RIP: 0033:0x7f82ffaffe9b
[59.199] RSP: 002b:00007f82ff9fcc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.199] RAX: ffffffffffffffda RBX: 000055b191e36310 RCX: 00007f82ffaffe9b
[59.199] RDX: 000055b191e36310 RSI: 00000000c400941b RDI: 0000000000000003
[59.199] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.199] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff9fd640
[59.199] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.199] </TASK>
[59.199] INFO: task btrfs:346773 blocked for more than 120 seconds.
[59.200] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.201] task:btrfs state:D stack:0 pid:346773 ppid:1 flags:0x00004002
[59.201] Call Trace:
[59.201] <TASK>
[59.201] __schedule+0x392/0xa70
[59.201] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.201] schedule+0x5d/0xd0
[59.201] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.201] ? __pfx_autoremove_wake_function+0x10/0x10
[59.201] scrub_pause_off+0x21/0x50 [btrfs]
[59.202] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.202] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.202] ? __pfx_autoremove_wake_function+0x10/0x10
[59.202] scrub_stripe+0x20d/0x740 [btrfs]
[59.202] scrub_chunk+0xc4/0x130 [btrfs]
[59.203] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.203] ? __pfx_autoremove_wake_function+0x10/0x10
[59.203] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.203] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.203] ? _copy_from_user+0x7b/0x80
[59.203] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.204] ? should_failslab+0xa/0x20
[59.204] ? kmem_cache_alloc_node+0x151/0x460
[59.204] ? alloc_io_context+0x1b/0x80
[59.204] ? preempt_count_add+0x70/0xa0
[59.204] ? __x64_sys_ioctl+0x88/0xc0
[59.204] __x64_sys_ioctl+0x88/0xc0
[59.204] do_syscall_64+0x38/0x90
[59.204] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.204] RIP: 0033:0x7f82ffaffe9b
[59.204] RSP: 002b:00007f82ff1fbc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.204] RAX: ffffffffffffffda RBX: 000055b191e36790 RCX: 00007f82ffaffe9b
[59.204] RDX: 000055b191e36790 RSI: 00000000c400941b RDI: 0000000000000003
[59.204] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.204] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff1fc640
[59.204] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.204] </TASK>
[59.204] INFO: task btrfs:346774 blocked for more than 120 seconds.
[59.205] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.206] task:btrfs state:D stack:0 pid:346774 ppid:1 flags:0x00004002
[59.206] Call Trace:
[59.206] <TASK>
[59.206] __schedule+0x392/0xa70
[59.206] schedule+0x5d/0xd0
[59.206] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.206] ? __pfx_autoremove_wake_function+0x10/0x10
[59.206] scrub_pause_off+0x21/0x50 [btrfs]
[59.207] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.207] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.207] ? __pfx_autoremove_wake_function+0x10/0x10
[59.207] scrub_stripe+0x20d/0x740 [btrfs]
[59.208] scrub_chunk+0xc4/0x130 [btrfs]
[59.208] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.208] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.208] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.208] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.209] ? _copy_from_user+0x7b/0x80
[59.209] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.209] ? should_failslab+0xa/0x20
[59.209] ? kmem_cache_alloc_node+0x151/0x460
[59.209] ? alloc_io_context+0x1b/0x80
[59.209] ? preempt_count_add+0x70/0xa0
[59.209] ? __x64_sys_ioctl+0x88/0xc0
[59.209] __x64_sys_ioctl+0x88/0xc0
[59.209] do_syscall_64+0x38/0x90
[59.209] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.209] RIP: 0033:0x7f82ffaffe9b
[59.209] RSP: 002b:00007f82fe9fac50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.209] RAX: ffffffffffffffda RBX: 000055b191e36c10 RCX: 00007f82ffaffe9b
[59.209] RDX: 000055b191e36c10 RSI: 00000000c400941b RDI: 0000000000000003
[59.209] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.209] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe9fb640
[59.209] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.209] </TASK>
[59.209] INFO: task btrfs:346775 blocked for more than 120 seconds.
[59.210] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.211] task:btrfs state:D stack:0 pid:346775 ppid:1 flags:0x00004002
[59.211] Call Trace:
[59.211] <TASK>
[59.211] __schedule+0x392/0xa70
[59.211] schedule+0x5d/0xd0
[59.211] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.211] ? __pfx_autoremove_wake_function+0x10/0x10
[59.211] scrub_pause_off+0x21/0x50 [btrfs]
[59.212] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.212] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.212] ? __pfx_autoremove_wake_function+0x10/0x10
[59.212] scrub_stripe+0x20d/0x740 [btrfs]
[59.213] scrub_chunk+0xc4/0x130 [btrfs]
[59.213] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.213] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.213] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.213] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.214] ? _copy_from_user+0x7b/0x80
[59.214] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.214] ? should_failslab+0xa/0x20
[59.214] ? kmem_cache_alloc_node+0x151/0x460
[59.214] ? alloc_io_context+0x1b/0x80
[59.214] ? preempt_count_add+0x70/0xa0
[59.214] ? __x64_sys_ioctl+0x88/0xc0
[59.214] __x64_sys_ioctl+0x88/0xc0
[59.214] do_syscall_64+0x38/0x90
[59.214] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.214] RIP: 0033:0x7f82ffaffe9b
[59.214] RSP: 002b:00007f82fe1f9c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.214] RAX: ffffffffffffffda RBX: 000055b191e37090 RCX: 00007f82ffaffe9b
[59.214] RDX: 000055b191e37090 RSI: 00000000c400941b RDI: 0000000000000003
[59.214] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.214] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe1fa640
[59.214] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.214] </TASK>
[59.214] INFO: task btrfs:346776 blocked for more than 120 seconds.
[59.215] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.217] task:btrfs state:D stack:0 pid:346776 ppid:1 flags:0x00004002
[59.217] Call Trace:
[59.217] <TASK>
[59.217] __schedule+0x392/0xa70
[59.217] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.217] schedule+0x5d/0xd0
[59.217] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.217] ? __pfx_autoremove_wake_function+0x10/0x10
[59.217] scrub_pause_off+0x21/0x50 [btrfs]
[59.217] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.217] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.218] ? __pfx_autoremove_wake_function+0x10/0x10
[59.218] scrub_stripe+0x20d/0x740 [btrfs]
[59.218] scrub_chunk+0xc4/0x130 [btrfs]
[59.218] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.219] ? __pfx_autoremove_wake_function+0x10/0x10
[59.219] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.219] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.219] ? _copy_from_user+0x7b/0x80
[59.219] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.219] ? should_failslab+0xa/0x20
[59.219] ? kmem_cache_alloc_node+0x151/0x460
[59.219] ? alloc_io_context+0x1b/0x80
[59.219] ? preempt_count_add+0x70/0xa0
[59.219] ? __x64_sys_ioctl+0x88/0xc0
[59.219] __x64_sys_ioctl+0x88/0xc0
[59.219] do_syscall_64+0x38/0x90
[59.219] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.219] RIP: 0033:0x7f82ffaffe9b
[59.219] RSP: 002b:00007f82fd9f8c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.219] RAX: ffffffffffffffda RBX: 000055b191e37510 RCX: 00007f82ffaffe9b
[59.219] RDX: 000055b191e37510 RSI: 00000000c400941b RDI: 0000000000000003
[59.219] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.219] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fd9f9640
[59.219] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.219] </TASK>
[59.219] INFO: task btrfs:346822 blocked for more than 120 seconds.
[59.220] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.222] task:btrfs state:D stack:0 pid:346822 ppid:1 flags:0x00004002
[59.222] Call Trace:
[59.222] <TASK>
[59.222] __schedule+0x392/0xa70
[59.222] schedule+0x5d/0xd0
[59.222] btrfs_scrub_cancel+0x91/0x100 [btrfs]
[59.222] ? __pfx_autoremove_wake_function+0x10/0x10
[59.222] btrfs_commit_transaction+0x572/0xeb0 [btrfs]
[59.223] ? start_transaction+0xcb/0x610 [btrfs]
[59.223] prepare_to_relocate+0x111/0x1a0 [btrfs]
[59.223] relocate_block_group+0x57/0x5d0 [btrfs]
[59.223] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[59.223] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[59.224] ? __pfx_autoremove_wake_function+0x10/0x10
[59.224] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[59.224] btrfs_balance+0x8ff/0x11d0 [btrfs]
[59.224] ? __kmem_cache_alloc_node+0x14a/0x410
[59.224] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[59.225] ? mod_objcg_state+0xd2/0x360
[59.225] ? refill_obj_stock+0xb0/0x160
[59.225] ? seq_release+0x25/0x30
[59.225] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[59.225] ? percpu_counter_add_batch+0x2e/0xa0
[59.225] ? __x64_sys_ioctl+0x88/0xc0
[59.225] __x64_sys_ioctl+0x88/0xc0
[59.225] do_syscall_64+0x38/0x90
[59.225] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.225] RIP: 0033:0x7f381a6ffe9b
[59.225] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.225] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[59.225] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[59.225] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[59.225] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[59.225] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[59.225] </TASK>
What happens is the following:
1) A scrub is running, so fs_info->scrubs_running is 1;
2) Task A starts block group relocation, and at btrfs_relocate_chunk() it
pauses scrub by calling btrfs_scrub_pause(). That increments
fs_info->scrub_pause_req from 0 to 1 and waits for the scrub task to
pause (for fs_info->scrubs_paused to be == to fs_info->scrubs_running);
3) The scrub task pauses at scrub_pause_off(), waiting for
fs_info->scrub_pause_req to decrease to 0;
4) Task A then enters btrfs_relocate_block_group(), and down that call
chain we start a transaction and then attempt to commit it;
5) When task A calls btrfs_commit_transaction(), it either will do the
commit itself or wait for some other task that already started the
commit of the transaction - it doesn't matter which case;
6) The transaction commit enters state TRANS_STATE_COMMIT_START;
7) An error happens during the transaction commit, like -ENOSPC when
running delayed refs or delayed items for example;
8) This results in calling transaction.c:cleanup_transaction(), where
we call btrfs_scrub_cancel(), incrementing fs_info->scrub_cancel_req
from 0 to 1, and blocking this task waiting for fs_info->scrubs_running
to decrease to 0;
9) From this point on, both the transaction commit and the scrub task
hang forever:
1) The transaction commit is waiting for fs_info->scrubs_running to
be decreased to 0;
2) The scrub task is at scrub_pause_off() waiting for
fs_info->scrub_pause_req to decrease to 0 - so it can not proceed
to stop the scrub and decrement fs_info->scrubs_running from 0 to 1.
Therefore resulting in a deadlock.
Fix this by having cleanup_transaction(), called if a transaction commit
fails, not call btrfs_scrub_cancel() if relocation is in progress, and
having btrfs_relocate_block_group() call btrfs_scrub_cancel() instead if
the relocation failed and a transaction abort happened.
This was triggered with btrfs/061 from fstests.
Fixes: 55e3a601c81c ("btrfs: Fix data checksum error cause by replace with io-load.")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-22 17:46:34 +08:00
|
|
|
if (ret) {
|
|
|
|
/*
|
|
|
|
* If we had a transaction abort, stop all running scrubs.
|
|
|
|
* See transaction.c:cleanup_transaction() why we do it here.
|
|
|
|
*/
|
|
|
|
if (BTRFS_FS_ERROR(fs_info))
|
|
|
|
btrfs_scrub_cancel(fs_info);
|
2014-09-18 23:20:02 +08:00
|
|
|
return ret;
|
btrfs: fix deadlock when aborting transaction during relocation with scrub
Before relocating a block group we pause scrub, then do the relocation and
then unpause scrub. The relocation process requires starting and committing
a transaction, and if we have a failure in the critical section of the
transaction commit path (transaction state >= TRANS_STATE_COMMIT_START),
we will deadlock if there is a paused scrub.
That results in stack traces like the following:
[42.479] BTRFS info (device sdc): relocating block group 53876686848 flags metadata|raid6
[42.936] BTRFS warning (device sdc): Skipping commit of aborted transaction.
[42.936] ------------[ cut here ]------------
[42.936] BTRFS: Transaction aborted (error -28)
[42.936] WARNING: CPU: 11 PID: 346822 at fs/btrfs/transaction.c:1977 btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Modules linked in: dm_flakey dm_mod loop btrfs (...)
[42.936] CPU: 11 PID: 346822 Comm: btrfs Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[42.936] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[42.936] RIP: 0010:btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Code: ff ff 45 8b (...)
[42.936] RSP: 0018:ffffb58649633b48 EFLAGS: 00010282
[42.936] RAX: 0000000000000000 RBX: ffff8be6ef4d5bd8 RCX: 0000000000000000
[42.936] RDX: 0000000000000002 RSI: ffffffffb35e7782 RDI: 00000000ffffffff
[42.936] RBP: ffff8be6ef4d5c98 R08: 0000000000000000 R09: ffffb586496339e8
[42.936] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8be6d38c7c00
[42.936] R13: 00000000ffffffe4 R14: ffff8be6c268c000 R15: ffff8be6ef4d5cf0
[42.936] FS: 00007f381a82b340(0000) GS:ffff8beddfcc0000(0000) knlGS:0000000000000000
[42.936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42.936] CR2: 00007f1e35fb7638 CR3: 0000000117680006 CR4: 0000000000370ee0
[42.936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42.936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42.936] Call Trace:
[42.936] <TASK>
[42.936] ? start_transaction+0xcb/0x610 [btrfs]
[42.936] prepare_to_relocate+0x111/0x1a0 [btrfs]
[42.936] relocate_block_group+0x57/0x5d0 [btrfs]
[42.936] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[42.936] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[42.936] ? __pfx_autoremove_wake_function+0x10/0x10
[42.936] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[42.936] btrfs_balance+0x8ff/0x11d0 [btrfs]
[42.936] ? __kmem_cache_alloc_node+0x14a/0x410
[42.936] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[42.937] ? mod_objcg_state+0xd2/0x360
[42.937] ? refill_obj_stock+0xb0/0x160
[42.937] ? seq_release+0x25/0x30
[42.937] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[42.937] ? percpu_counter_add_batch+0x2e/0xa0
[42.937] ? __x64_sys_ioctl+0x88/0xc0
[42.937] __x64_sys_ioctl+0x88/0xc0
[42.937] do_syscall_64+0x38/0x90
[42.937] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[42.937] RIP: 0033:0x7f381a6ffe9b
[42.937] Code: 00 48 89 44 24 (...)
[42.937] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[42.937] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[42.937] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[42.937] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[42.937] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[42.937] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[42.937] </TASK>
[42.937] ---[ end trace 0000000000000000 ]---
[42.937] BTRFS: error (device sdc: state A) in cleanup_transaction:1977: errno=-28 No space left
[59.196] INFO: task btrfs:346772 blocked for more than 120 seconds.
[59.196] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.196] task:btrfs state:D stack:0 pid:346772 ppid:1 flags:0x00004002
[59.196] Call Trace:
[59.196] <TASK>
[59.196] __schedule+0x392/0xa70
[59.196] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.196] schedule+0x5d/0xd0
[59.196] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.197] ? __pfx_autoremove_wake_function+0x10/0x10
[59.197] scrub_pause_off+0x21/0x50 [btrfs]
[59.197] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.197] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] scrub_stripe+0x20d/0x740 [btrfs]
[59.198] scrub_chunk+0xc4/0x130 [btrfs]
[59.198] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.199] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.199] ? _copy_from_user+0x7b/0x80
[59.199] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.199] ? refill_stock+0x33/0x50
[59.199] ? should_failslab+0xa/0x20
[59.199] ? kmem_cache_alloc_node+0x151/0x460
[59.199] ? alloc_io_context+0x1b/0x80
[59.199] ? preempt_count_add+0x70/0xa0
[59.199] ? __x64_sys_ioctl+0x88/0xc0
[59.199] __x64_sys_ioctl+0x88/0xc0
[59.199] do_syscall_64+0x38/0x90
[59.199] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.199] RIP: 0033:0x7f82ffaffe9b
[59.199] RSP: 002b:00007f82ff9fcc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.199] RAX: ffffffffffffffda RBX: 000055b191e36310 RCX: 00007f82ffaffe9b
[59.199] RDX: 000055b191e36310 RSI: 00000000c400941b RDI: 0000000000000003
[59.199] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.199] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff9fd640
[59.199] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.199] </TASK>
[59.199] INFO: task btrfs:346773 blocked for more than 120 seconds.
[59.200] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.201] task:btrfs state:D stack:0 pid:346773 ppid:1 flags:0x00004002
[59.201] Call Trace:
[59.201] <TASK>
[59.201] __schedule+0x392/0xa70
[59.201] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.201] schedule+0x5d/0xd0
[59.201] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.201] ? __pfx_autoremove_wake_function+0x10/0x10
[59.201] scrub_pause_off+0x21/0x50 [btrfs]
[59.202] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.202] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.202] ? __pfx_autoremove_wake_function+0x10/0x10
[59.202] scrub_stripe+0x20d/0x740 [btrfs]
[59.202] scrub_chunk+0xc4/0x130 [btrfs]
[59.203] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.203] ? __pfx_autoremove_wake_function+0x10/0x10
[59.203] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.203] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.203] ? _copy_from_user+0x7b/0x80
[59.203] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.204] ? should_failslab+0xa/0x20
[59.204] ? kmem_cache_alloc_node+0x151/0x460
[59.204] ? alloc_io_context+0x1b/0x80
[59.204] ? preempt_count_add+0x70/0xa0
[59.204] ? __x64_sys_ioctl+0x88/0xc0
[59.204] __x64_sys_ioctl+0x88/0xc0
[59.204] do_syscall_64+0x38/0x90
[59.204] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.204] RIP: 0033:0x7f82ffaffe9b
[59.204] RSP: 002b:00007f82ff1fbc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.204] RAX: ffffffffffffffda RBX: 000055b191e36790 RCX: 00007f82ffaffe9b
[59.204] RDX: 000055b191e36790 RSI: 00000000c400941b RDI: 0000000000000003
[59.204] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.204] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff1fc640
[59.204] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.204] </TASK>
[59.204] INFO: task btrfs:346774 blocked for more than 120 seconds.
[59.205] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.206] task:btrfs state:D stack:0 pid:346774 ppid:1 flags:0x00004002
[59.206] Call Trace:
[59.206] <TASK>
[59.206] __schedule+0x392/0xa70
[59.206] schedule+0x5d/0xd0
[59.206] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.206] ? __pfx_autoremove_wake_function+0x10/0x10
[59.206] scrub_pause_off+0x21/0x50 [btrfs]
[59.207] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.207] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.207] ? __pfx_autoremove_wake_function+0x10/0x10
[59.207] scrub_stripe+0x20d/0x740 [btrfs]
[59.208] scrub_chunk+0xc4/0x130 [btrfs]
[59.208] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.208] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.208] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.208] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.209] ? _copy_from_user+0x7b/0x80
[59.209] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.209] ? should_failslab+0xa/0x20
[59.209] ? kmem_cache_alloc_node+0x151/0x460
[59.209] ? alloc_io_context+0x1b/0x80
[59.209] ? preempt_count_add+0x70/0xa0
[59.209] ? __x64_sys_ioctl+0x88/0xc0
[59.209] __x64_sys_ioctl+0x88/0xc0
[59.209] do_syscall_64+0x38/0x90
[59.209] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.209] RIP: 0033:0x7f82ffaffe9b
[59.209] RSP: 002b:00007f82fe9fac50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.209] RAX: ffffffffffffffda RBX: 000055b191e36c10 RCX: 00007f82ffaffe9b
[59.209] RDX: 000055b191e36c10 RSI: 00000000c400941b RDI: 0000000000000003
[59.209] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.209] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe9fb640
[59.209] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.209] </TASK>
[59.209] INFO: task btrfs:346775 blocked for more than 120 seconds.
[59.210] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.211] task:btrfs state:D stack:0 pid:346775 ppid:1 flags:0x00004002
[59.211] Call Trace:
[59.211] <TASK>
[59.211] __schedule+0x392/0xa70
[59.211] schedule+0x5d/0xd0
[59.211] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.211] ? __pfx_autoremove_wake_function+0x10/0x10
[59.211] scrub_pause_off+0x21/0x50 [btrfs]
[59.212] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.212] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.212] ? __pfx_autoremove_wake_function+0x10/0x10
[59.212] scrub_stripe+0x20d/0x740 [btrfs]
[59.213] scrub_chunk+0xc4/0x130 [btrfs]
[59.213] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.213] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.213] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.213] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.214] ? _copy_from_user+0x7b/0x80
[59.214] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.214] ? should_failslab+0xa/0x20
[59.214] ? kmem_cache_alloc_node+0x151/0x460
[59.214] ? alloc_io_context+0x1b/0x80
[59.214] ? preempt_count_add+0x70/0xa0
[59.214] ? __x64_sys_ioctl+0x88/0xc0
[59.214] __x64_sys_ioctl+0x88/0xc0
[59.214] do_syscall_64+0x38/0x90
[59.214] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.214] RIP: 0033:0x7f82ffaffe9b
[59.214] RSP: 002b:00007f82fe1f9c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.214] RAX: ffffffffffffffda RBX: 000055b191e37090 RCX: 00007f82ffaffe9b
[59.214] RDX: 000055b191e37090 RSI: 00000000c400941b RDI: 0000000000000003
[59.214] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.214] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe1fa640
[59.214] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.214] </TASK>
[59.214] INFO: task btrfs:346776 blocked for more than 120 seconds.
[59.215] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.217] task:btrfs state:D stack:0 pid:346776 ppid:1 flags:0x00004002
[59.217] Call Trace:
[59.217] <TASK>
[59.217] __schedule+0x392/0xa70
[59.217] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.217] schedule+0x5d/0xd0
[59.217] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.217] ? __pfx_autoremove_wake_function+0x10/0x10
[59.217] scrub_pause_off+0x21/0x50 [btrfs]
[59.217] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.217] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.218] ? __pfx_autoremove_wake_function+0x10/0x10
[59.218] scrub_stripe+0x20d/0x740 [btrfs]
[59.218] scrub_chunk+0xc4/0x130 [btrfs]
[59.218] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.219] ? __pfx_autoremove_wake_function+0x10/0x10
[59.219] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.219] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.219] ? _copy_from_user+0x7b/0x80
[59.219] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.219] ? should_failslab+0xa/0x20
[59.219] ? kmem_cache_alloc_node+0x151/0x460
[59.219] ? alloc_io_context+0x1b/0x80
[59.219] ? preempt_count_add+0x70/0xa0
[59.219] ? __x64_sys_ioctl+0x88/0xc0
[59.219] __x64_sys_ioctl+0x88/0xc0
[59.219] do_syscall_64+0x38/0x90
[59.219] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.219] RIP: 0033:0x7f82ffaffe9b
[59.219] RSP: 002b:00007f82fd9f8c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.219] RAX: ffffffffffffffda RBX: 000055b191e37510 RCX: 00007f82ffaffe9b
[59.219] RDX: 000055b191e37510 RSI: 00000000c400941b RDI: 0000000000000003
[59.219] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.219] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fd9f9640
[59.219] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.219] </TASK>
[59.219] INFO: task btrfs:346822 blocked for more than 120 seconds.
[59.220] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.222] task:btrfs state:D stack:0 pid:346822 ppid:1 flags:0x00004002
[59.222] Call Trace:
[59.222] <TASK>
[59.222] __schedule+0x392/0xa70
[59.222] schedule+0x5d/0xd0
[59.222] btrfs_scrub_cancel+0x91/0x100 [btrfs]
[59.222] ? __pfx_autoremove_wake_function+0x10/0x10
[59.222] btrfs_commit_transaction+0x572/0xeb0 [btrfs]
[59.223] ? start_transaction+0xcb/0x610 [btrfs]
[59.223] prepare_to_relocate+0x111/0x1a0 [btrfs]
[59.223] relocate_block_group+0x57/0x5d0 [btrfs]
[59.223] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[59.223] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[59.224] ? __pfx_autoremove_wake_function+0x10/0x10
[59.224] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[59.224] btrfs_balance+0x8ff/0x11d0 [btrfs]
[59.224] ? __kmem_cache_alloc_node+0x14a/0x410
[59.224] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[59.225] ? mod_objcg_state+0xd2/0x360
[59.225] ? refill_obj_stock+0xb0/0x160
[59.225] ? seq_release+0x25/0x30
[59.225] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[59.225] ? percpu_counter_add_batch+0x2e/0xa0
[59.225] ? __x64_sys_ioctl+0x88/0xc0
[59.225] __x64_sys_ioctl+0x88/0xc0
[59.225] do_syscall_64+0x38/0x90
[59.225] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.225] RIP: 0033:0x7f381a6ffe9b
[59.225] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.225] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[59.225] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[59.225] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[59.225] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[59.225] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[59.225] </TASK>
What happens is the following:
1) A scrub is running, so fs_info->scrubs_running is 1;
2) Task A starts block group relocation, and at btrfs_relocate_chunk() it
pauses scrub by calling btrfs_scrub_pause(). That increments
fs_info->scrub_pause_req from 0 to 1 and waits for the scrub task to
pause (for fs_info->scrubs_paused to be == to fs_info->scrubs_running);
3) The scrub task pauses at scrub_pause_off(), waiting for
fs_info->scrub_pause_req to decrease to 0;
4) Task A then enters btrfs_relocate_block_group(), and down that call
chain we start a transaction and then attempt to commit it;
5) When task A calls btrfs_commit_transaction(), it either will do the
commit itself or wait for some other task that already started the
commit of the transaction - it doesn't matter which case;
6) The transaction commit enters state TRANS_STATE_COMMIT_START;
7) An error happens during the transaction commit, like -ENOSPC when
running delayed refs or delayed items for example;
8) This results in calling transaction.c:cleanup_transaction(), where
we call btrfs_scrub_cancel(), incrementing fs_info->scrub_cancel_req
from 0 to 1, and blocking this task waiting for fs_info->scrubs_running
to decrease to 0;
9) From this point on, both the transaction commit and the scrub task
hang forever:
1) The transaction commit is waiting for fs_info->scrubs_running to
be decreased to 0;
2) The scrub task is at scrub_pause_off() waiting for
fs_info->scrub_pause_req to decrease to 0 - so it can not proceed
to stop the scrub and decrement fs_info->scrubs_running from 0 to 1.
Therefore resulting in a deadlock.
Fix this by having cleanup_transaction(), called if a transaction commit
fails, not call btrfs_scrub_cancel() if relocation is in progress, and
having btrfs_relocate_block_group() call btrfs_scrub_cancel() instead if
the relocation failed and a transaction abort happened.
This was triggered with btrfs/061 from fstests.
Fixes: 55e3a601c81c ("btrfs: Fix data checksum error cause by replace with io-load.")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-22 17:46:34 +08:00
|
|
|
}
|
2014-09-18 23:20:02 +08:00
|
|
|
|
2019-12-14 08:22:14 +08:00
|
|
|
block_group = btrfs_lookup_block_group(fs_info, chunk_offset);
|
|
|
|
if (!block_group)
|
|
|
|
return -ENOENT;
|
|
|
|
btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
|
2021-04-19 15:41:00 +08:00
|
|
|
length = block_group->length;
|
2019-12-14 08:22:14 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
|
2021-04-19 15:41:00 +08:00
|
|
|
/*
|
|
|
|
* On a zoned file system, discard the whole block group, this will
|
|
|
|
* trigger a REQ_OP_ZONE_RESET operation on the device zone. If
|
|
|
|
* resetting the zone fails, don't treat it as a fatal problem from the
|
|
|
|
* filesystem's point of view.
|
|
|
|
*/
|
|
|
|
if (btrfs_is_zoned(fs_info)) {
|
|
|
|
ret = btrfs_discard_extent(fs_info, chunk_offset, length, NULL);
|
|
|
|
if (ret)
|
|
|
|
btrfs_info(fs_info,
|
|
|
|
"failed to reset zone %llu after relocation",
|
|
|
|
chunk_offset);
|
|
|
|
}
|
|
|
|
|
2016-10-11 04:43:31 +08:00
|
|
|
trans = btrfs_start_trans_remove_block_group(root->fs_info,
|
|
|
|
chunk_offset);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
btrfs_handle_fs_error(root->fs_info, ret, NULL);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-09-18 23:20:02 +08:00
|
|
|
/*
|
2016-10-11 04:43:31 +08:00
|
|
|
* step two, delete the device extents and the
|
|
|
|
* chunk tree entries
|
2014-09-18 23:20:02 +08:00
|
|
|
*/
|
2018-07-21 00:37:53 +08:00
|
|
|
ret = btrfs_remove_chunk(trans, chunk_offset);
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2016-10-11 04:43:31 +08:00
|
|
|
return ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_root *chunk_root = fs_info->chunk_root;
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_chunk *chunk;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
u64 chunk_type;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
bool retried = false;
|
|
|
|
int failed = 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
again:
|
2008-11-18 10:11:30 +08:00
|
|
|
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
key.type = BTRFS_CHUNK_ITEM_KEY;
|
|
|
|
|
|
|
|
while (1) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_lock(&fs_info->reclaim_bgs_lock);
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
if (ret < 0) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2008-11-18 10:11:30 +08:00
|
|
|
goto error;
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2024-01-24 06:42:29 +08:00
|
|
|
if (ret == 0) {
|
|
|
|
/*
|
|
|
|
* On the first search we would find chunk tree with
|
|
|
|
* offset -1, which is not possible. On subsequent
|
|
|
|
* loops this would find an existing item on an invalid
|
|
|
|
* offset (one less than the previous one, wrong
|
|
|
|
* alignment and size).
|
|
|
|
*/
|
|
|
|
ret = -EUCLEAN;
|
2024-04-19 10:22:48 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2024-01-24 06:42:29 +08:00
|
|
|
goto error;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
ret = btrfs_previous_item(chunk_root, path, key.objectid,
|
|
|
|
key.type);
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
if (ret)
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
if (ret > 0)
|
|
|
|
break;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
chunk = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_chunk);
|
|
|
|
chunk_type = btrfs_chunk_type(leaf, chunk);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) {
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_relocate_chunk(fs_info, found_key.offset);
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
if (ret == -ENOSPC)
|
|
|
|
failed++;
|
2014-07-09 06:21:41 +08:00
|
|
|
else
|
|
|
|
BUG_ON(ret);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
if (found_key.offset == 0)
|
|
|
|
break;
|
|
|
|
key.offset = found_key.offset - 1;
|
|
|
|
}
|
|
|
|
ret = 0;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
if (failed && !retried) {
|
|
|
|
failed = 0;
|
|
|
|
retried = true;
|
|
|
|
goto again;
|
2013-10-31 13:00:08 +08:00
|
|
|
} else if (WARN_ON(failed && retried)) {
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
ret = -ENOSPC;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
error:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
2017-11-16 07:28:11 +08:00
|
|
|
/*
|
|
|
|
* return 1 : allocate a data chunk successfully,
|
|
|
|
* return <0: errors during allocating a data chunk,
|
|
|
|
* return 0 : no need to allocate a data chunk.
|
|
|
|
*/
|
|
|
|
static int btrfs_may_alloc_data_chunk(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 chunk_offset)
|
|
|
|
{
|
2019-10-30 02:20:18 +08:00
|
|
|
struct btrfs_block_group *cache;
|
2017-11-16 07:28:11 +08:00
|
|
|
u64 bytes_used;
|
|
|
|
u64 chunk_type;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
|
|
|
|
ASSERT(cache);
|
|
|
|
chunk_type = cache->flags;
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
|
2019-10-18 17:58:22 +08:00
|
|
|
if (!(chunk_type & BTRFS_BLOCK_GROUP_DATA))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
spin_lock(&fs_info->data_sinfo->lock);
|
|
|
|
bytes_used = fs_info->data_sinfo->bytes_used;
|
|
|
|
spin_unlock(&fs_info->data_sinfo->lock);
|
|
|
|
|
|
|
|
if (!bytes_used) {
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
trans = btrfs_join_transaction(fs_info->tree_root);
|
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
|
|
|
|
ret = btrfs_force_chunk_alloc(trans, BTRFS_BLOCK_GROUP_DATA);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
return 1;
|
2017-11-16 07:28:11 +08:00
|
|
|
}
|
2019-10-18 17:58:22 +08:00
|
|
|
|
2017-11-16 07:28:11 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2024-02-16 20:00:51 +08:00
|
|
|
static void btrfs_disk_balance_args_to_cpu(struct btrfs_balance_args *cpu,
|
|
|
|
const struct btrfs_disk_balance_args *disk)
|
|
|
|
{
|
|
|
|
memset(cpu, 0, sizeof(*cpu));
|
|
|
|
|
|
|
|
cpu->profiles = le64_to_cpu(disk->profiles);
|
|
|
|
cpu->usage = le64_to_cpu(disk->usage);
|
|
|
|
cpu->devid = le64_to_cpu(disk->devid);
|
|
|
|
cpu->pstart = le64_to_cpu(disk->pstart);
|
|
|
|
cpu->pend = le64_to_cpu(disk->pend);
|
|
|
|
cpu->vstart = le64_to_cpu(disk->vstart);
|
|
|
|
cpu->vend = le64_to_cpu(disk->vend);
|
|
|
|
cpu->target = le64_to_cpu(disk->target);
|
|
|
|
cpu->flags = le64_to_cpu(disk->flags);
|
|
|
|
cpu->limit = le64_to_cpu(disk->limit);
|
|
|
|
cpu->stripes_min = le32_to_cpu(disk->stripes_min);
|
|
|
|
cpu->stripes_max = le32_to_cpu(disk->stripes_max);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void btrfs_cpu_balance_args_to_disk(struct btrfs_disk_balance_args *disk,
|
|
|
|
const struct btrfs_balance_args *cpu)
|
|
|
|
{
|
|
|
|
memset(disk, 0, sizeof(*disk));
|
|
|
|
|
|
|
|
disk->profiles = cpu_to_le64(cpu->profiles);
|
|
|
|
disk->usage = cpu_to_le64(cpu->usage);
|
|
|
|
disk->devid = cpu_to_le64(cpu->devid);
|
|
|
|
disk->pstart = cpu_to_le64(cpu->pstart);
|
|
|
|
disk->pend = cpu_to_le64(cpu->pend);
|
|
|
|
disk->vstart = cpu_to_le64(cpu->vstart);
|
|
|
|
disk->vend = cpu_to_le64(cpu->vend);
|
|
|
|
disk->target = cpu_to_le64(cpu->target);
|
|
|
|
disk->flags = cpu_to_le64(cpu->flags);
|
|
|
|
disk->limit = cpu_to_le64(cpu->limit);
|
|
|
|
disk->stripes_min = cpu_to_le32(cpu->stripes_min);
|
|
|
|
disk->stripes_max = cpu_to_le32(cpu->stripes_max);
|
|
|
|
}
|
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
static int insert_balance_item(struct btrfs_fs_info *fs_info,
|
2012-01-17 04:04:48 +08:00
|
|
|
struct btrfs_balance_control *bctl)
|
|
|
|
{
|
2016-06-22 09:16:51 +08:00
|
|
|
struct btrfs_root *root = fs_info->tree_root;
|
2012-01-17 04:04:48 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_balance_item *item;
|
|
|
|
struct btrfs_disk_balance_args disk_bargs;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret, err;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
trans = btrfs_start_transaction(root, 0);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = BTRFS_BALANCE_OBJECTID;
|
2016-01-26 00:51:31 +08:00
|
|
|
key.type = BTRFS_TEMPORARY_ITEM_KEY;
|
2012-01-17 04:04:48 +08:00
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key,
|
|
|
|
sizeof(*item));
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_balance_item);
|
|
|
|
|
2016-11-09 01:09:03 +08:00
|
|
|
memzero_extent_buffer(leaf, (unsigned long)item, sizeof(*item));
|
2012-01-17 04:04:48 +08:00
|
|
|
|
|
|
|
btrfs_cpu_balance_args_to_disk(&disk_bargs, &bctl->data);
|
|
|
|
btrfs_set_balance_data(leaf, item, &disk_bargs);
|
|
|
|
btrfs_cpu_balance_args_to_disk(&disk_bargs, &bctl->meta);
|
|
|
|
btrfs_set_balance_meta(leaf, item, &disk_bargs);
|
|
|
|
btrfs_cpu_balance_args_to_disk(&disk_bargs, &bctl->sys);
|
|
|
|
btrfs_set_balance_sys(leaf, item, &disk_bargs);
|
|
|
|
|
|
|
|
btrfs_set_balance_flags(leaf, item, bctl->flags);
|
|
|
|
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_mark_buffer_dirty(trans, leaf);
|
2012-01-17 04:04:48 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
2016-09-10 09:39:03 +08:00
|
|
|
err = btrfs_commit_transaction(trans);
|
2012-01-17 04:04:48 +08:00
|
|
|
if (err && !ret)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
static int del_balance_item(struct btrfs_fs_info *fs_info)
|
2012-01-17 04:04:48 +08:00
|
|
|
{
|
2016-06-22 09:16:51 +08:00
|
|
|
struct btrfs_root *root = fs_info->tree_root;
|
2012-01-17 04:04:48 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret, err;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
btrfs: allow use of global block reserve for balance item deletion
On a filesystem with exhausted metadata, but still enough to start
balance, it's possible to hit this error:
[324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
[324402.060769] BTRFS info (device loop0): balance: ended with status: -28
[324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left
It fails inside reset_balance_state and turns the filesystem to
read-only, which is unnecessary and should be fixed too, but the problem
is caused by lack for space when the balance item is deleted. This is a
one-time operation and from the same rank as unlink that is allowed to
use the global block reserve. So do the same for the balance item.
Status of the filesystem (100GiB) just after the balance fails:
$ btrfs fi df mnt
Data, single: total=80.01GiB, used=38.58GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.48GiB
GlobalReserve, single: total=512.00MiB, used=50.11MiB
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-25 18:35:28 +08:00
|
|
|
trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
|
2012-01-17 04:04:48 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = BTRFS_BALANCE_OBJECTID;
|
2016-01-26 00:51:31 +08:00
|
|
|
key.type = BTRFS_TEMPORARY_ITEM_KEY;
|
2012-01-17 04:04:48 +08:00
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
2016-09-10 09:39:03 +08:00
|
|
|
err = btrfs_commit_transaction(trans);
|
2012-01-17 04:04:48 +08:00
|
|
|
if (err && !ret)
|
|
|
|
ret = err;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
/*
|
|
|
|
* This is a heuristic used to reduce the number of chunks balanced on
|
|
|
|
* resume after balance was interrupted.
|
|
|
|
*/
|
|
|
|
static void update_balance_args(struct btrfs_balance_control *bctl)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Turn on soft mode for chunk types that were being converted.
|
|
|
|
*/
|
|
|
|
if (bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT)
|
|
|
|
bctl->data.flags |= BTRFS_BALANCE_ARGS_SOFT;
|
|
|
|
if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT)
|
|
|
|
bctl->sys.flags |= BTRFS_BALANCE_ARGS_SOFT;
|
|
|
|
if (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT)
|
|
|
|
bctl->meta.flags |= BTRFS_BALANCE_ARGS_SOFT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Turn on usage filter if is not already used. The idea is
|
|
|
|
* that chunks that we have already balanced should be
|
|
|
|
* reasonably full. Don't do it for chunks that are being
|
|
|
|
* converted - that will keep us from relocating unconverted
|
|
|
|
* (albeit full) chunks.
|
|
|
|
*/
|
|
|
|
if (!(bctl->data.flags & BTRFS_BALANCE_ARGS_USAGE) &&
|
2015-10-21 00:22:13 +08:00
|
|
|
!(bctl->data.flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) &&
|
2012-01-17 04:04:48 +08:00
|
|
|
!(bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT)) {
|
|
|
|
bctl->data.flags |= BTRFS_BALANCE_ARGS_USAGE;
|
|
|
|
bctl->data.usage = 90;
|
|
|
|
}
|
|
|
|
if (!(bctl->sys.flags & BTRFS_BALANCE_ARGS_USAGE) &&
|
2015-10-21 00:22:13 +08:00
|
|
|
!(bctl->sys.flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) &&
|
2012-01-17 04:04:48 +08:00
|
|
|
!(bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT)) {
|
|
|
|
bctl->sys.flags |= BTRFS_BALANCE_ARGS_USAGE;
|
|
|
|
bctl->sys.usage = 90;
|
|
|
|
}
|
|
|
|
if (!(bctl->meta.flags & BTRFS_BALANCE_ARGS_USAGE) &&
|
2015-10-21 00:22:13 +08:00
|
|
|
!(bctl->meta.flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) &&
|
2012-01-17 04:04:48 +08:00
|
|
|
!(bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT)) {
|
|
|
|
bctl->meta.flags |= BTRFS_BALANCE_ARGS_USAGE;
|
|
|
|
bctl->meta.usage = 90;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-03-21 03:23:09 +08:00
|
|
|
/*
|
|
|
|
* Clear the balance status in fs_info and delete the balance item from disk.
|
|
|
|
*/
|
|
|
|
static void reset_balance_state(struct btrfs_fs_info *fs_info)
|
2012-01-17 04:04:47 +08:00
|
|
|
{
|
|
|
|
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
2018-03-21 03:23:09 +08:00
|
|
|
int ret;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2024-01-25 00:23:11 +08:00
|
|
|
ASSERT(fs_info->balance_ctl);
|
2012-01-17 04:04:47 +08:00
|
|
|
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
fs_info->balance_ctl = NULL;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
|
|
|
|
kfree(bctl);
|
2018-03-21 03:23:09 +08:00
|
|
|
ret = del_balance_item(fs_info);
|
|
|
|
if (ret)
|
|
|
|
btrfs_handle_fs_error(fs_info, ret, NULL);
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/*
|
|
|
|
* Balance filters. Return 1 if chunk should be filtered out
|
|
|
|
* (should not be balanced).
|
|
|
|
*/
|
2012-03-27 22:09:16 +08:00
|
|
|
static int chunk_profiles_filter(u64 chunk_type,
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
2012-03-27 22:09:16 +08:00
|
|
|
chunk_type = chunk_to_extended(chunk_type) &
|
|
|
|
BTRFS_EXTENDED_PROFILE_MASK;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
if (bargs->profiles & chunk_type)
|
2012-01-17 04:04:47 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2015-11-17 19:29:32 +08:00
|
|
|
static int chunk_usage_range_filter(struct btrfs_fs_info *fs_info, u64 chunk_offset,
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_balance_args *bargs)
|
2015-10-21 00:22:13 +08:00
|
|
|
{
|
2019-10-30 02:20:18 +08:00
|
|
|
struct btrfs_block_group *cache;
|
2015-10-21 00:22:13 +08:00
|
|
|
u64 chunk_used;
|
|
|
|
u64 user_thresh_min;
|
|
|
|
u64 user_thresh_max;
|
|
|
|
int ret = 1;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
|
2019-10-24 00:48:11 +08:00
|
|
|
chunk_used = cache->used;
|
2015-10-21 00:22:13 +08:00
|
|
|
|
|
|
|
if (bargs->usage_min == 0)
|
|
|
|
user_thresh_min = 0;
|
|
|
|
else
|
2022-10-27 05:25:14 +08:00
|
|
|
user_thresh_min = mult_perc(cache->length, bargs->usage_min);
|
2015-10-21 00:22:13 +08:00
|
|
|
|
|
|
|
if (bargs->usage_max == 0)
|
|
|
|
user_thresh_max = 1;
|
|
|
|
else if (bargs->usage_max > 100)
|
2019-10-24 00:48:22 +08:00
|
|
|
user_thresh_max = cache->length;
|
2015-10-21 00:22:13 +08:00
|
|
|
else
|
2022-10-27 05:25:14 +08:00
|
|
|
user_thresh_max = mult_perc(cache->length, bargs->usage_max);
|
2015-10-21 00:22:13 +08:00
|
|
|
|
|
|
|
if (user_thresh_min <= chunk_used && chunk_used < user_thresh_max)
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2015-11-17 19:29:32 +08:00
|
|
|
static int chunk_usage_filter(struct btrfs_fs_info *fs_info,
|
2015-10-21 00:22:13 +08:00
|
|
|
u64 chunk_offset, struct btrfs_balance_args *bargs)
|
2012-01-17 04:04:47 +08:00
|
|
|
{
|
2019-10-30 02:20:18 +08:00
|
|
|
struct btrfs_block_group *cache;
|
2012-01-17 04:04:47 +08:00
|
|
|
u64 chunk_used, user_thresh;
|
|
|
|
int ret = 1;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
|
2019-10-24 00:48:11 +08:00
|
|
|
chunk_used = cache->used;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2015-10-21 00:22:13 +08:00
|
|
|
if (bargs->usage_min == 0)
|
2013-02-13 00:28:59 +08:00
|
|
|
user_thresh = 1;
|
2013-01-21 21:15:56 +08:00
|
|
|
else if (bargs->usage > 100)
|
2019-10-24 00:48:22 +08:00
|
|
|
user_thresh = cache->length;
|
2013-01-21 21:15:56 +08:00
|
|
|
else
|
2022-10-27 05:25:14 +08:00
|
|
|
user_thresh = mult_perc(cache->length, bargs->usage);
|
2013-01-21 21:15:56 +08:00
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
if (chunk_used < user_thresh)
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
static int chunk_devid_filter(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk,
|
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
struct btrfs_stripe *stripe;
|
|
|
|
int num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
stripe = btrfs_stripe_nr(chunk, i);
|
|
|
|
if (btrfs_stripe_devid(leaf, stripe) == bargs->devid)
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2019-05-17 17:43:34 +08:00
|
|
|
static u64 calc_data_stripes(u64 type, int num_stripes)
|
|
|
|
{
|
|
|
|
const int index = btrfs_bg_flags_to_raid_index(type);
|
|
|
|
const int ncopies = btrfs_raid_array[index].ncopies;
|
|
|
|
const int nparity = btrfs_raid_array[index].nparity;
|
|
|
|
|
2021-07-26 20:15:24 +08:00
|
|
|
return (num_stripes - nparity) / ncopies;
|
2019-05-17 17:43:34 +08:00
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
/* [pstart, pend) */
|
|
|
|
static int chunk_drange_filter(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk,
|
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
struct btrfs_stripe *stripe;
|
|
|
|
int num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
|
|
|
u64 stripe_offset;
|
|
|
|
u64 stripe_length;
|
2019-05-17 17:43:34 +08:00
|
|
|
u64 type;
|
2012-01-17 04:04:48 +08:00
|
|
|
int factor;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (!(bargs->flags & BTRFS_BALANCE_ARGS_DEVID))
|
|
|
|
return 0;
|
|
|
|
|
2019-05-17 17:43:34 +08:00
|
|
|
type = btrfs_chunk_type(leaf, chunk);
|
|
|
|
factor = calc_data_stripes(type, num_stripes);
|
2012-01-17 04:04:48 +08:00
|
|
|
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
stripe = btrfs_stripe_nr(chunk, i);
|
|
|
|
if (btrfs_stripe_devid(leaf, stripe) != bargs->devid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
stripe_offset = btrfs_stripe_offset(leaf, stripe);
|
|
|
|
stripe_length = btrfs_chunk_length(leaf, chunk);
|
2015-01-17 00:26:13 +08:00
|
|
|
stripe_length = div_u64(stripe_length, factor);
|
2012-01-17 04:04:48 +08:00
|
|
|
|
|
|
|
if (stripe_offset < bargs->pend &&
|
|
|
|
stripe_offset + stripe_length > bargs->pstart)
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
/* [vstart, vend) */
|
|
|
|
static int chunk_vrange_filter(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk,
|
|
|
|
u64 chunk_offset,
|
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
if (chunk_offset < bargs->vend &&
|
|
|
|
chunk_offset + btrfs_chunk_length(leaf, chunk) > bargs->vstart)
|
|
|
|
/* at least part of the chunk is inside this vrange */
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2015-09-29 06:32:41 +08:00
|
|
|
static int chunk_stripes_range_filter(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_chunk *chunk,
|
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
int num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
|
|
|
|
|
|
|
if (bargs->stripes_min <= num_stripes
|
|
|
|
&& num_stripes <= bargs->stripes_max)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
static int chunk_soft_convert_filter(u64 chunk_type,
|
2012-01-17 04:04:48 +08:00
|
|
|
struct btrfs_balance_args *bargs)
|
|
|
|
{
|
|
|
|
if (!(bargs->flags & BTRFS_BALANCE_ARGS_CONVERT))
|
|
|
|
return 0;
|
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
chunk_type = chunk_to_extended(chunk_type) &
|
|
|
|
BTRFS_EXTENDED_PROFILE_MASK;
|
2012-01-17 04:04:48 +08:00
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
if (bargs->target == chunk_type)
|
2012-01-17 04:04:48 +08:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-03-20 23:38:52 +08:00
|
|
|
static int should_balance_chunk(struct extent_buffer *leaf,
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_chunk *chunk, u64 chunk_offset)
|
|
|
|
{
|
2019-03-20 23:38:52 +08:00
|
|
|
struct btrfs_fs_info *fs_info = leaf->fs_info;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_balance_args *bargs = NULL;
|
|
|
|
u64 chunk_type = btrfs_chunk_type(leaf, chunk);
|
|
|
|
|
|
|
|
/* type filter */
|
|
|
|
if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
|
|
|
|
(bctl->flags & BTRFS_BALANCE_TYPE_MASK))) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (chunk_type & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
bargs = &bctl->data;
|
|
|
|
else if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
bargs = &bctl->sys;
|
|
|
|
else if (chunk_type & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
bargs = &bctl->meta;
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/* profiles filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_PROFILES) &&
|
|
|
|
chunk_profiles_filter(chunk_type, bargs)) {
|
|
|
|
return 0;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* usage filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_USAGE) &&
|
2016-06-23 06:54:23 +08:00
|
|
|
chunk_usage_filter(fs_info, chunk_offset, bargs)) {
|
2012-01-17 04:04:47 +08:00
|
|
|
return 0;
|
2015-10-21 00:22:13 +08:00
|
|
|
} else if ((bargs->flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) &&
|
2016-06-23 06:54:23 +08:00
|
|
|
chunk_usage_range_filter(fs_info, chunk_offset, bargs)) {
|
2015-10-21 00:22:13 +08:00
|
|
|
return 0;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* devid filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_DEVID) &&
|
|
|
|
chunk_devid_filter(leaf, chunk, bargs)) {
|
|
|
|
return 0;
|
2012-01-17 04:04:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* drange filter, makes sense only with devid filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_DRANGE) &&
|
2017-07-19 15:48:42 +08:00
|
|
|
chunk_drange_filter(leaf, chunk, bargs)) {
|
2012-01-17 04:04:48 +08:00
|
|
|
return 0;
|
2012-01-17 04:04:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* vrange filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_VRANGE) &&
|
|
|
|
chunk_vrange_filter(leaf, chunk, chunk_offset, bargs)) {
|
|
|
|
return 0;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
2015-09-29 06:32:41 +08:00
|
|
|
/* stripes filter */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_STRIPES_RANGE) &&
|
|
|
|
chunk_stripes_range_filter(leaf, chunk, bargs)) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
/* soft profile changing mode */
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_SOFT) &&
|
|
|
|
chunk_soft_convert_filter(chunk_type, bargs)) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-05-07 23:37:51 +08:00
|
|
|
/*
|
|
|
|
* limited by count, must be the last filter
|
|
|
|
*/
|
|
|
|
if ((bargs->flags & BTRFS_BALANCE_ARGS_LIMIT)) {
|
|
|
|
if (bargs->limit == 0)
|
|
|
|
return 0;
|
|
|
|
else
|
|
|
|
bargs->limit--;
|
2015-10-10 23:16:50 +08:00
|
|
|
} else if ((bargs->flags & BTRFS_BALANCE_ARGS_LIMIT_RANGE)) {
|
|
|
|
/*
|
|
|
|
* Same logic as the 'limit' filter; the minimum cannot be
|
2016-05-20 09:18:45 +08:00
|
|
|
* determined here because we do not have the global information
|
2015-10-10 23:16:50 +08:00
|
|
|
* about the count of all chunks that satisfy the filters.
|
|
|
|
*/
|
|
|
|
if (bargs->limit_max == 0)
|
|
|
|
return 0;
|
|
|
|
else
|
|
|
|
bargs->limit_max--;
|
2014-05-07 23:37:51 +08:00
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
static int __btrfs_balance(struct btrfs_fs_info *fs_info)
|
2008-04-29 03:29:52 +08:00
|
|
|
{
|
2012-01-17 04:04:49 +08:00
|
|
|
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_root *chunk_root = fs_info->chunk_root;
|
2015-10-10 23:16:50 +08:00
|
|
|
u64 chunk_type;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_chunk *chunk;
|
2016-07-13 02:24:21 +08:00
|
|
|
struct btrfs_path *path = NULL;
|
2008-04-29 03:29:52 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct extent_buffer *leaf;
|
|
|
|
int slot;
|
2012-01-17 04:04:47 +08:00
|
|
|
int ret;
|
|
|
|
int enospc_errors = 0;
|
2012-01-17 04:04:49 +08:00
|
|
|
bool counting = true;
|
2015-10-10 23:16:50 +08:00
|
|
|
/* The single value limit and min/max limits use the same bytes in the */
|
2014-05-07 23:37:51 +08:00
|
|
|
u64 limit_data = bctl->data.limit;
|
|
|
|
u64 limit_meta = bctl->meta.limit;
|
|
|
|
u64 limit_sys = bctl->sys.limit;
|
2015-10-10 23:16:50 +08:00
|
|
|
u32 count_data = 0;
|
|
|
|
u32 count_meta = 0;
|
|
|
|
u32 count_sys = 0;
|
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance delete last data chunk in case of no data in
the filesystem, then we can see "no data chunk" by "fi usage"
command.
And when we do write operation to fs, the only available data
profile is 0x0, result is all new chunks are allocated single type.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
|
|
|
int chunk_reserved = 0;
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
2011-07-13 02:10:23 +08:00
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto error;
|
|
|
|
}
|
2012-01-17 04:04:49 +08:00
|
|
|
|
|
|
|
/* zero out stat counters */
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
memset(&bctl->stat, 0, sizeof(bctl->stat));
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
again:
|
2014-05-07 23:37:51 +08:00
|
|
|
if (!counting) {
|
2015-10-10 23:16:50 +08:00
|
|
|
/*
|
|
|
|
* The single value limit and min/max limits use the same bytes
|
|
|
|
* in the
|
|
|
|
*/
|
2014-05-07 23:37:51 +08:00
|
|
|
bctl->data.limit = limit_data;
|
|
|
|
bctl->meta.limit = limit_meta;
|
|
|
|
bctl->sys.limit = limit_sys;
|
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
key.type = BTRFS_CHUNK_ITEM_KEY;
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2012-01-17 04:04:49 +08:00
|
|
|
if ((!counting && atomic_read(&fs_info->balance_pause_req)) ||
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_read(&fs_info->balance_cancel_req)) {
|
2012-01-17 04:04:49 +08:00
|
|
|
ret = -ECANCELED;
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_lock(&fs_info->reclaim_bgs_lock);
|
2008-04-29 03:29:52 +08:00
|
|
|
ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
if (ret < 0) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2008-04-29 03:29:52 +08:00
|
|
|
goto error;
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this shouldn't happen, it means the last relocate
|
|
|
|
* failed
|
|
|
|
*/
|
|
|
|
if (ret == 0)
|
2012-01-17 04:04:47 +08:00
|
|
|
BUG(); /* FIXME break ? */
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
ret = btrfs_previous_item(chunk_root, path, 0,
|
|
|
|
BTRFS_CHUNK_ITEM_KEY);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (ret) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2012-01-17 04:04:47 +08:00
|
|
|
ret = 0;
|
2008-04-29 03:29:52 +08:00
|
|
|
break;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, slot);
|
2008-07-09 02:19:17 +08:00
|
|
|
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
if (found_key.objectid != key.objectid) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2008-04-29 03:29:52 +08:00
|
|
|
break;
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
|
2015-10-10 23:16:50 +08:00
|
|
|
chunk_type = btrfs_chunk_type(leaf, chunk);
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
if (!counting) {
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
bctl->stat.considered++;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
}
|
|
|
|
|
2019-03-20 23:38:52 +08:00
|
|
|
ret = should_balance_chunk(leaf, chunk, found_key.offset);
|
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance delete last data chunk in case of no data in
the filesystem, then we can see "no data chunk" by "fi usage"
command.
And when we do write operation to fs, the only available data
profile is 0x0, result is all new chunks are allocated single type.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
if (!ret) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2012-01-17 04:04:47 +08:00
|
|
|
goto loop;
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
if (counting) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2012-01-17 04:04:49 +08:00
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
bctl->stat.expected++;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
2015-10-10 23:16:50 +08:00
|
|
|
|
|
|
|
if (chunk_type & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
count_data++;
|
|
|
|
else if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
count_sys++;
|
|
|
|
else if (chunk_type & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
count_meta++;
|
|
|
|
|
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Apply limit_min filter, no need to check if the LIMITS
|
|
|
|
* filter is used, limit_min is 0 by default
|
|
|
|
*/
|
|
|
|
if (((chunk_type & BTRFS_BLOCK_GROUP_DATA) &&
|
|
|
|
count_data < bctl->data.limit_min)
|
|
|
|
|| ((chunk_type & BTRFS_BLOCK_GROUP_METADATA) &&
|
|
|
|
count_meta < bctl->meta.limit_min)
|
|
|
|
|| ((chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) &&
|
|
|
|
count_sys < bctl->sys.limit_min)) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2012-01-17 04:04:49 +08:00
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
|
2017-11-16 07:28:11 +08:00
|
|
|
if (!chunk_reserved) {
|
|
|
|
/*
|
|
|
|
* We may be relocating the only data chunk we have,
|
|
|
|
* which could potentially end up with losing data's
|
|
|
|
* raid profile, so lets allocate an empty one in
|
|
|
|
* advance.
|
|
|
|
*/
|
|
|
|
ret = btrfs_may_alloc_data_chunk(fs_info,
|
|
|
|
found_key.offset);
|
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance delete last data chunk in case of no data in
the filesystem, then we can see "no data chunk" by "fi usage"
command.
And when we do write operation to fs, the only available data
profile is 0x0, result is all new chunks are allocated single type.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
|
|
|
if (ret < 0) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance delete last data chunk in case of no data in
the filesystem, then we can see "no data chunk" by "fi usage"
command.
And when we do write operation to fs, the only available data
profile is 0x0, result is all new chunks are allocated single type.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
|
|
|
goto error;
|
2017-11-16 07:28:11 +08:00
|
|
|
} else if (ret == 1) {
|
|
|
|
chunk_reserved = 1;
|
btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance delete last data chunk in case of no data in
the filesystem, then we can see "no data chunk" by "fi usage"
command.
And when we do write operation to fs, the only available data
profile is 0x0, result is all new chunks are allocated single type.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-09 11:51:32 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-21 22:40:19 +08:00
|
|
|
ret = btrfs_relocate_chunk(fs_info, found_key.offset);
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2012-01-17 04:04:49 +08:00
|
|
|
if (ret == -ENOSPC) {
|
2012-01-17 04:04:47 +08:00
|
|
|
enospc_errors++;
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
} else if (ret == -ETXTBSY) {
|
|
|
|
btrfs_info(fs_info,
|
|
|
|
"skipping relocation of block group %llu due to active swapfile",
|
|
|
|
found_key.offset);
|
|
|
|
ret = 0;
|
|
|
|
} else if (ret) {
|
|
|
|
goto error;
|
2012-01-17 04:04:49 +08:00
|
|
|
} else {
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
bctl->stat.completed++;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
loop:
|
2013-08-27 18:50:44 +08:00
|
|
|
if (found_key.offset == 0)
|
|
|
|
break;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
key.offset = found_key.offset - 1;
|
2008-04-29 03:29:52 +08:00
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
if (counting) {
|
|
|
|
btrfs_release_path(path);
|
|
|
|
counting = false;
|
|
|
|
goto again;
|
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
error:
|
|
|
|
btrfs_free_path(path);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (enospc_errors) {
|
2013-12-21 00:37:06 +08:00
|
|
|
btrfs_info(fs_info, "%d enospc errors during balance",
|
2016-09-20 22:05:00 +08:00
|
|
|
enospc_errors);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (!ret)
|
|
|
|
ret = -ENOSPC;
|
|
|
|
}
|
|
|
|
|
2008-04-29 03:29:52 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* See if a given profile is valid and reduced.
|
|
|
|
*
|
|
|
|
* @flags: profile to validate
|
|
|
|
* @extended: if true @flags is treated as an extended profile
|
2012-03-27 22:09:17 +08:00
|
|
|
*/
|
|
|
|
static int alloc_profile_is_valid(u64 flags, int extended)
|
|
|
|
{
|
|
|
|
u64 mask = (extended ? BTRFS_EXTENDED_PROFILE_MASK :
|
|
|
|
BTRFS_BLOCK_GROUP_PROFILE_MASK);
|
|
|
|
|
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_TYPE_MASK;
|
|
|
|
|
|
|
|
/* 1) check that all other bits are zeroed */
|
|
|
|
if (flags & ~mask)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* 2) see if profile is reduced */
|
|
|
|
if (flags == 0)
|
|
|
|
return !extended; /* "0" is valid for usual profiles */
|
|
|
|
|
2019-10-02 01:44:42 +08:00
|
|
|
return has_single_bit_set(flags);
|
2012-03-27 22:09:17 +08:00
|
|
|
}
|
|
|
|
|
2020-02-28 04:00:52 +08:00
|
|
|
/*
|
|
|
|
* Validate target profile against allowed profiles and return true if it's OK.
|
|
|
|
* Otherwise print the error message and return false.
|
|
|
|
*/
|
|
|
|
static inline int validate_convert_profile(struct btrfs_fs_info *fs_info,
|
|
|
|
const struct btrfs_balance_args *bargs,
|
|
|
|
u64 allowed, const char *type)
|
2015-09-23 04:02:25 +08:00
|
|
|
{
|
2020-02-28 04:00:52 +08:00
|
|
|
if (!(bargs->flags & BTRFS_BALANCE_ARGS_CONVERT))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/* Profile is valid and does not have bits outside of the allowed set */
|
|
|
|
if (alloc_profile_is_valid(bargs->target, 1) &&
|
|
|
|
(bargs->target & ~allowed) == 0)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
btrfs_err(fs_info, "balance: invalid convert %s profile %s",
|
|
|
|
type, btrfs_bg_type_to_raid_name(bargs->target));
|
|
|
|
return false;
|
2015-09-23 04:02:25 +08:00
|
|
|
}
|
|
|
|
|
btrfs: balance: print args during start and resume
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.
Example command:
$ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
Example kernel log output:
BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-20 16:12:56 +08:00
|
|
|
/*
|
|
|
|
* Fill @buf with textual description of balance filter flags @bargs, up to
|
|
|
|
* @size_buf including the terminating null. The output may be trimmed if it
|
|
|
|
* does not fit into the provided buffer.
|
|
|
|
*/
|
|
|
|
static void describe_balance_args(struct btrfs_balance_args *bargs, char *buf,
|
|
|
|
u32 size_buf)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
u32 size_bp = size_buf;
|
|
|
|
char *bp = buf;
|
|
|
|
u64 flags = bargs->flags;
|
|
|
|
char tmp_buf[128] = {'\0'};
|
|
|
|
|
|
|
|
if (!flags)
|
|
|
|
return;
|
|
|
|
|
|
|
|
#define CHECK_APPEND_NOARG(a) \
|
|
|
|
do { \
|
|
|
|
ret = snprintf(bp, size_bp, (a)); \
|
|
|
|
if (ret < 0 || ret >= size_bp) \
|
|
|
|
goto out_overflow; \
|
|
|
|
size_bp -= ret; \
|
|
|
|
bp += ret; \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
#define CHECK_APPEND_1ARG(a, v1) \
|
|
|
|
do { \
|
|
|
|
ret = snprintf(bp, size_bp, (a), (v1)); \
|
|
|
|
if (ret < 0 || ret >= size_bp) \
|
|
|
|
goto out_overflow; \
|
|
|
|
size_bp -= ret; \
|
|
|
|
bp += ret; \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
#define CHECK_APPEND_2ARG(a, v1, v2) \
|
|
|
|
do { \
|
|
|
|
ret = snprintf(bp, size_bp, (a), (v1), (v2)); \
|
|
|
|
if (ret < 0 || ret >= size_bp) \
|
|
|
|
goto out_overflow; \
|
|
|
|
size_bp -= ret; \
|
|
|
|
bp += ret; \
|
|
|
|
} while (0)
|
|
|
|
|
2019-05-17 17:43:41 +08:00
|
|
|
if (flags & BTRFS_BALANCE_ARGS_CONVERT)
|
|
|
|
CHECK_APPEND_1ARG("convert=%s,",
|
|
|
|
btrfs_bg_type_to_raid_name(bargs->target));
|
btrfs: balance: print args during start and resume
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.
Example command:
$ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
Example kernel log output:
BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-20 16:12:56 +08:00
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_SOFT)
|
|
|
|
CHECK_APPEND_NOARG("soft,");
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_PROFILES) {
|
|
|
|
btrfs_describe_block_groups(bargs->profiles, tmp_buf,
|
|
|
|
sizeof(tmp_buf));
|
|
|
|
CHECK_APPEND_1ARG("profiles=%s,", tmp_buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_USAGE)
|
|
|
|
CHECK_APPEND_1ARG("usage=%llu,", bargs->usage);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_USAGE_RANGE)
|
|
|
|
CHECK_APPEND_2ARG("usage=%u..%u,",
|
|
|
|
bargs->usage_min, bargs->usage_max);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_DEVID)
|
|
|
|
CHECK_APPEND_1ARG("devid=%llu,", bargs->devid);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_DRANGE)
|
|
|
|
CHECK_APPEND_2ARG("drange=%llu..%llu,",
|
|
|
|
bargs->pstart, bargs->pend);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_VRANGE)
|
|
|
|
CHECK_APPEND_2ARG("vrange=%llu..%llu,",
|
|
|
|
bargs->vstart, bargs->vend);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_LIMIT)
|
|
|
|
CHECK_APPEND_1ARG("limit=%llu,", bargs->limit);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_LIMIT_RANGE)
|
|
|
|
CHECK_APPEND_2ARG("limit=%u..%u,",
|
|
|
|
bargs->limit_min, bargs->limit_max);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BALANCE_ARGS_STRIPES_RANGE)
|
|
|
|
CHECK_APPEND_2ARG("stripes=%u..%u,",
|
|
|
|
bargs->stripes_min, bargs->stripes_max);
|
|
|
|
|
|
|
|
#undef CHECK_APPEND_2ARG
|
|
|
|
#undef CHECK_APPEND_1ARG
|
|
|
|
#undef CHECK_APPEND_NOARG
|
|
|
|
|
|
|
|
out_overflow:
|
|
|
|
|
|
|
|
if (size_bp < size_buf)
|
|
|
|
buf[size_buf - size_bp - 1] = '\0'; /* remove last , */
|
|
|
|
else
|
|
|
|
buf[0] = '\0';
|
|
|
|
}
|
|
|
|
|
|
|
|
static void describe_balance_start_or_resume(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
u32 size_buf = 1024;
|
|
|
|
char tmp_buf[192] = {'\0'};
|
|
|
|
char *buf;
|
|
|
|
char *bp;
|
|
|
|
u32 size_bp = size_buf;
|
|
|
|
int ret;
|
|
|
|
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
|
|
|
|
|
|
|
buf = kzalloc(size_buf, GFP_KERNEL);
|
|
|
|
if (!buf)
|
|
|
|
return;
|
|
|
|
|
|
|
|
bp = buf;
|
|
|
|
|
|
|
|
#define CHECK_APPEND_1ARG(a, v1) \
|
|
|
|
do { \
|
|
|
|
ret = snprintf(bp, size_bp, (a), (v1)); \
|
|
|
|
if (ret < 0 || ret >= size_bp) \
|
|
|
|
goto out_overflow; \
|
|
|
|
size_bp -= ret; \
|
|
|
|
bp += ret; \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
if (bctl->flags & BTRFS_BALANCE_FORCE)
|
|
|
|
CHECK_APPEND_1ARG("%s", "-f ");
|
|
|
|
|
|
|
|
if (bctl->flags & BTRFS_BALANCE_DATA) {
|
|
|
|
describe_balance_args(&bctl->data, tmp_buf, sizeof(tmp_buf));
|
|
|
|
CHECK_APPEND_1ARG("-d%s ", tmp_buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bctl->flags & BTRFS_BALANCE_METADATA) {
|
|
|
|
describe_balance_args(&bctl->meta, tmp_buf, sizeof(tmp_buf));
|
|
|
|
CHECK_APPEND_1ARG("-m%s ", tmp_buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bctl->flags & BTRFS_BALANCE_SYSTEM) {
|
|
|
|
describe_balance_args(&bctl->sys, tmp_buf, sizeof(tmp_buf));
|
|
|
|
CHECK_APPEND_1ARG("-s%s ", tmp_buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
#undef CHECK_APPEND_1ARG
|
|
|
|
|
|
|
|
out_overflow:
|
|
|
|
|
|
|
|
if (size_bp < size_buf)
|
|
|
|
buf[size_buf - size_bp - 1] = '\0'; /* remove last " " */
|
|
|
|
btrfs_info(fs_info, "balance: %s %s",
|
|
|
|
(bctl->flags & BTRFS_BALANCE_RESUME) ?
|
|
|
|
"resume" : "start", buf);
|
|
|
|
|
|
|
|
kfree(buf);
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/*
|
2018-03-21 07:20:05 +08:00
|
|
|
* Should be called with balance mutexe held
|
2012-01-17 04:04:47 +08:00
|
|
|
*/
|
2018-05-07 23:44:03 +08:00
|
|
|
int btrfs_balance(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_balance_control *bctl,
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_ioctl_balance_args *bargs)
|
|
|
|
{
|
2017-03-08 06:34:44 +08:00
|
|
|
u64 meta_target, data_target;
|
2012-01-17 04:04:47 +08:00
|
|
|
u64 allowed;
|
2012-03-27 22:09:17 +08:00
|
|
|
int mixed = 0;
|
2012-01-17 04:04:47 +08:00
|
|
|
int ret;
|
2012-11-06 20:15:27 +08:00
|
|
|
u64 num_devices;
|
2013-01-29 18:13:12 +08:00
|
|
|
unsigned seq;
|
2019-09-25 14:29:28 +08:00
|
|
|
bool reducing_redundancy;
|
2023-06-23 13:05:41 +08:00
|
|
|
bool paused = false;
|
2019-05-17 17:43:27 +08:00
|
|
|
int i;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
if (btrfs_fs_closing(fs_info) ||
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_read(&fs_info->balance_pause_req) ||
|
2020-02-17 14:16:52 +08:00
|
|
|
btrfs_should_cancel_balance(fs_info)) {
|
2012-01-17 04:04:47 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
allowed = btrfs_super_incompat_flags(fs_info->super_copy);
|
|
|
|
if (allowed & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS)
|
|
|
|
mixed = 1;
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/*
|
|
|
|
* In case of mixed groups both data and meta should be picked,
|
|
|
|
* and identical options should be given for both of them.
|
|
|
|
*/
|
2012-03-27 22:09:17 +08:00
|
|
|
allowed = BTRFS_BALANCE_DATA | BTRFS_BALANCE_METADATA;
|
|
|
|
if (mixed && (bctl->flags & allowed)) {
|
2012-01-17 04:04:47 +08:00
|
|
|
if (!(bctl->flags & BTRFS_BALANCE_DATA) ||
|
|
|
|
!(bctl->flags & BTRFS_BALANCE_METADATA) ||
|
|
|
|
memcmp(&bctl->data, &bctl->meta, sizeof(bctl->data))) {
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_err(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: mixed groups data and metadata options must be the same");
|
2012-01-17 04:04:47 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: check rw_devices, not num_devices for balance
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-11 00:11:24 +08:00
|
|
|
/*
|
|
|
|
* rw_devices will not change at the moment, device add/delete/replace
|
2020-08-25 23:02:32 +08:00
|
|
|
* are exclusive
|
btrfs: check rw_devices, not num_devices for balance
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-11 00:11:24 +08:00
|
|
|
*/
|
|
|
|
num_devices = fs_info->fs_devices->rw_devices;
|
2019-09-25 10:13:27 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* SINGLE profile on-disk has no profile bit, but in-memory we have a
|
|
|
|
* special bit for it, to make it easier to distinguish. Thus we need
|
|
|
|
* to set it manually, or balance would refuse the profile.
|
|
|
|
*/
|
|
|
|
allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
|
2019-05-17 17:43:27 +08:00
|
|
|
for (i = 0; i < ARRAY_SIZE(btrfs_raid_array); i++)
|
|
|
|
if (num_devices >= btrfs_raid_array[i].devs_min)
|
|
|
|
allowed |= btrfs_raid_array[i].bg_flag;
|
2018-08-10 13:53:21 +08:00
|
|
|
|
2020-02-28 04:00:52 +08:00
|
|
|
if (!validate_convert_profile(fs_info, &bctl->data, allowed, "data") ||
|
|
|
|
!validate_convert_profile(fs_info, &bctl->meta, allowed, "metadata") ||
|
|
|
|
!validate_convert_profile(fs_info, &bctl->sys, allowed, "system")) {
|
2012-01-17 04:04:48 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2019-05-17 17:43:29 +08:00
|
|
|
/*
|
|
|
|
* Allow to reduce metadata or system integrity only if force set for
|
|
|
|
* profiles with redundancy (copies, parity)
|
|
|
|
*/
|
|
|
|
allowed = 0;
|
|
|
|
for (i = 0; i < ARRAY_SIZE(btrfs_raid_array); i++) {
|
|
|
|
if (btrfs_raid_array[i].ncopies >= 2 ||
|
|
|
|
btrfs_raid_array[i].tolerated_failures >= 1)
|
|
|
|
allowed |= btrfs_raid_array[i].bg_flag;
|
|
|
|
}
|
2013-01-29 18:13:12 +08:00
|
|
|
do {
|
|
|
|
seq = read_seqbegin(&fs_info->profiles_lock);
|
|
|
|
|
|
|
|
if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
|
|
|
|
(fs_info->avail_system_alloc_bits & allowed) &&
|
|
|
|
!(bctl->sys.target & allowed)) ||
|
|
|
|
((bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
|
|
|
|
(fs_info->avail_metadata_alloc_bits & allowed) &&
|
2018-11-19 17:48:12 +08:00
|
|
|
!(bctl->meta.target & allowed)))
|
2019-09-25 14:29:28 +08:00
|
|
|
reducing_redundancy = true;
|
2018-11-19 17:48:12 +08:00
|
|
|
else
|
2019-09-25 14:29:28 +08:00
|
|
|
reducing_redundancy = false;
|
2018-11-19 17:48:12 +08:00
|
|
|
|
|
|
|
/* if we're not converting, the target field is uninitialized */
|
|
|
|
meta_target = (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) ?
|
|
|
|
bctl->meta.target : fs_info->avail_metadata_alloc_bits;
|
|
|
|
data_target = (bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT) ?
|
|
|
|
bctl->data.target : fs_info->avail_data_alloc_bits;
|
2013-01-29 18:13:12 +08:00
|
|
|
} while (read_seqretry(&fs_info->profiles_lock, seq));
|
2012-01-17 04:04:48 +08:00
|
|
|
|
2019-09-25 14:29:28 +08:00
|
|
|
if (reducing_redundancy) {
|
2018-11-19 17:48:12 +08:00
|
|
|
if (bctl->flags & BTRFS_BALANCE_FORCE) {
|
|
|
|
btrfs_info(fs_info,
|
2019-09-25 14:29:28 +08:00
|
|
|
"balance: force reducing metadata redundancy");
|
2018-11-19 17:48:12 +08:00
|
|
|
} else {
|
|
|
|
btrfs_err(fs_info,
|
2019-09-25 14:29:28 +08:00
|
|
|
"balance: reduces metadata redundancy, use --force if you want this");
|
2018-11-19 17:48:12 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-03-08 06:34:44 +08:00
|
|
|
if (btrfs_get_num_tolerated_disk_barrier_failures(meta_target) <
|
|
|
|
btrfs_get_num_tolerated_disk_barrier_failures(data_target)) {
|
2016-01-06 16:46:12 +08:00
|
|
|
btrfs_warn(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: metadata profile %s has lower redundancy than data profile %s",
|
2019-05-17 17:43:41 +08:00
|
|
|
btrfs_bg_type_to_raid_name(meta_target),
|
|
|
|
btrfs_bg_type_to_raid_name(data_target));
|
2016-01-06 16:46:12 +08:00
|
|
|
}
|
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
ret = insert_balance_item(fs_info, bctl);
|
2012-01-17 04:04:48 +08:00
|
|
|
if (ret && ret != -EEXIST)
|
2012-01-17 04:04:48 +08:00
|
|
|
goto out;
|
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
if (!(bctl->flags & BTRFS_BALANCE_RESUME)) {
|
|
|
|
BUG_ON(ret == -EEXIST);
|
2018-03-21 09:41:30 +08:00
|
|
|
BUG_ON(fs_info->balance_ctl);
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
fs_info->balance_ctl = bctl;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
2012-01-17 04:04:48 +08:00
|
|
|
} else {
|
|
|
|
BUG_ON(ret != -EEXIST);
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
update_balance_args(bctl);
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2018-03-21 08:31:04 +08:00
|
|
|
ASSERT(!test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
|
|
|
|
set_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
|
btrfs: balance: print args during start and resume
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.
Example command:
$ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
Example kernel log output:
BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-20 16:12:56 +08:00
|
|
|
describe_balance_start_or_resume(fs_info);
|
2012-01-17 04:04:47 +08:00
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
|
|
|
|
ret = __btrfs_balance(fs_info);
|
|
|
|
|
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
2021-11-25 17:14:41 +08:00
|
|
|
if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) {
|
2018-11-20 16:12:57 +08:00
|
|
|
btrfs_info(fs_info, "balance: paused");
|
2021-11-25 17:14:41 +08:00
|
|
|
btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED);
|
2023-06-23 13:05:41 +08:00
|
|
|
paused = true;
|
2021-11-25 17:14:41 +08:00
|
|
|
}
|
2020-07-13 09:03:21 +08:00
|
|
|
/*
|
|
|
|
* Balance can be canceled by:
|
|
|
|
*
|
|
|
|
* - Regular cancel request
|
|
|
|
* Then ret == -ECANCELED and balance_cancel_req > 0
|
|
|
|
*
|
|
|
|
* - Fatal signal to "btrfs" process
|
|
|
|
* Either the signal caught by wait_reserve_ticket() and callers
|
|
|
|
* got -EINTR, or caught by btrfs_should_cancel_balance() and
|
|
|
|
* got -ECANCELED.
|
|
|
|
* Either way, in this case balance_cancel_req = 0, and
|
|
|
|
* ret == -EINTR or ret == -ECANCELED.
|
|
|
|
*
|
|
|
|
* So here we only check the return value to catch canceled balance.
|
|
|
|
*/
|
|
|
|
else if (ret == -ECANCELED || ret == -EINTR)
|
2018-11-20 16:12:57 +08:00
|
|
|
btrfs_info(fs_info, "balance: canceled");
|
|
|
|
else
|
|
|
|
btrfs_info(fs_info, "balance: ended with status: %d", ret);
|
|
|
|
|
2018-03-21 08:31:04 +08:00
|
|
|
clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
|
2012-01-17 04:04:47 +08:00
|
|
|
|
|
|
|
if (bargs) {
|
|
|
|
memset(bargs, 0, sizeof(*bargs));
|
2018-03-21 09:05:27 +08:00
|
|
|
btrfs_update_ioctl_balance_args(fs_info, bargs);
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
2023-06-23 13:05:41 +08:00
|
|
|
/* We didn't pause, we can clean everything up. */
|
|
|
|
if (!paused) {
|
2018-03-21 03:23:09 +08:00
|
|
|
reset_balance_state(fs_info);
|
2020-08-25 23:02:32 +08:00
|
|
|
btrfs_exclop_finish(fs_info);
|
2013-03-06 16:57:55 +08:00
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
wake_up(&fs_info->balance_wait_q);
|
2012-01-17 04:04:47 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
out:
|
2012-01-17 04:04:48 +08:00
|
|
|
if (bctl->flags & BTRFS_BALANCE_RESUME)
|
2018-03-21 03:23:09 +08:00
|
|
|
reset_balance_state(fs_info);
|
2018-03-21 00:28:05 +08:00
|
|
|
else
|
2012-01-17 04:04:48 +08:00
|
|
|
kfree(bctl);
|
2020-08-25 23:02:32 +08:00
|
|
|
btrfs_exclop_finish(fs_info);
|
2018-03-21 00:28:05 +08:00
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int balance_kthread(void *data)
|
|
|
|
{
|
2012-06-23 02:24:13 +08:00
|
|
|
struct btrfs_fs_info *fs_info = data;
|
2012-01-17 04:04:48 +08:00
|
|
|
int ret = 0;
|
2012-01-17 04:04:48 +08:00
|
|
|
|
2022-03-29 14:55:58 +08:00
|
|
|
sb_start_write(fs_info->sb);
|
2012-01-17 04:04:48 +08:00
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
btrfs: balance: print args during start and resume
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.
Example command:
$ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
Example kernel log output:
BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-20 16:12:56 +08:00
|
|
|
if (fs_info->balance_ctl)
|
2018-05-07 23:44:03 +08:00
|
|
|
ret = btrfs_balance(fs_info, fs_info->balance_ctl, NULL);
|
2012-01-17 04:04:48 +08:00
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
2022-03-29 14:55:58 +08:00
|
|
|
sb_end_write(fs_info->sb);
|
2012-06-23 02:24:13 +08:00
|
|
|
|
2012-01-17 04:04:48 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-06-23 02:24:13 +08:00
|
|
|
int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct task_struct *tsk;
|
|
|
|
|
2018-03-21 09:29:13 +08:00
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
2012-06-23 02:24:13 +08:00
|
|
|
if (!fs_info->balance_ctl) {
|
2018-03-21 09:29:13 +08:00
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
2012-06-23 02:24:13 +08:00
|
|
|
return 0;
|
|
|
|
}
|
2018-03-21 09:29:13 +08:00
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
2012-06-23 02:24:13 +08:00
|
|
|
|
2016-06-10 09:38:35 +08:00
|
|
|
if (btrfs_test_opt(fs_info, SKIP_BALANCE)) {
|
2018-05-16 10:51:26 +08:00
|
|
|
btrfs_info(fs_info, "balance: resume skipped");
|
2012-06-23 02:24:13 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-11-25 17:14:41 +08:00
|
|
|
spin_lock(&fs_info->super_lock);
|
|
|
|
ASSERT(fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED);
|
|
|
|
fs_info->exclusive_operation = BTRFS_EXCLOP_BALANCE;
|
|
|
|
spin_unlock(&fs_info->super_lock);
|
2018-05-17 15:16:51 +08:00
|
|
|
/*
|
|
|
|
* A ro->rw remount sequence should continue with the paused balance
|
|
|
|
* regardless of who pauses it, system or the user as of now, so set
|
|
|
|
* the resume flag.
|
|
|
|
*/
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
fs_info->balance_ctl->flags |= BTRFS_BALANCE_RESUME;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
|
2012-06-23 02:24:13 +08:00
|
|
|
tsk = kthread_run(balance_kthread, fs_info, "btrfs-balance");
|
2013-07-15 19:22:18 +08:00
|
|
|
return PTR_ERR_OR_ZERO(tsk);
|
2012-06-23 02:24:13 +08:00
|
|
|
}
|
|
|
|
|
2012-06-23 02:24:12 +08:00
|
|
|
int btrfs_recover_balance(struct btrfs_fs_info *fs_info)
|
2012-01-17 04:04:48 +08:00
|
|
|
{
|
|
|
|
struct btrfs_balance_control *bctl;
|
|
|
|
struct btrfs_balance_item *item;
|
|
|
|
struct btrfs_disk_balance_args disk_bargs;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_BALANCE_OBJECTID;
|
2016-01-26 00:51:31 +08:00
|
|
|
key.type = BTRFS_TEMPORARY_ITEM_KEY;
|
2012-01-17 04:04:48 +08:00
|
|
|
key.offset = 0;
|
|
|
|
|
2012-06-23 02:24:12 +08:00
|
|
|
ret = btrfs_search_slot(NULL, fs_info->tree_root, &key, path, 0, 0);
|
2012-01-17 04:04:48 +08:00
|
|
|
if (ret < 0)
|
2012-06-23 02:24:12 +08:00
|
|
|
goto out;
|
2012-01-17 04:04:48 +08:00
|
|
|
if (ret > 0) { /* ret = -ENOENT; */
|
|
|
|
ret = 0;
|
2012-06-23 02:24:12 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
bctl = kzalloc(sizeof(*bctl), GFP_NOFS);
|
|
|
|
if (!bctl) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
2012-01-17 04:04:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_balance_item);
|
|
|
|
|
2012-06-23 02:24:12 +08:00
|
|
|
bctl->flags = btrfs_balance_flags(leaf, item);
|
|
|
|
bctl->flags |= BTRFS_BALANCE_RESUME;
|
2012-01-17 04:04:48 +08:00
|
|
|
|
|
|
|
btrfs_balance_data(leaf, item, &disk_bargs);
|
|
|
|
btrfs_disk_balance_args_to_cpu(&bctl->data, &disk_bargs);
|
|
|
|
btrfs_balance_meta(leaf, item, &disk_bargs);
|
|
|
|
btrfs_disk_balance_args_to_cpu(&bctl->meta, &disk_bargs);
|
|
|
|
btrfs_balance_sys(leaf, item, &disk_bargs);
|
|
|
|
btrfs_disk_balance_args_to_cpu(&bctl->sys, &disk_bargs);
|
|
|
|
|
2018-03-21 03:07:58 +08:00
|
|
|
/*
|
|
|
|
* This should never happen, as the paused balance state is recovered
|
|
|
|
* during mount without any chance of other exclusive ops to collide.
|
|
|
|
*
|
|
|
|
* This gives the exclusive op status to balance and keeps in paused
|
|
|
|
* state until user intervention (cancel or umount). If the ownership
|
|
|
|
* cannot be assigned, show a message but do not fail. The balance
|
|
|
|
* is in a paused state and must have fs_info::balance_ctl properly
|
|
|
|
* set up.
|
|
|
|
*/
|
2021-11-25 17:14:41 +08:00
|
|
|
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED))
|
2018-03-21 03:07:58 +08:00
|
|
|
btrfs_warn(fs_info,
|
2018-05-16 10:51:26 +08:00
|
|
|
"balance: cannot set exclusive op status, resume manually");
|
2013-01-20 21:57:57 +08:00
|
|
|
|
2020-12-17 00:22:14 +08:00
|
|
|
btrfs_release_path(path);
|
|
|
|
|
2012-06-23 02:24:12 +08:00
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
2018-03-21 09:41:30 +08:00
|
|
|
BUG_ON(fs_info->balance_ctl);
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
|
|
|
fs_info->balance_ctl = bctl;
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
2012-06-23 02:24:12 +08:00
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
2012-01-17 04:04:48 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
2008-04-29 03:29:52 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
int btrfs_pause_balance(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
|
|
|
if (!fs_info->balance_ctl) {
|
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
return -ENOTCONN;
|
|
|
|
}
|
|
|
|
|
2018-03-21 08:31:04 +08:00
|
|
|
if (test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags)) {
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_inc(&fs_info->balance_pause_req);
|
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
|
|
|
|
wait_event(fs_info->balance_wait_q,
|
2018-03-21 08:31:04 +08:00
|
|
|
!test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
|
2012-01-17 04:04:49 +08:00
|
|
|
|
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
|
|
|
/* we are good with balance_ctl ripped off from under us */
|
2018-03-21 08:31:04 +08:00
|
|
|
BUG_ON(test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_dec(&fs_info->balance_pause_req);
|
|
|
|
} else {
|
|
|
|
ret = -ENOTCONN;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
int btrfs_cancel_balance(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
|
|
|
if (!fs_info->balance_ctl) {
|
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
return -ENOTCONN;
|
|
|
|
}
|
|
|
|
|
2018-03-21 08:45:32 +08:00
|
|
|
/*
|
|
|
|
* A paused balance with the item stored on disk can be resumed at
|
|
|
|
* mount time if the mount is read-write. Otherwise it's still paused
|
|
|
|
* and we must not allow cancelling as it deletes the item.
|
|
|
|
*/
|
|
|
|
if (sb_rdonly(fs_info->sb)) {
|
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
return -EROFS;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_inc(&fs_info->balance_cancel_req);
|
|
|
|
/*
|
|
|
|
* if we are running just wait and return, balance item is
|
|
|
|
* deleted in btrfs_balance in this case
|
|
|
|
*/
|
2018-03-21 08:31:04 +08:00
|
|
|
if (test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags)) {
|
2012-01-17 04:04:49 +08:00
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
wait_event(fs_info->balance_wait_q,
|
2018-03-21 08:31:04 +08:00
|
|
|
!test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
|
2012-01-17 04:04:49 +08:00
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
|
|
|
} else {
|
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
2018-03-21 07:20:05 +08:00
|
|
|
/*
|
|
|
|
* Lock released to allow other waiters to continue, we'll
|
|
|
|
* reexamine the status again.
|
|
|
|
*/
|
2012-01-17 04:04:49 +08:00
|
|
|
mutex_lock(&fs_info->balance_mutex);
|
|
|
|
|
2018-03-21 00:28:05 +08:00
|
|
|
if (fs_info->balance_ctl) {
|
2018-03-21 03:23:09 +08:00
|
|
|
reset_balance_state(fs_info);
|
2020-08-25 23:02:32 +08:00
|
|
|
btrfs_exclop_finish(fs_info);
|
2018-05-16 10:51:26 +08:00
|
|
|
btrfs_info(fs_info, "balance: canceled");
|
2018-03-21 00:28:05 +08:00
|
|
|
}
|
2012-01-17 04:04:49 +08:00
|
|
|
}
|
|
|
|
|
2023-08-15 14:55:59 +08:00
|
|
|
ASSERT(!test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags));
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_dec(&fs_info->balance_cancel_req);
|
|
|
|
mutex_unlock(&fs_info->balance_mutex);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
/*
|
|
|
|
* shrinking a device means finding all of the device extents past
|
|
|
|
* the new size, and then following the back refs to the chunks.
|
|
|
|
* The chunk relocation code actually frees the device extent
|
|
|
|
*/
|
|
|
|
int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2008-04-26 04:53:30 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_dev_extent *dev_extent = NULL;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
u64 length;
|
|
|
|
u64 chunk_offset;
|
|
|
|
int ret;
|
|
|
|
int slot;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
int failed = 0;
|
|
|
|
bool retried = false;
|
2008-04-26 04:53:30 +08:00
|
|
|
struct extent_buffer *l;
|
|
|
|
struct btrfs_key key;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2008-04-26 04:53:30 +08:00
|
|
|
u64 old_total = btrfs_super_total_bytes(super_copy);
|
2014-09-03 21:35:38 +08:00
|
|
|
u64 old_size = btrfs_device_get_total_bytes(device);
|
2017-06-16 19:39:20 +08:00
|
|
|
u64 diff;
|
2019-03-25 20:31:23 +08:00
|
|
|
u64 start;
|
2023-09-28 01:46:59 +08:00
|
|
|
u64 free_diff = 0;
|
2017-06-16 19:39:20 +08:00
|
|
|
|
|
|
|
new_size = round_down(new_size, fs_info->sectorsize);
|
2019-03-25 20:31:23 +08:00
|
|
|
start = new_size;
|
2017-07-21 16:28:24 +08:00
|
|
|
diff = round_down(old_size - new_size, fs_info->sectorsize);
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2017-12-04 12:54:55 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
|
2012-11-06 01:29:28 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2018-04-27 16:22:07 +08:00
|
|
|
path->reada = READA_BACK;
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2019-03-25 20:31:23 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
}
|
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_device_set_total_bytes(device, new_size);
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2008-11-18 10:11:30 +08:00
|
|
|
device->fs_devices->total_rw_bytes -= diff;
|
2023-09-28 01:46:59 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The new free_chunk_space is new_size - used, so we have to
|
|
|
|
* subtract the delta of the old free_chunk_space which included
|
|
|
|
* old_size - used. If used > new_size then just subtract this
|
|
|
|
* entire device's free space.
|
|
|
|
*/
|
|
|
|
if (device->bytes_used < new_size)
|
|
|
|
free_diff = (old_size - device->bytes_used) -
|
|
|
|
(new_size - device->bytes_used);
|
|
|
|
else
|
|
|
|
free_diff = old_size - device->bytes_used;
|
|
|
|
atomic64_sub(free_diff, &fs_info->free_chunk_space);
|
2011-09-27 05:12:22 +08:00
|
|
|
}
|
2019-03-25 20:31:23 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Once the device's size has been set to the new size, ensure all
|
|
|
|
* in-memory chunks are synced to disk so that the loop below sees them
|
|
|
|
* and relocates them accordingly.
|
|
|
|
*/
|
2019-03-27 20:24:12 +08:00
|
|
|
if (contains_pending_extent(device, &start, diff)) {
|
2019-03-25 20:31:23 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
ret = btrfs_commit_transaction(trans);
|
|
|
|
if (ret)
|
|
|
|
goto done;
|
|
|
|
} else {
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
again:
|
2008-04-26 04:53:30 +08:00
|
|
|
key.objectid = device->devid;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
|
|
|
|
2012-03-27 22:09:18 +08:00
|
|
|
do {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_lock(&fs_info->reclaim_bgs_lock);
|
2008-04-26 04:53:30 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
if (ret < 0) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2008-04-26 04:53:30 +08:00
|
|
|
goto done;
|
Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:
[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)
The sequence of steps leading to this are:
CPU 0 CPU 1
btrfs_balance()
btrfs_relocate_chunk()
btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)
cleaner_kthread
locks fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction
checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))
blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex
acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex
btrfs_relocate_block_group(bg X) returns
btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)
Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.
This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:
while true; do btrfs balance start -dusage=0 <mountpoint> ; done
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-11 07:58:53 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
ret = btrfs_previous_item(root, path, 0, key.type);
|
|
|
|
if (ret) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2020-12-17 21:21:16 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto done;
|
2008-04-26 04:53:30 +08:00
|
|
|
ret = 0;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-07-22 21:59:00 +08:00
|
|
|
break;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
l = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(l, &key, path->slots[0]);
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
if (key.objectid != device->devid) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-07-22 21:59:00 +08:00
|
|
|
break;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
|
|
|
|
length = btrfs_dev_extent_length(l, dev_extent);
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
if (key.offset + length <= new_size) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-04-27 19:29:03 +08:00
|
|
|
break;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
|
|
|
|
chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-04-26 04:53:30 +08:00
|
|
|
|
2017-11-16 07:28:11 +08:00
|
|
|
/*
|
|
|
|
* We may be relocating the only data chunk we have,
|
|
|
|
* which could potentially end up with losing data's
|
|
|
|
* raid profile, so lets allocate an empty one in
|
|
|
|
* advance.
|
|
|
|
*/
|
|
|
|
ret = btrfs_may_alloc_data_chunk(fs_info, chunk_offset);
|
|
|
|
if (ret < 0) {
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2017-11-16 07:28:11 +08:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_relocate_chunk(fs_info, chunk_offset);
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
if (ret == -ENOSPC) {
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
failed++;
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
} else if (ret) {
|
|
|
|
if (ret == -ETXTBSY) {
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"could not shrink block group %llu due to active swapfile",
|
|
|
|
chunk_offset);
|
|
|
|
}
|
|
|
|
goto done;
|
|
|
|
}
|
2012-03-27 22:09:18 +08:00
|
|
|
} while (key.offset-- > 0);
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
if (failed && !retried) {
|
|
|
|
failed = 0;
|
|
|
|
retried = true;
|
|
|
|
goto again;
|
|
|
|
} else if (failed && retried) {
|
|
|
|
ret = -ENOSPC;
|
|
|
|
goto done;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
|
|
|
|
2009-04-27 19:29:03 +08:00
|
|
|
/* Shrinking succeeded, else we would be at "done". */
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
2011-01-20 14:19:37 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2020-07-31 19:29:11 +08:00
|
|
|
/* Clear all state bits beyond the shrunk device size */
|
|
|
|
clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
|
|
|
|
CHUNK_STATE_MASK);
|
|
|
|
|
2014-09-03 21:35:38 +08:00
|
|
|
btrfs_device_set_disk_total_bytes(device, new_size);
|
2019-03-25 20:31:22 +08:00
|
|
|
if (list_empty(&device->post_commit_list))
|
|
|
|
list_add_tail(&device->post_commit_list,
|
|
|
|
&trans->transaction->dev_update_list);
|
2009-04-27 19:29:03 +08:00
|
|
|
|
|
|
|
WARN_ON(diff > old_total);
|
2017-06-16 19:39:20 +08:00
|
|
|
btrfs_set_super_total_bytes(super_copy,
|
|
|
|
round_down(old_total - diff, fs_info->sectorsize));
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2014-09-03 21:35:41 +08:00
|
|
|
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_reserve_chunk_metadata(trans, false);
|
2014-09-03 21:35:41 +08:00
|
|
|
/* Now btrfs_update_device() will change the on-disk size. */
|
|
|
|
ret = btrfs_update_device(trans, device);
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 17:12:49 +08:00
|
|
|
btrfs_trans_release_chunk_metadata(trans);
|
2018-08-06 18:12:37 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
} else {
|
|
|
|
ret = btrfs_commit_transaction(trans);
|
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
done:
|
|
|
|
btrfs_free_path(path);
|
2015-06-02 21:43:21 +08:00
|
|
|
if (ret) {
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
2015-06-02 21:43:21 +08:00
|
|
|
btrfs_device_set_total_bytes(device, old_size);
|
2023-09-28 01:46:59 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2015-06-02 21:43:21 +08:00
|
|
|
device->fs_devices->total_rw_bytes += diff;
|
2023-09-28 01:46:59 +08:00
|
|
|
atomic64_add(free_diff, &fs_info->free_chunk_space);
|
|
|
|
}
|
2016-10-05 01:34:27 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
2015-06-02 21:43:21 +08:00
|
|
|
}
|
2008-04-26 04:53:30 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_key *key,
|
|
|
|
struct btrfs_chunk *chunk, int item_size)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_disk_key disk_key;
|
|
|
|
u32 array_size;
|
|
|
|
u8 *ptr;
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
lockdep_assert_held(&fs_info->chunk_mutex);
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
array_size = btrfs_super_sys_array_size(super_copy);
|
2014-04-21 20:13:11 +08:00
|
|
|
if (array_size + item_size + sizeof(disk_key)
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
> BTRFS_SYSTEM_CHUNK_ARRAY_SIZE)
|
2008-03-25 03:01:56 +08:00
|
|
|
return -EFBIG;
|
|
|
|
|
|
|
|
ptr = super_copy->sys_chunk_array + array_size;
|
|
|
|
btrfs_cpu_key_to_disk(&disk_key, key);
|
|
|
|
memcpy(ptr, &disk_key, sizeof(disk_key));
|
|
|
|
ptr += sizeof(disk_key);
|
|
|
|
memcpy(ptr, chunk, item_size);
|
|
|
|
item_size += sizeof(disk_key);
|
|
|
|
btrfs_set_super_sys_array_size(super_copy, array_size + item_size);
|
2014-09-03 21:35:39 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
/*
|
|
|
|
* sort the devices in descending order by max_avail, total_avail
|
|
|
|
*/
|
|
|
|
static int btrfs_cmp_device_info(const void *a, const void *b)
|
2008-04-18 22:29:51 +08:00
|
|
|
{
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
const struct btrfs_device_info *di_a = a;
|
|
|
|
const struct btrfs_device_info *di_b = b;
|
2008-04-18 22:29:51 +08:00
|
|
|
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
if (di_a->max_avail > di_b->max_avail)
|
2011-01-05 18:07:28 +08:00
|
|
|
return -1;
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
if (di_a->max_avail < di_b->max_avail)
|
2011-01-05 18:07:28 +08:00
|
|
|
return 1;
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
if (di_a->total_avail > di_b->total_avail)
|
|
|
|
return -1;
|
|
|
|
if (di_a->total_avail < di_b->total_avail)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
2011-01-05 18:07:28 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
|
|
|
|
{
|
2015-01-20 15:11:44 +08:00
|
|
|
if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
|
2013-01-30 07:40:14 +08:00
|
|
|
return;
|
|
|
|
|
2013-04-11 18:30:16 +08:00
|
|
|
btrfs_set_fs_incompat(info, RAID56);
|
2013-01-30 07:40:14 +08:00
|
|
|
}
|
|
|
|
|
2018-07-11 00:15:05 +08:00
|
|
|
static void check_raid1c34_incompat_flag(struct btrfs_fs_info *info, u64 type)
|
|
|
|
{
|
|
|
|
if (!(type & (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
btrfs_set_fs_incompat(info, RAID1C34);
|
|
|
|
}
|
|
|
|
|
2020-02-25 11:56:10 +08:00
|
|
|
/*
|
2021-08-18 18:41:19 +08:00
|
|
|
* Structure used internally for btrfs_create_chunk() function.
|
2020-02-25 11:56:10 +08:00
|
|
|
* Wraps needed parameters.
|
|
|
|
*/
|
|
|
|
struct alloc_chunk_ctl {
|
|
|
|
u64 start;
|
|
|
|
u64 type;
|
|
|
|
/* Total number of stripes to allocate */
|
|
|
|
int num_stripes;
|
|
|
|
/* sub_stripes info for map */
|
|
|
|
int sub_stripes;
|
|
|
|
/* Stripes per device */
|
|
|
|
int dev_stripes;
|
|
|
|
/* Maximum number of devices to use */
|
|
|
|
int devs_max;
|
|
|
|
/* Minimum number of devices to use */
|
|
|
|
int devs_min;
|
|
|
|
/* ndevs has to be a multiple of this */
|
|
|
|
int devs_increment;
|
|
|
|
/* Number of copies */
|
|
|
|
int ncopies;
|
|
|
|
/* Number of stripes worth of bytes to store parity information */
|
|
|
|
int nparity;
|
|
|
|
u64 max_stripe_size;
|
|
|
|
u64 max_chunk_size;
|
2020-02-25 11:56:15 +08:00
|
|
|
u64 dev_extent_min;
|
2020-02-25 11:56:10 +08:00
|
|
|
u64 stripe_size;
|
|
|
|
u64 chunk_size;
|
|
|
|
int ndevs;
|
|
|
|
};
|
|
|
|
|
2020-02-25 11:56:11 +08:00
|
|
|
static void init_alloc_chunk_ctl_policy_regular(
|
|
|
|
struct btrfs_fs_devices *fs_devices,
|
|
|
|
struct alloc_chunk_ctl *ctl)
|
|
|
|
{
|
2022-02-09 03:31:20 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2020-02-25 11:56:11 +08:00
|
|
|
|
2022-02-09 03:31:20 +08:00
|
|
|
space_info = btrfs_find_space_info(fs_devices->fs_info, ctl->type);
|
|
|
|
ASSERT(space_info);
|
|
|
|
|
|
|
|
ctl->max_chunk_size = READ_ONCE(space_info->chunk_size);
|
btrfs: fix stripe length calculation for non-zoned data chunk allocation
Commit f6fca3917b4d "btrfs: store chunk size in space-info struct"
broke data chunk allocations on non-zoned multi-device filesystems when
using default chunk_size. Commit 5da431b71d4b "btrfs: fix the max chunk
size and stripe length calculation" partially fixed that, and this patch
completes the fix for that case.
After commit f6fca3917b4d and 5da431b71d4b, the sequence of events for
a data chunk allocation on a non-zoned filesystem is:
1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
unmodified. Before f6fca3917b4d, ctl->max_stripe_len value was
1 GiB for non-zoned data chunks and not configurable.
2. btrfs_create_chunk calls gather_device_info which consumes
and produces more fields of chunk_ctl.
3. gather_device_info multiplies ctl->max_stripe_len by
ctl->dev_stripes (which is 1 in all cases except dup)
and calls find_free_dev_extent with that number as num_bytes.
4. find_free_dev_extent locates the first dev_extent hole on
a device which is at least as large as num_bytes. With default
max_chunk_size from f6fca3917b4d, it finds the first hole which is
longer than 10 GiB, or the largest hole if that hole is shorter
than 10 GiB. This is different from the pre-f6fca3917b4d
behavior, where num_bytes is 1 GiB, and find_free_dev_extent
may choose a different hole.
5. gather_device_info repeats step 4 with all devices to find
the first or largest dev_extent hole that can be allocated on
each device.
6. gather_device_info sorts the device list by the hole size
on each device, using total unallocated space on each device to
break ties, then returns to btrfs_create_chunk with the list.
7. btrfs_create_chunk calls decide_stripe_size_regular.
8. decide_stripe_size_regular finds the largest stripe_len that
fits across the first nr_devs device dev_extent holes that were
found by gather_device_info (and satisfies other constraints
on stripe_len that are not relevant here).
9. decide_stripe_size_regular caps the length of the stripe it
computed at 1 GiB. This cap appeared in 5da431b71d4b to correct
one of the other regressions introduced in f6fca3917b4d.
10. btrfs_create_chunk creates a new chunk with the above
computed size and number of devices.
At step 4, gather_device_info() has found a location where stripe up to
10 GiB in length could be allocated on several devices, and selected
which devices should have a dev_extent allocated on them, but at step
9, only 1 GiB of the space that was found on each device can be used.
This mismatch causes new suboptimal chunk allocation cases that did not
occur in pre-f6fca3917b4d kernels.
Consider a filesystem using raid1 profile with 3 devices. After some
balances, device 1 has 10x 1 GiB unallocated space, while devices 2
and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
space, but distributed across different numbers of dev_extent holes.
For visualization, let's ignore all the chunks that were allocated before
this point, and focus on the remaining holes:
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
Device 2: [__________] (10 GiB contig unallocated)
Device 3: [__________] (10 GiB contig unallocated)
Before f6fca3917b4d, the allocator would fill these optimally by
allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
([13]), or 2 and 3 ([23]):
[after 0 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [__________] (10 GiB)
Device 3: [__________] (10 GiB)
[after 1 chunk allocation]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
Device 2: [12] [_________] (9 GiB)
Device 3: [__________] (10 GiB)
[after 2 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [12] [_________] (9 GiB)
Device 3: [13] [_________] (9 GiB)
[after 3 chunk allocations]
Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
Device 2: [12] [12] [________] (8 GiB)
Device 3: [13] [_________] (9 GiB)
[...]
[after 12 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 13 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 14 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)
[after 15 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)
This allocates all of the space with no waste. The sorting function used
by gather_device_info considers free space holes above 1 GiB in length
to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
long hole on each device, all the holes appear equal in the sort, and the
comparison falls back to sorting devices by total free space. This keeps
usable space on each device equal so they can all be filled completely.
After f6fca3917b4d, the allocator prefers the devices with larger holes
over the devices with more free space, so it makes bad allocation choices:
[after 1 chunk allocation]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [_________] (9 GiB)
Device 3: [23] [_________] (9 GiB)
[after 2 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [________] (8 GiB)
Device 3: [23] [23] [________] (8 GiB)
[after 3 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [_______] (7 GiB)
Device 3: [23] [23] [23] [_______] (7 GiB)
[...]
[after 9 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 10 chunk allocations]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 11 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)
No further allocations are possible, with 8 GiB wasted (4 GiB of data
space). The sort in gather_device_info now considers free space in
holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
At that point, with only 1 GiB unallocated on every device, the largest
hole length on each device is equal at 1 GiB, so the sort finally moves
to ordering the devices with the most free space, but by this time it
is too late to make use of the free space on device 1.
Note that it's possible to contrive a case where the pre-f6fca3917b4d
allocator fails the same way, but these cases generally have extensive
dev_extent fragmentation as a precondition (e.g. many holes of 768M
in length on one device, and few holes 1 GiB in length on the others).
With the regression in f6fca3917b4d, bad chunk allocation can occur even
under optimal conditions, when all dev_extent holes are exact multiples
of stripe_len in length, as in the example above.
Also note that post-f6fca3917b4d kernels do treat dev_extent holes
larger than 10 GiB as equal, so the bad behavior won't show up on a
freshly formatted filesystem; however, as the filesystem ages and fills
up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
can show up as a failure to balance after adding or removing devices,
or an unexpected shortfall in available space due to unequal allocation.
To fix the regression and make data chunk allocation work
again, set ctl->max_stripe_len back to the original SZ_1G, or
space_info->chunk_size if that's smaller (the latter can happen if the
user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).
While researching the background of the earlier commits, I found that an
identical fix was already proposed at:
https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/
The previous review missed one detail: ctl->max_stripe_len is used
before decide_stripe_size_regular() is called, when it is too late for
the changes in that function to have any effect. ctl->max_stripe_len is
not used directly by decide_stripe_size_regular(), but the parameter
does heavily influence the per-device free space data presented to
the function.
Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-07 13:14:21 +08:00
|
|
|
ctl->max_stripe_size = min_t(u64, ctl->max_chunk_size, SZ_1G);
|
2022-02-09 03:31:20 +08:00
|
|
|
|
|
|
|
if (ctl->type & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
ctl->devs_max = min_t(int, ctl->devs_max, BTRFS_MAX_DEVS_SYS_CHUNK);
|
2020-02-25 11:56:11 +08:00
|
|
|
|
|
|
|
/* We don't want a chunk larger than 10% of writable space */
|
2022-10-27 05:25:14 +08:00
|
|
|
ctl->max_chunk_size = min(mult_perc(fs_devices->total_rw_bytes, 10),
|
2020-02-25 11:56:11 +08:00
|
|
|
ctl->max_chunk_size);
|
2023-06-22 14:42:40 +08:00
|
|
|
ctl->dev_extent_min = btrfs_stripe_nr_to_offset(ctl->dev_stripes);
|
2020-02-25 11:56:11 +08:00
|
|
|
}
|
|
|
|
|
2021-02-04 18:21:48 +08:00
|
|
|
static void init_alloc_chunk_ctl_policy_zoned(
|
|
|
|
struct btrfs_fs_devices *fs_devices,
|
|
|
|
struct alloc_chunk_ctl *ctl)
|
|
|
|
{
|
|
|
|
u64 zone_size = fs_devices->fs_info->zone_size;
|
|
|
|
u64 limit;
|
|
|
|
int min_num_stripes = ctl->devs_min * ctl->dev_stripes;
|
|
|
|
int min_data_stripes = (min_num_stripes - ctl->nparity) / ctl->ncopies;
|
|
|
|
u64 min_chunk_size = min_data_stripes * zone_size;
|
|
|
|
u64 type = ctl->type;
|
|
|
|
|
|
|
|
ctl->max_stripe_size = zone_size;
|
|
|
|
if (type & BTRFS_BLOCK_GROUP_DATA) {
|
|
|
|
ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
|
|
|
|
zone_size);
|
|
|
|
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
|
|
|
|
ctl->max_chunk_size = ctl->max_stripe_size;
|
|
|
|
} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
|
|
|
|
ctl->max_chunk_size = 2 * ctl->max_stripe_size;
|
|
|
|
ctl->devs_max = min_t(int, ctl->devs_max,
|
|
|
|
BTRFS_MAX_DEVS_SYS_CHUNK);
|
2021-03-23 22:31:19 +08:00
|
|
|
} else {
|
|
|
|
BUG();
|
2021-02-04 18:21:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* We don't want a chunk larger than 10% of writable space */
|
2022-10-27 05:25:14 +08:00
|
|
|
limit = max(round_down(mult_perc(fs_devices->total_rw_bytes, 10),
|
2021-02-04 18:21:48 +08:00
|
|
|
zone_size),
|
|
|
|
min_chunk_size);
|
|
|
|
ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
|
|
|
|
ctl->dev_extent_min = zone_size * ctl->dev_stripes;
|
|
|
|
}
|
|
|
|
|
2020-02-25 11:56:11 +08:00
|
|
|
static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
|
|
|
|
struct alloc_chunk_ctl *ctl)
|
|
|
|
{
|
|
|
|
int index = btrfs_bg_flags_to_raid_index(ctl->type);
|
|
|
|
|
|
|
|
ctl->sub_stripes = btrfs_raid_array[index].sub_stripes;
|
|
|
|
ctl->dev_stripes = btrfs_raid_array[index].dev_stripes;
|
|
|
|
ctl->devs_max = btrfs_raid_array[index].devs_max;
|
|
|
|
if (!ctl->devs_max)
|
|
|
|
ctl->devs_max = BTRFS_MAX_DEVS(fs_devices->fs_info);
|
|
|
|
ctl->devs_min = btrfs_raid_array[index].devs_min;
|
|
|
|
ctl->devs_increment = btrfs_raid_array[index].devs_increment;
|
|
|
|
ctl->ncopies = btrfs_raid_array[index].ncopies;
|
|
|
|
ctl->nparity = btrfs_raid_array[index].nparity;
|
|
|
|
ctl->ndevs = 0;
|
|
|
|
|
|
|
|
switch (fs_devices->chunk_alloc_policy) {
|
|
|
|
case BTRFS_CHUNK_ALLOC_REGULAR:
|
|
|
|
init_alloc_chunk_ctl_policy_regular(fs_devices, ctl);
|
|
|
|
break;
|
2021-02-04 18:21:48 +08:00
|
|
|
case BTRFS_CHUNK_ALLOC_ZONED:
|
|
|
|
init_alloc_chunk_ctl_policy_zoned(fs_devices, ctl);
|
|
|
|
break;
|
2020-02-25 11:56:11 +08:00
|
|
|
default:
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-02-25 11:56:12 +08:00
|
|
|
static int gather_device_info(struct btrfs_fs_devices *fs_devices,
|
|
|
|
struct alloc_chunk_ctl *ctl,
|
|
|
|
struct btrfs_device_info *devices_info)
|
2011-01-05 18:07:28 +08:00
|
|
|
{
|
2020-02-25 11:56:12 +08:00
|
|
|
struct btrfs_fs_info *info = fs_devices->fs_info;
|
2017-06-27 15:02:24 +08:00
|
|
|
struct btrfs_device *device;
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
u64 total_avail;
|
2020-02-25 11:56:12 +08:00
|
|
|
u64 dev_extent_want = ctl->max_stripe_size * ctl->dev_stripes;
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
int ret;
|
2020-02-25 11:56:12 +08:00
|
|
|
int ndevs = 0;
|
|
|
|
u64 max_avail;
|
|
|
|
u64 dev_offset;
|
2010-03-18 04:45:56 +08:00
|
|
|
|
2010-04-06 21:37:47 +08:00
|
|
|
/*
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
* in the first pass through the devices list, we gather information
|
|
|
|
* about the available holes on each device.
|
2010-04-06 21:37:47 +08:00
|
|
|
*/
|
2017-06-27 15:02:24 +08:00
|
|
|
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2012-11-03 18:58:34 +08:00
|
|
|
WARN(1, KERN_ERR
|
2013-12-21 00:37:06 +08:00
|
|
|
"BTRFS: read-only device in alloc_list\n");
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
continue;
|
|
|
|
}
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2017-12-04 12:54:53 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
|
|
|
|
&device->dev_state) ||
|
2017-12-04 12:54:55 +08:00
|
|
|
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
continue;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
if (device->total_bytes > device->bytes_used)
|
|
|
|
total_avail = device->total_bytes - device->bytes_used;
|
|
|
|
else
|
|
|
|
total_avail = 0;
|
Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-02 10:39:03 +08:00
|
|
|
|
|
|
|
/* If there is no space on this device, skip it. */
|
2020-02-25 11:56:15 +08:00
|
|
|
if (total_avail < ctl->dev_extent_min)
|
Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-02 10:39:03 +08:00
|
|
|
continue;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2020-02-25 11:56:12 +08:00
|
|
|
ret = find_free_dev_extent(device, dev_extent_want, &dev_offset,
|
|
|
|
&max_avail);
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
if (ret && ret != -ENOSPC)
|
2020-02-25 11:56:12 +08:00
|
|
|
return ret;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
if (ret == 0)
|
2020-02-25 11:56:12 +08:00
|
|
|
max_avail = dev_extent_want;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2020-02-25 11:56:15 +08:00
|
|
|
if (max_avail < ctl->dev_extent_min) {
|
2018-01-22 13:50:54 +08:00
|
|
|
if (btrfs_test_opt(info, ENOSPC_DEBUG))
|
|
|
|
btrfs_debug(info,
|
2020-02-25 11:56:12 +08:00
|
|
|
"%s: devid %llu has no free space, have=%llu want=%llu",
|
2018-01-22 13:50:54 +08:00
|
|
|
__func__, device->devid, max_avail,
|
2020-02-25 11:56:15 +08:00
|
|
|
ctl->dev_extent_min);
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
continue;
|
2018-01-22 13:50:54 +08:00
|
|
|
}
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2013-01-31 08:55:01 +08:00
|
|
|
if (ndevs == fs_devices->rw_devices) {
|
|
|
|
WARN(1, "%s: found more than %llu devices\n",
|
|
|
|
__func__, fs_devices->rw_devices);
|
|
|
|
break;
|
|
|
|
}
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
devices_info[ndevs].dev_offset = dev_offset;
|
|
|
|
devices_info[ndevs].max_avail = max_avail;
|
|
|
|
devices_info[ndevs].total_avail = total_avail;
|
|
|
|
devices_info[ndevs].dev = device;
|
|
|
|
++ndevs;
|
|
|
|
}
|
2020-02-25 11:56:12 +08:00
|
|
|
ctl->ndevs = ndevs;
|
2011-01-05 18:07:28 +08:00
|
|
|
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
/*
|
|
|
|
* now sort the devices by hole size / available space
|
|
|
|
*/
|
2020-02-25 11:56:12 +08:00
|
|
|
sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
btrfs_cmp_device_info, NULL);
|
2011-01-05 18:07:28 +08:00
|
|
|
|
2020-02-25 11:56:12 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-02-25 11:56:13 +08:00
|
|
|
static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
|
|
|
|
struct btrfs_device_info *devices_info)
|
|
|
|
{
|
|
|
|
/* Number of stripes that count for block group size */
|
|
|
|
int data_stripes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The primary goal is to maximize the number of stripes, so use as
|
|
|
|
* many devices as possible, even if the stripes are not maximum sized.
|
|
|
|
*
|
|
|
|
* The DUP profile stores more than one stripe per device, the
|
|
|
|
* max_avail is the total size so we have to adjust.
|
|
|
|
*/
|
|
|
|
ctl->stripe_size = div_u64(devices_info[ctl->ndevs - 1].max_avail,
|
|
|
|
ctl->dev_stripes);
|
|
|
|
ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
|
|
|
|
|
|
|
|
/* This will have to be fixed for RAID1 and RAID10 over more drives */
|
|
|
|
data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Use the number of data stripes to figure out how big this chunk is
|
|
|
|
* really going to be in terms of logical address space, and compare
|
|
|
|
* that answer with the max chunk size. If it's higher, we try to
|
|
|
|
* reduce stripe_size.
|
|
|
|
*/
|
|
|
|
if (ctl->stripe_size * data_stripes > ctl->max_chunk_size) {
|
|
|
|
/*
|
|
|
|
* Reduce stripe_size, round it up to a 16MB boundary again and
|
|
|
|
* then use it, unless it ends up being even bigger than the
|
|
|
|
* previous value we had already.
|
|
|
|
*/
|
|
|
|
ctl->stripe_size = min(round_up(div_u64(ctl->max_chunk_size,
|
|
|
|
data_stripes), SZ_16M),
|
|
|
|
ctl->stripe_size);
|
|
|
|
}
|
|
|
|
|
btrfs: fix the max chunk size and stripe length calculation
[BEHAVIOR CHANGE]
Since commit f6fca3917b4d ("btrfs: store chunk size in space-info
struct"), btrfs no longer can create larger data chunks than 1G:
mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
mount $dev1 $mnt
btrfs balance start --full $mnt
btrfs balance start --full $mnt
umount $mnt
btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2
Before that offending commit, what we got is a 4G data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
Now what we got is only 1G data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
This will increase the number of data chunks by the number of devices,
not only increase system chunk usage, but also greatly increase mount
time.
Without a proper reason, we should not change the max chunk size.
[CAUSE]
Previously, we set max data chunk size to 10G, while max data stripe
length to 1G.
Commit f6fca3917b4d ("btrfs: store chunk size in space-info struct")
completely ignored the 10G limit, but use 1G max stripe limit instead,
causing above shrink in max data chunk size.
[FIX]
Fix the max data chunk size to 10G, and in decide_stripe_size_regular()
we limit stripe_size to 1G manually.
This should only affect data chunks, as for metadata chunks we always
set the max stripe size the same as max chunk size (256M or 1G
depending on fs size).
Now the same script result the same old result:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-18 15:06:44 +08:00
|
|
|
/* Stripe size should not go beyond 1G. */
|
|
|
|
ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);
|
|
|
|
|
2020-02-25 11:56:13 +08:00
|
|
|
/* Align to BTRFS_STRIPE_LEN */
|
|
|
|
ctl->stripe_size = round_down(ctl->stripe_size, BTRFS_STRIPE_LEN);
|
|
|
|
ctl->chunk_size = ctl->stripe_size * data_stripes;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-02-04 18:21:48 +08:00
|
|
|
static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl,
|
|
|
|
struct btrfs_device_info *devices_info)
|
|
|
|
{
|
|
|
|
u64 zone_size = devices_info[0].dev->zone_info->zone_size;
|
|
|
|
/* Number of stripes that count for block group size */
|
|
|
|
int data_stripes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It should hold because:
|
|
|
|
* dev_extent_min == dev_extent_want == zone_size * dev_stripes
|
|
|
|
*/
|
|
|
|
ASSERT(devices_info[ctl->ndevs - 1].max_avail == ctl->dev_extent_min);
|
|
|
|
|
|
|
|
ctl->stripe_size = zone_size;
|
|
|
|
ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
|
|
|
|
data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies;
|
|
|
|
|
|
|
|
/* stripe_size is fixed in zoned filesysmte. Reduce ndevs instead. */
|
|
|
|
if (ctl->stripe_size * data_stripes > ctl->max_chunk_size) {
|
|
|
|
ctl->ndevs = div_u64(div_u64(ctl->max_chunk_size * ctl->ncopies,
|
|
|
|
ctl->stripe_size) + ctl->nparity,
|
|
|
|
ctl->dev_stripes);
|
|
|
|
ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
|
|
|
|
data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies;
|
|
|
|
ASSERT(ctl->stripe_size * data_stripes <= ctl->max_chunk_size);
|
|
|
|
}
|
|
|
|
|
|
|
|
ctl->chunk_size = ctl->stripe_size * data_stripes;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-02-25 11:56:13 +08:00
|
|
|
static int decide_stripe_size(struct btrfs_fs_devices *fs_devices,
|
|
|
|
struct alloc_chunk_ctl *ctl,
|
|
|
|
struct btrfs_device_info *devices_info)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *info = fs_devices->fs_info;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Round down to number of usable stripes, devs_increment can be any
|
|
|
|
* number so we can't use round_down() that requires power of 2, while
|
|
|
|
* rounddown is safe.
|
|
|
|
*/
|
|
|
|
ctl->ndevs = rounddown(ctl->ndevs, ctl->devs_increment);
|
|
|
|
|
|
|
|
if (ctl->ndevs < ctl->devs_min) {
|
|
|
|
if (btrfs_test_opt(info, ENOSPC_DEBUG)) {
|
|
|
|
btrfs_debug(info,
|
|
|
|
"%s: not enough devices with free space: have=%d minimum required=%d",
|
|
|
|
__func__, ctl->ndevs, ctl->devs_min);
|
|
|
|
}
|
|
|
|
return -ENOSPC;
|
|
|
|
}
|
|
|
|
|
|
|
|
ctl->ndevs = min(ctl->ndevs, ctl->devs_max);
|
|
|
|
|
|
|
|
switch (fs_devices->chunk_alloc_policy) {
|
|
|
|
case BTRFS_CHUNK_ALLOC_REGULAR:
|
|
|
|
return decide_stripe_size_regular(ctl, devices_info);
|
2021-02-04 18:21:48 +08:00
|
|
|
case BTRFS_CHUNK_ALLOC_ZONED:
|
|
|
|
return decide_stripe_size_zoned(ctl, devices_info);
|
2020-02-25 11:56:13 +08:00
|
|
|
default:
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
static void chunk_map_device_set_bits(struct btrfs_chunk_map *map, unsigned int bits)
|
|
|
|
{
|
|
|
|
for (int i = 0; i < map->num_stripes; i++) {
|
|
|
|
struct btrfs_io_stripe *stripe = &map->stripes[i];
|
|
|
|
struct btrfs_device *device = stripe->dev;
|
|
|
|
|
|
|
|
set_extent_bit(&device->alloc_state, stripe->physical,
|
|
|
|
stripe->physical + map->stripe_size - 1,
|
|
|
|
bits | EXTENT_NOWAIT, NULL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
|
|
|
|
{
|
|
|
|
for (int i = 0; i < map->num_stripes; i++) {
|
|
|
|
struct btrfs_io_stripe *stripe = &map->stripes[i];
|
|
|
|
struct btrfs_device *device = stripe->dev;
|
|
|
|
|
|
|
|
__clear_extent_bit(&device->alloc_state, stripe->physical,
|
|
|
|
stripe->physical + map->stripe_size - 1,
|
|
|
|
bits | EXTENT_NOWAIT,
|
|
|
|
NULL, NULL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *map)
|
|
|
|
{
|
|
|
|
write_lock(&fs_info->mapping_tree_lock);
|
|
|
|
rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
|
|
|
|
RB_CLEAR_NODE(&map->rb_node);
|
|
|
|
chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
|
|
|
|
write_unlock(&fs_info->mapping_tree_lock);
|
|
|
|
|
|
|
|
/* Once for the tree reference. */
|
|
|
|
btrfs_free_chunk_map(map);
|
|
|
|
}
|
|
|
|
|
|
|
|
EXPORT_FOR_TESTS
|
|
|
|
int btrfs_add_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *map)
|
|
|
|
{
|
|
|
|
struct rb_node **p;
|
|
|
|
struct rb_node *parent = NULL;
|
|
|
|
bool leftmost = true;
|
|
|
|
|
|
|
|
write_lock(&fs_info->mapping_tree_lock);
|
|
|
|
p = &fs_info->mapping_tree.rb_root.rb_node;
|
|
|
|
while (*p) {
|
|
|
|
struct btrfs_chunk_map *entry;
|
|
|
|
|
|
|
|
parent = *p;
|
|
|
|
entry = rb_entry(parent, struct btrfs_chunk_map, rb_node);
|
|
|
|
|
|
|
|
if (map->start < entry->start) {
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
} else if (map->start > entry->start) {
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
leftmost = false;
|
|
|
|
} else {
|
|
|
|
write_unlock(&fs_info->mapping_tree_lock);
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rb_link_node(&map->rb_node, parent, p);
|
|
|
|
rb_insert_color_cached(&map->rb_node, &fs_info->mapping_tree, leftmost);
|
|
|
|
chunk_map_device_set_bits(map, CHUNK_ALLOCATED);
|
|
|
|
chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
|
|
|
|
write_unlock(&fs_info->mapping_tree_lock);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
EXPORT_FOR_TESTS
|
|
|
|
struct btrfs_chunk_map *btrfs_alloc_chunk_map(int num_stripes, gfp_t gfp)
|
|
|
|
{
|
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
|
|
|
|
map = kmalloc(btrfs_chunk_map_size(num_stripes), gfp);
|
|
|
|
if (!map)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
refcount_set(&map->refs, 1);
|
|
|
|
RB_CLEAR_NODE(&map->rb_node);
|
|
|
|
|
|
|
|
return map;
|
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
|
2020-02-25 11:56:14 +08:00
|
|
|
struct alloc_chunk_ctl *ctl,
|
|
|
|
struct btrfs_device_info *devices_info)
|
2020-02-25 11:56:12 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *info = trans->fs_info;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
struct btrfs_block_group *block_group;
|
2020-02-25 11:56:14 +08:00
|
|
|
u64 start = ctl->start;
|
|
|
|
u64 type = ctl->type;
|
2020-02-25 11:56:12 +08:00
|
|
|
int ret;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_alloc_chunk_map(ctl->num_stripes, GFP_NOFS);
|
2020-02-25 11:56:14 +08:00
|
|
|
if (!map)
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
|
|
|
|
map->start = start;
|
|
|
|
map->chunk_len = ctl->chunk_size;
|
|
|
|
map->stripe_size = ctl->stripe_size;
|
|
|
|
map->type = type;
|
|
|
|
map->io_align = BTRFS_STRIPE_LEN;
|
|
|
|
map->io_width = BTRFS_STRIPE_LEN;
|
|
|
|
map->sub_stripes = ctl->sub_stripes;
|
2020-02-25 11:56:14 +08:00
|
|
|
map->num_stripes = ctl->num_stripes;
|
2020-02-25 11:56:12 +08:00
|
|
|
|
2024-05-21 01:46:44 +08:00
|
|
|
for (int i = 0; i < ctl->ndevs; i++) {
|
|
|
|
for (int j = 0; j < ctl->dev_stripes; j++) {
|
2020-02-25 11:56:14 +08:00
|
|
|
int s = i * ctl->dev_stripes + j;
|
btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-04-12 18:07:57 +08:00
|
|
|
map->stripes[s].dev = devices_info[i].dev;
|
|
|
|
map->stripes[s].physical = devices_info[i].dev_offset +
|
2020-02-25 11:56:14 +08:00
|
|
|
j * ctl->stripe_size;
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2020-02-25 11:56:14 +08:00
|
|
|
trace_btrfs_chunk_alloc(info, map, start, ctl->chunk_size);
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
ret = btrfs_add_chunk_map(info, map);
|
2013-01-31 23:23:04 +08:00
|
|
|
if (ret) {
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
return ERR_PTR(ret);
|
2013-01-31 23:23:04 +08:00
|
|
|
}
|
2017-08-21 17:43:49 +08:00
|
|
|
|
2023-03-21 19:13:45 +08:00
|
|
|
block_group = btrfs_make_block_group(trans, type, start, ctl->chunk_size);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (IS_ERR(block_group)) {
|
|
|
|
btrfs_remove_chunk_map(info, map);
|
|
|
|
return block_group;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
for (int i = 0; i < map->num_stripes; i++) {
|
2019-03-25 20:31:22 +08:00
|
|
|
struct btrfs_device *dev = map->stripes[i].dev;
|
|
|
|
|
2020-02-25 11:56:10 +08:00
|
|
|
btrfs_device_set_bytes_used(dev,
|
2020-02-25 11:56:14 +08:00
|
|
|
dev->bytes_used + ctl->stripe_size);
|
2019-03-25 20:31:22 +08:00
|
|
|
if (list_empty(&dev->post_commit_list))
|
|
|
|
list_add_tail(&dev->post_commit_list,
|
|
|
|
&trans->transaction->dev_update_list);
|
|
|
|
}
|
2014-09-03 21:35:36 +08:00
|
|
|
|
2020-02-25 11:56:14 +08:00
|
|
|
atomic64_sub(ctl->stripe_size * map->num_stripes,
|
2020-02-25 11:56:10 +08:00
|
|
|
&info->free_chunk_space);
|
2014-09-03 21:35:37 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
check_raid56_incompat_flag(info, type);
|
2018-07-11 00:15:05 +08:00
|
|
|
check_raid1c34_incompat_flag(info, type);
|
2013-01-30 07:40:14 +08:00
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
return block_group;
|
2020-02-25 11:56:14 +08:00
|
|
|
}
|
|
|
|
|
2021-08-18 18:41:19 +08:00
|
|
|
struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
u64 type)
|
2020-02-25 11:56:14 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *info = trans->fs_info;
|
|
|
|
struct btrfs_fs_devices *fs_devices = info->fs_devices;
|
|
|
|
struct btrfs_device_info *devices_info = NULL;
|
|
|
|
struct alloc_chunk_ctl ctl;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
struct btrfs_block_group *block_group;
|
2020-02-25 11:56:14 +08:00
|
|
|
int ret;
|
|
|
|
|
2020-03-02 18:29:25 +08:00
|
|
|
lockdep_assert_held(&info->chunk_mutex);
|
|
|
|
|
2020-02-25 11:56:14 +08:00
|
|
|
if (!alloc_profile_is_valid(type, 0)) {
|
|
|
|
ASSERT(0);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
2020-02-25 11:56:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (list_empty(&fs_devices->alloc_list)) {
|
|
|
|
if (btrfs_test_opt(info, ENOSPC_DEBUG))
|
|
|
|
btrfs_debug(info, "%s: no writable device", __func__);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
return ERR_PTR(-ENOSPC);
|
2020-02-25 11:56:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!(type & BTRFS_BLOCK_GROUP_TYPE_MASK)) {
|
|
|
|
btrfs_err(info, "invalid chunk type 0x%llx requested", type);
|
|
|
|
ASSERT(0);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
2020-02-25 11:56:14 +08:00
|
|
|
}
|
|
|
|
|
2020-03-02 18:29:25 +08:00
|
|
|
ctl.start = find_next_chunk(info);
|
2020-02-25 11:56:14 +08:00
|
|
|
ctl.type = type;
|
|
|
|
init_alloc_chunk_ctl(fs_devices, &ctl);
|
|
|
|
|
|
|
|
devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!devices_info)
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2020-02-25 11:56:14 +08:00
|
|
|
|
|
|
|
ret = gather_device_info(fs_devices, &ctl, devices_info);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
block_group = ERR_PTR(ret);
|
2020-02-25 11:56:14 +08:00
|
|
|
goto out;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
}
|
2020-02-25 11:56:14 +08:00
|
|
|
|
|
|
|
ret = decide_stripe_size(fs_devices, &ctl, devices_info);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
block_group = ERR_PTR(ret);
|
2020-02-25 11:56:14 +08:00
|
|
|
goto out;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
}
|
2020-02-25 11:56:14 +08:00
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
block_group = create_chunk(trans, &ctl, devices_info);
|
2020-02-25 11:56:14 +08:00
|
|
|
|
|
|
|
out:
|
2011-01-05 18:07:28 +08:00
|
|
|
kfree(devices_info);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
return block_group;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
/*
|
|
|
|
* This function, btrfs_chunk_alloc_add_chunk_item(), typically belongs to the
|
|
|
|
* phase 1 of chunk allocation. It belongs to phase 2 only when allocating system
|
|
|
|
* chunks.
|
|
|
|
*
|
|
|
|
* See the comment at btrfs_chunk_alloc() for details about the chunk allocation
|
|
|
|
* phases.
|
|
|
|
*/
|
|
|
|
int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_block_group *bg)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
struct btrfs_root *chunk_root = fs_info->chunk_root;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_chunk *chunk;
|
|
|
|
struct btrfs_stripe *stripe;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
size_t item_size;
|
|
|
|
int i;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We take the chunk_mutex for 2 reasons:
|
|
|
|
*
|
|
|
|
* 1) Updates and insertions in the chunk btree must be done while holding
|
|
|
|
* the chunk_mutex, as well as updating the system chunk array in the
|
|
|
|
* superblock. See the comment on top of btrfs_chunk_alloc() for the
|
|
|
|
* details;
|
|
|
|
*
|
|
|
|
* 2) To prevent races with the final phase of a device replace operation
|
|
|
|
* that replaces the device object associated with the map's stripes,
|
|
|
|
* because the device object's id can change at any time during that
|
|
|
|
* final phase of the device replace operation
|
|
|
|
* (dev-replace.c:btrfs_dev_replace_finishing()), so we could grab the
|
|
|
|
* replaced device and then see it with an ID of BTRFS_DEV_REPLACE_DEVID,
|
|
|
|
* which would cause a failure when updating the device item, which does
|
|
|
|
* not exists, or persisting a stripe of the chunk item with such ID.
|
|
|
|
* Here we can't use the device_list_mutex because our caller already
|
|
|
|
* has locked the chunk_mutex, and the final phase of device replace
|
|
|
|
* acquires both mutexes - first the device_list_mutex and then the
|
|
|
|
* chunk_mutex. Using any of those two mutexes protects us from a
|
|
|
|
* concurrent device replace.
|
|
|
|
*/
|
|
|
|
lockdep_assert_held(&fs_info->chunk_mutex);
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_get_chunk_map(fs_info, bg->start, bg->length);
|
|
|
|
if (IS_ERR(map)) {
|
|
|
|
ret = PTR_ERR(map);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
item_size = btrfs_chunk_item_size(map->num_stripes);
|
|
|
|
|
|
|
|
chunk = kzalloc(item_size, GFP_NOFS);
|
|
|
|
if (!chunk) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
Btrfs: fix race when finishing dev replace leading to transaction abort
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:
[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
[23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
[23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---
This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:
CPU 1 CPU 2
btrfs_dev_replace_finishing()
at this point
dev_replace->tgtdev->devid ==
BTRFS_DEV_REPLACE_DEVID (0ULL)
...
btrfs_start_transaction()
btrfs_commit_transaction()
btrfs_fallocate()
btrfs_alloc_data_chunk_ondemand()
btrfs_join_transaction()
--> starts a new transaction
do_chunk_alloc()
lock fs_info->chunk_mutex
btrfs_alloc_chunk()
--> creates extent map for
the new chunk with
em->bdev->map->stripes[i]->dev->devid
== X (X > 0)
--> extent map is added to
fs_info->mapping_tree
--> initial phase of bg A
allocation completes
unlock fs_info->chunk_mutex
lock fs_info->chunk_mutex
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates fs_info->mapping_tree and
replaces the device in every extent
map's map->stripes[] with
dev_replace->tgtdev, which still has
an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
btrfs_end_transaction()
btrfs_create_pending_block_groups()
--> starts final phase of
bg A creation (update device,
extent, and chunk trees, etc)
btrfs_finish_chunk_alloc()
btrfs_update_device()
--> attempts to update a device
item with ID == 0ULL
(BTRFS_DEV_REPLACE_DEVID)
which is the current ID of
bg A's
em->bdev->map->stripes[i]->dev->devid
--> doesn't find such item
returns -ENOENT
--> the device id should have been X
and not 0ULL
got -ENOENT from
btrfs_finish_chunk_alloc()
and aborts current transaction
finishes setting up the target device,
namely it sets tgtdev->devid to the value
of srcdev->devid, which is X (and X > 0)
frees the srcdev
unlock fs_info->chunk_mutex
So fix this by taking the device list mutex when processing the chunk's
extent map stripes to update the device items. This avoids getting the
wrong device id and use-after-free problems if the task finishing a
chunk allocation grabs the replaced device, which is freed while the
dev replace task is holding the device list mutex.
This happened while running fstest btrfs/071.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-11-20 18:42:47 +08:00
|
|
|
goto out;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
struct btrfs_device *device = map->stripes[i].dev;
|
|
|
|
|
|
|
|
ret = btrfs_update_device(trans, device);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
stripe = &chunk->stripe;
|
2013-06-28 01:22:46 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
struct btrfs_device *device = map->stripes[i].dev;
|
|
|
|
const u64 dev_offset = map->stripes[i].physical;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-04-16 03:41:47 +08:00
|
|
|
btrfs_set_stack_stripe_devid(stripe, device->devid);
|
|
|
|
btrfs_set_stack_stripe_offset(stripe, dev_offset);
|
|
|
|
memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
|
2008-11-18 10:11:30 +08:00
|
|
|
stripe++;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
btrfs_set_stack_chunk_length(chunk, bg->length);
|
2021-11-06 04:45:41 +08:00
|
|
|
btrfs_set_stack_chunk_owner(chunk, BTRFS_EXTENT_TREE_OBJECTID);
|
2023-02-17 13:36:58 +08:00
|
|
|
btrfs_set_stack_chunk_stripe_len(chunk, BTRFS_STRIPE_LEN);
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_set_stack_chunk_type(chunk, map->type);
|
|
|
|
btrfs_set_stack_chunk_num_stripes(chunk, map->num_stripes);
|
2023-02-17 13:36:58 +08:00
|
|
|
btrfs_set_stack_chunk_io_align(chunk, BTRFS_STRIPE_LEN);
|
|
|
|
btrfs_set_stack_chunk_io_width(chunk, BTRFS_STRIPE_LEN);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_set_stack_chunk_sector_size(chunk, fs_info->sectorsize);
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_set_stack_chunk_sub_stripes(chunk, map->sub_stripes);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
|
|
|
key.type = BTRFS_CHUNK_ITEM_KEY;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
key.offset = bg->start;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = btrfs_insert_item(trans, chunk_root, &key, chunk, item_size);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
2022-07-16 03:45:24 +08:00
|
|
|
set_bit(BLOCK_GROUP_FLAG_CHUNK_ITEM_INSERTED, &bg->runtime_flags);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_add_system_chunk(fs_info, &key, chunk, item_size);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
2008-04-26 04:53:30 +08:00
|
|
|
}
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
|
2013-06-28 01:22:46 +08:00
|
|
|
out:
|
2008-03-25 03:01:56 +08:00
|
|
|
kfree(chunk);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2011-08-11 03:32:10 +08:00
|
|
|
return ret;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2019-03-20 23:29:13 +08:00
|
|
|
static noinline int init_first_rw_device(struct btrfs_trans_handle *trans)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
2019-03-20 23:29:13 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2008-11-18 10:11:30 +08:00
|
|
|
u64 alloc_profile;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
struct btrfs_block_group *meta_bg;
|
|
|
|
struct btrfs_block_group *sys_bg;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When adding a new device for sprouting, the seed device is read-only
|
|
|
|
* so we must first allocate a metadata and a system chunk. But before
|
|
|
|
* adding the block group items to the extent, device and chunk btrees,
|
|
|
|
* we must first:
|
|
|
|
*
|
|
|
|
* 1) Create both chunks without doing any changes to the btrees, as
|
|
|
|
* otherwise we would get -ENOSPC since the block groups from the
|
|
|
|
* seed device are read-only;
|
|
|
|
*
|
|
|
|
* 2) Add the device item for the new sprout device - finishing the setup
|
|
|
|
* of a new block group requires updating the device item in the chunk
|
|
|
|
* btree, so it must exist when we attempt to do it. The previous step
|
|
|
|
* ensures this does not fail with -ENOSPC.
|
|
|
|
*
|
|
|
|
* After that we can add the block group items to their btrees:
|
|
|
|
* update existing device item in the chunk btree, add a new block group
|
|
|
|
* item to the extent btree, add a new chunk item to the chunk btree and
|
|
|
|
* finally add the new device extent items to the devices btree.
|
|
|
|
*/
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-05-17 23:38:35 +08:00
|
|
|
alloc_profile = btrfs_metadata_alloc_profile(fs_info);
|
2021-08-18 18:41:19 +08:00
|
|
|
meta_bg = btrfs_create_chunk(trans, alloc_profile);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (IS_ERR(meta_bg))
|
|
|
|
return PTR_ERR(meta_bg);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2017-05-17 23:38:35 +08:00
|
|
|
alloc_profile = btrfs_system_alloc_profile(fs_info);
|
2021-08-18 18:41:19 +08:00
|
|
|
sys_bg = btrfs_create_chunk(trans, alloc_profile);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
if (IS_ERR(sys_bg))
|
|
|
|
return PTR_ERR(sys_bg);
|
|
|
|
|
|
|
|
return 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
static inline int btrfs_chunk_max_errors(struct btrfs_chunk_map *map)
|
2014-07-03 18:22:13 +08:00
|
|
|
{
|
2019-05-17 17:43:22 +08:00
|
|
|
const int index = btrfs_bg_flags_to_raid_index(map->type);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2019-05-17 17:43:22 +08:00
|
|
|
return btrfs_raid_array[index].tolerated_failures;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2021-08-24 13:27:42 +08:00
|
|
|
bool btrfs_chunk_writeable(struct btrfs_fs_info *fs_info, u64 chunk_offset)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2014-07-03 18:22:13 +08:00
|
|
|
int miss_ndevs = 0;
|
2008-11-18 10:11:30 +08:00
|
|
|
int i;
|
2021-08-24 13:27:42 +08:00
|
|
|
bool ret = true;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
|
|
|
|
if (IS_ERR(map))
|
2021-08-24 13:27:42 +08:00
|
|
|
return false;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
2017-12-04 12:54:54 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_MISSING,
|
|
|
|
&map->stripes[i].dev->dev_state)) {
|
2014-07-03 18:22:13 +08:00
|
|
|
miss_ndevs++;
|
|
|
|
continue;
|
|
|
|
}
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE,
|
|
|
|
&map->stripes[i].dev->dev_state)) {
|
2021-08-24 13:27:42 +08:00
|
|
|
ret = false;
|
2014-07-03 18:22:13 +08:00
|
|
|
goto end;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
}
|
2014-07-03 18:22:13 +08:00
|
|
|
|
|
|
|
/*
|
2021-08-24 13:27:42 +08:00
|
|
|
* If the number of missing devices is larger than max errors, we can
|
|
|
|
* not write the data into that chunk successfully.
|
2014-07-03 18:22:13 +08:00
|
|
|
*/
|
|
|
|
if (miss_ndevs > btrfs_chunk_max_errors(map))
|
2021-08-24 13:27:42 +08:00
|
|
|
ret = false;
|
2014-07-03 18:22:13 +08:00
|
|
|
end:
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2021-08-24 13:27:42 +08:00
|
|
|
return ret;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
void btrfs_mapping_tree_free(struct btrfs_fs_info *fs_info)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
write_lock(&fs_info->mapping_tree_lock);
|
|
|
|
while (!RB_EMPTY_ROOT(&fs_info->mapping_tree.rb_root)) {
|
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
struct rb_node *node;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
node = rb_first_cached(&fs_info->mapping_tree);
|
|
|
|
map = rb_entry(node, struct btrfs_chunk_map, rb_node);
|
|
|
|
rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
|
|
|
|
RB_CLEAR_NODE(&map->rb_node);
|
|
|
|
chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
|
|
|
|
/* Once for the tree ref. */
|
|
|
|
btrfs_free_chunk_map(map);
|
|
|
|
cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
write_unlock(&fs_info->mapping_tree_lock);
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
2024-08-13 19:36:40 +08:00
|
|
|
static int btrfs_chunk_map_num_copies(const struct btrfs_chunk_map *map)
|
|
|
|
{
|
|
|
|
enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(map->type);
|
|
|
|
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID5)
|
|
|
|
return 2;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There could be two corrupted data stripes, we need to loop retry in
|
|
|
|
* order to rebuild the correct data.
|
|
|
|
*
|
|
|
|
* Fail a stripe at a time on every retry except the stripe under
|
|
|
|
* reconstruction.
|
|
|
|
*/
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID6)
|
|
|
|
return map->num_stripes;
|
|
|
|
|
|
|
|
/* Non-RAID56, use their ncopies from btrfs_raid_array. */
|
|
|
|
return btrfs_raid_array[index].ncopies;
|
|
|
|
}
|
|
|
|
|
2012-11-05 21:59:07 +08:00
|
|
|
int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
|
2008-04-10 04:28:12 +08:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2024-08-13 19:36:40 +08:00
|
|
|
int ret;
|
2008-04-10 04:28:12 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_get_chunk_map(fs_info, logical, len);
|
|
|
|
if (IS_ERR(map))
|
2017-03-15 04:33:55 +08:00
|
|
|
/*
|
|
|
|
* We could return errors for these cases, but that could get
|
|
|
|
* ugly and we'd probably do the same thing which is just not do
|
|
|
|
* anything else and exit, so return 1 so the callers don't try
|
|
|
|
* to use other copies.
|
|
|
|
*/
|
2013-04-23 22:53:18 +08:00
|
|
|
return 1;
|
|
|
|
|
2024-08-13 19:36:40 +08:00
|
|
|
ret = btrfs_chunk_map_num_copies(map);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2008-04-10 04:28:12 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
|
2013-01-30 07:40:14 +08:00
|
|
|
u64 logical)
|
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2016-06-23 06:54:23 +08:00
|
|
|
unsigned long len = fs_info->sectorsize;
|
2013-01-30 07:40:14 +08:00
|
|
|
|
2022-05-22 19:47:47 +08:00
|
|
|
if (!btrfs_fs_incompat(fs_info, RAID56))
|
|
|
|
return len;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_get_chunk_map(fs_info, logical, len);
|
2013-01-30 07:40:14 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (!WARN_ON(IS_ERR(map))) {
|
2017-07-11 21:55:51 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
|
2023-06-22 14:42:40 +08:00
|
|
|
len = btrfs_stripe_nr_to_offset(nr_data_stripes(map));
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2017-07-11 21:55:51 +08:00
|
|
|
}
|
2013-01-30 07:40:14 +08:00
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
2017-07-19 15:48:42 +08:00
|
|
|
int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
|
2013-01-30 07:40:14 +08:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2013-01-30 07:40:14 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2022-05-22 19:47:47 +08:00
|
|
|
if (!btrfs_fs_incompat(fs_info, RAID56))
|
|
|
|
return 0;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_get_chunk_map(fs_info, logical, len);
|
2013-01-30 07:40:14 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (!WARN_ON(IS_ERR(map))) {
|
2017-07-11 21:55:51 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
|
|
|
|
ret = 1;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2017-07-11 21:55:51 +08:00
|
|
|
}
|
2013-01-30 07:40:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-11-06 21:52:18 +08:00
|
|
|
static int find_live_mirror(struct btrfs_fs_info *fs_info,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map, int first,
|
2018-03-14 16:29:13 +08:00
|
|
|
int dev_replace_is_ongoing)
|
2008-05-14 01:46:40 +08:00
|
|
|
{
|
2024-02-02 12:23:28 +08:00
|
|
|
const enum btrfs_read_policy policy = READ_ONCE(fs_info->fs_devices->read_policy);
|
2008-05-14 01:46:40 +08:00
|
|
|
int i;
|
2018-03-14 16:29:12 +08:00
|
|
|
int num_stripes;
|
2018-03-14 16:29:13 +08:00
|
|
|
int preferred_mirror;
|
2012-11-06 21:52:18 +08:00
|
|
|
int tolerance;
|
|
|
|
struct btrfs_device *srcdev;
|
|
|
|
|
2018-03-14 16:29:12 +08:00
|
|
|
ASSERT((map->type &
|
2019-05-31 21:39:31 +08:00
|
|
|
(BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10)));
|
2018-03-14 16:29:12 +08:00
|
|
|
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID10)
|
|
|
|
num_stripes = map->sub_stripes;
|
|
|
|
else
|
|
|
|
num_stripes = map->num_stripes;
|
|
|
|
|
2024-02-02 12:23:28 +08:00
|
|
|
switch (policy) {
|
2020-10-28 21:14:46 +08:00
|
|
|
default:
|
|
|
|
/* Shouldn't happen, just warn and use pid instead of failing */
|
2024-02-02 12:23:28 +08:00
|
|
|
btrfs_warn_rl(fs_info, "unknown read_policy type %u, reset to pid",
|
|
|
|
policy);
|
|
|
|
WRITE_ONCE(fs_info->fs_devices->read_policy, BTRFS_READ_POLICY_PID);
|
2020-10-28 21:14:46 +08:00
|
|
|
fallthrough;
|
|
|
|
case BTRFS_READ_POLICY_PID:
|
|
|
|
preferred_mirror = first + (current->pid % num_stripes);
|
|
|
|
break;
|
|
|
|
}
|
2018-03-14 16:29:13 +08:00
|
|
|
|
2012-11-06 21:52:18 +08:00
|
|
|
if (dev_replace_is_ongoing &&
|
|
|
|
fs_info->dev_replace.cont_reading_from_srcdev_mode ==
|
|
|
|
BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
|
|
|
|
srcdev = fs_info->dev_replace.srcdev;
|
|
|
|
else
|
|
|
|
srcdev = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* try to avoid the drive that is the source drive for a
|
|
|
|
* dev-replace procedure, only choose it if no other non-missing
|
|
|
|
* mirror is available
|
|
|
|
*/
|
|
|
|
for (tolerance = 0; tolerance < 2; tolerance++) {
|
2018-03-14 16:29:13 +08:00
|
|
|
if (map->stripes[preferred_mirror].dev->bdev &&
|
|
|
|
(tolerance || map->stripes[preferred_mirror].dev != srcdev))
|
|
|
|
return preferred_mirror;
|
2018-03-14 16:29:12 +08:00
|
|
|
for (i = first; i < first + num_stripes; i++) {
|
2012-11-06 21:52:18 +08:00
|
|
|
if (map->stripes[i].dev->bdev &&
|
|
|
|
(tolerance || map->stripes[i].dev != srcdev))
|
|
|
|
return i;
|
|
|
|
}
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
2012-11-06 21:52:18 +08:00
|
|
|
|
2008-05-14 01:46:40 +08:00
|
|
|
/* we couldn't find one that doesn't fail. Just return something
|
|
|
|
* and the io error handling code will clean up eventually
|
|
|
|
*/
|
2018-03-14 16:29:13 +08:00
|
|
|
return preferred_mirror;
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
|
|
|
|
2021-09-23 14:00:08 +08:00
|
|
|
static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
|
2023-09-15 00:06:58 +08:00
|
|
|
u64 logical,
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
u16 total_stripes)
|
2015-01-20 15:11:34 +08:00
|
|
|
{
|
2023-02-07 12:26:13 +08:00
|
|
|
struct btrfs_io_context *bioc;
|
|
|
|
|
|
|
|
bioc = kzalloc(
|
2021-09-15 15:17:16 +08:00
|
|
|
/* The size of btrfs_io_context */
|
|
|
|
sizeof(struct btrfs_io_context) +
|
|
|
|
/* Plus the variable array for the stripes */
|
2023-02-17 13:37:03 +08:00
|
|
|
sizeof(struct btrfs_io_stripe) * (total_stripes),
|
2022-10-26 09:36:11 +08:00
|
|
|
GFP_NOFS);
|
|
|
|
|
|
|
|
if (!bioc)
|
|
|
|
return NULL;
|
2015-01-20 15:11:34 +08:00
|
|
|
|
2021-09-15 15:17:16 +08:00
|
|
|
refcount_set(&bioc->refs, 1);
|
2015-01-20 15:11:34 +08:00
|
|
|
|
2021-09-23 14:00:08 +08:00
|
|
|
bioc->fs_info = fs_info;
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
bioc->replace_stripe_src = -1;
|
2023-02-17 13:37:03 +08:00
|
|
|
bioc->full_stripe_logical = (u64)-1;
|
2023-09-15 00:06:58 +08:00
|
|
|
bioc->logical = logical;
|
2020-07-02 21:46:41 +08:00
|
|
|
|
2021-09-15 15:17:16 +08:00
|
|
|
return bioc;
|
2015-01-20 15:11:34 +08:00
|
|
|
}
|
|
|
|
|
2021-09-15 15:17:16 +08:00
|
|
|
void btrfs_get_bioc(struct btrfs_io_context *bioc)
|
2015-01-20 15:11:34 +08:00
|
|
|
{
|
2021-09-15 15:17:16 +08:00
|
|
|
WARN_ON(!refcount_read(&bioc->refs));
|
|
|
|
refcount_inc(&bioc->refs);
|
2015-01-20 15:11:34 +08:00
|
|
|
}
|
|
|
|
|
2021-09-15 15:17:16 +08:00
|
|
|
void btrfs_put_bioc(struct btrfs_io_context *bioc)
|
2015-01-20 15:11:34 +08:00
|
|
|
{
|
2021-09-15 15:17:16 +08:00
|
|
|
if (!bioc)
|
2015-01-20 15:11:34 +08:00
|
|
|
return;
|
2021-09-15 15:17:16 +08:00
|
|
|
if (refcount_dec_and_test(&bioc->refs))
|
|
|
|
kfree(bioc);
|
2015-01-20 15:11:34 +08:00
|
|
|
}
|
|
|
|
|
2017-03-15 04:33:56 +08:00
|
|
|
/*
|
|
|
|
* Please note that, discard won't be sent to target device of device
|
|
|
|
* replace.
|
|
|
|
*/
|
2022-06-03 14:57:25 +08:00
|
|
|
struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 *length_ret,
|
|
|
|
u32 *num_stripes)
|
2017-03-15 04:33:56 +08:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2022-06-03 14:57:25 +08:00
|
|
|
struct btrfs_discard_stripe *stripes;
|
btrfs: Ensure we trim ranges across block group boundary
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.
- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-23 21:57:27 +08:00
|
|
|
u64 length = *length_ret;
|
2017-03-15 04:33:56 +08:00
|
|
|
u64 offset;
|
2023-02-17 13:36:59 +08:00
|
|
|
u32 stripe_nr;
|
|
|
|
u32 stripe_nr_end;
|
|
|
|
u32 stripe_cnt;
|
2017-03-15 04:33:56 +08:00
|
|
|
u64 stripe_end_offset;
|
|
|
|
u64 stripe_offset;
|
|
|
|
u32 stripe_index;
|
|
|
|
u32 factor = 0;
|
|
|
|
u32 sub_stripes = 0;
|
2023-02-17 13:36:59 +08:00
|
|
|
u32 stripes_per_dev = 0;
|
2017-03-15 04:33:56 +08:00
|
|
|
u32 remaining_stripes = 0;
|
|
|
|
u32 last_stripe = 0;
|
2022-06-03 14:57:25 +08:00
|
|
|
int ret;
|
2017-03-15 04:33:56 +08:00
|
|
|
int i;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_get_chunk_map(fs_info, logical, length);
|
|
|
|
if (IS_ERR(map))
|
|
|
|
return ERR_CAST(map);
|
2022-06-03 14:57:25 +08:00
|
|
|
|
2017-03-15 04:33:56 +08:00
|
|
|
/* we don't discard raid56 yet */
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
|
|
|
|
ret = -EOPNOTSUPP;
|
2022-06-03 14:57:25 +08:00
|
|
|
goto out_free_map;
|
2023-02-17 13:36:58 +08:00
|
|
|
}
|
2017-03-15 04:33:56 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
offset = logical - map->start;
|
|
|
|
length = min_t(u64, map->start + map->chunk_len - logical, length);
|
btrfs: Ensure we trim ranges across block group boundary
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.
- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-23 21:57:27 +08:00
|
|
|
*length_ret = length;
|
2017-03-15 04:33:56 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* stripe_nr counts the total number of stripes we have to stride
|
|
|
|
* to get to this block
|
|
|
|
*/
|
2023-02-17 13:36:58 +08:00
|
|
|
stripe_nr = offset >> BTRFS_STRIPE_LEN_SHIFT;
|
2017-03-15 04:33:56 +08:00
|
|
|
|
|
|
|
/* stripe_offset is the offset of this block in its stripe */
|
2023-06-22 14:42:40 +08:00
|
|
|
stripe_offset = offset - btrfs_stripe_nr_to_offset(stripe_nr);
|
2017-03-15 04:33:56 +08:00
|
|
|
|
2023-02-17 13:36:58 +08:00
|
|
|
stripe_nr_end = round_up(offset + length, BTRFS_STRIPE_LEN) >>
|
|
|
|
BTRFS_STRIPE_LEN_SHIFT;
|
2017-03-15 04:33:56 +08:00
|
|
|
stripe_cnt = stripe_nr_end - stripe_nr;
|
2023-06-22 14:42:40 +08:00
|
|
|
stripe_end_offset = btrfs_stripe_nr_to_offset(stripe_nr_end) -
|
2017-03-15 04:33:56 +08:00
|
|
|
(offset + length);
|
|
|
|
/*
|
|
|
|
* after this, stripe_nr is the number of stripes on this
|
|
|
|
* device we have to walk to find the data, and stripe_index is
|
|
|
|
* the number of our device in the stripe array
|
|
|
|
*/
|
2022-06-03 14:57:25 +08:00
|
|
|
*num_stripes = 1;
|
2017-03-15 04:33:56 +08:00
|
|
|
stripe_index = 0;
|
|
|
|
if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10)) {
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID0)
|
|
|
|
sub_stripes = 1;
|
|
|
|
else
|
|
|
|
sub_stripes = map->sub_stripes;
|
|
|
|
|
|
|
|
factor = map->num_stripes / sub_stripes;
|
2022-06-03 14:57:25 +08:00
|
|
|
*num_stripes = min_t(u64, map->num_stripes,
|
2017-03-15 04:33:56 +08:00
|
|
|
sub_stripes * stripe_cnt);
|
2023-02-17 13:36:59 +08:00
|
|
|
stripe_index = stripe_nr % factor;
|
|
|
|
stripe_nr /= factor;
|
2017-03-15 04:33:56 +08:00
|
|
|
stripe_index *= sub_stripes;
|
2023-02-17 13:36:59 +08:00
|
|
|
|
|
|
|
remaining_stripes = stripe_cnt % factor;
|
|
|
|
stripes_per_dev = stripe_cnt / factor;
|
|
|
|
last_stripe = ((stripe_nr_end - 1) % factor) * sub_stripes;
|
2019-05-31 21:39:31 +08:00
|
|
|
} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1_MASK |
|
2017-03-15 04:33:56 +08:00
|
|
|
BTRFS_BLOCK_GROUP_DUP)) {
|
2022-06-03 14:57:25 +08:00
|
|
|
*num_stripes = map->num_stripes;
|
2017-03-15 04:33:56 +08:00
|
|
|
} else {
|
2023-02-17 13:36:59 +08:00
|
|
|
stripe_index = stripe_nr % map->num_stripes;
|
|
|
|
stripe_nr /= map->num_stripes;
|
2017-03-15 04:33:56 +08:00
|
|
|
}
|
|
|
|
|
2022-06-03 14:57:25 +08:00
|
|
|
stripes = kcalloc(*num_stripes, sizeof(*stripes), GFP_NOFS);
|
|
|
|
if (!stripes) {
|
2017-03-15 04:33:56 +08:00
|
|
|
ret = -ENOMEM;
|
2022-06-03 14:57:25 +08:00
|
|
|
goto out_free_map;
|
2017-03-15 04:33:56 +08:00
|
|
|
}
|
|
|
|
|
2022-06-03 14:57:25 +08:00
|
|
|
for (i = 0; i < *num_stripes; i++) {
|
|
|
|
stripes[i].physical =
|
2017-03-15 04:33:56 +08:00
|
|
|
map->stripes[stripe_index].physical +
|
2023-06-22 14:42:40 +08:00
|
|
|
stripe_offset + btrfs_stripe_nr_to_offset(stripe_nr);
|
2022-06-03 14:57:25 +08:00
|
|
|
stripes[i].dev = map->stripes[stripe_index].dev;
|
2017-03-15 04:33:56 +08:00
|
|
|
|
|
|
|
if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10)) {
|
2023-06-22 14:42:40 +08:00
|
|
|
stripes[i].length = btrfs_stripe_nr_to_offset(stripes_per_dev);
|
2017-03-15 04:33:56 +08:00
|
|
|
|
|
|
|
if (i / sub_stripes < remaining_stripes)
|
2023-02-17 13:36:58 +08:00
|
|
|
stripes[i].length += BTRFS_STRIPE_LEN;
|
2017-03-15 04:33:56 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Special for the first stripe and
|
|
|
|
* the last stripe:
|
|
|
|
*
|
|
|
|
* |-------|...|-------|
|
|
|
|
* |----------|
|
|
|
|
* off end_off
|
|
|
|
*/
|
|
|
|
if (i < sub_stripes)
|
2022-06-03 14:57:25 +08:00
|
|
|
stripes[i].length -= stripe_offset;
|
2017-03-15 04:33:56 +08:00
|
|
|
|
|
|
|
if (stripe_index >= last_stripe &&
|
|
|
|
stripe_index <= (last_stripe +
|
|
|
|
sub_stripes - 1))
|
2022-06-03 14:57:25 +08:00
|
|
|
stripes[i].length -= stripe_end_offset;
|
2017-03-15 04:33:56 +08:00
|
|
|
|
|
|
|
if (i == sub_stripes - 1)
|
|
|
|
stripe_offset = 0;
|
|
|
|
} else {
|
2022-06-03 14:57:25 +08:00
|
|
|
stripes[i].length = length;
|
2017-03-15 04:33:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
stripe_index++;
|
|
|
|
if (stripe_index == map->num_stripes) {
|
|
|
|
stripe_index = 0;
|
|
|
|
stripe_nr++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2022-06-03 14:57:25 +08:00
|
|
|
return stripes;
|
|
|
|
out_free_map:
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2022-06-03 14:57:25 +08:00
|
|
|
return ERR_PTR(ret);
|
2017-03-15 04:33:56 +08:00
|
|
|
}
|
|
|
|
|
2021-02-04 18:22:12 +08:00
|
|
|
static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group *cache;
|
|
|
|
bool ret;
|
|
|
|
|
2021-02-04 18:22:13 +08:00
|
|
|
/* Non zoned filesystem does not use "to_copy" flag */
|
2021-02-04 18:22:12 +08:00
|
|
|
if (!btrfs_is_zoned(fs_info))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, logical);
|
|
|
|
|
2022-07-16 03:45:24 +08:00
|
|
|
ret = test_bit(BLOCK_GROUP_FLAG_TO_COPY, &cache->runtime_flags);
|
2021-02-04 18:22:12 +08:00
|
|
|
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2024-05-08 19:14:47 +08:00
|
|
|
static void handle_ops_on_dev_replace(struct btrfs_io_context *bioc,
|
2017-03-15 04:33:58 +08:00
|
|
|
struct btrfs_dev_replace *dev_replace,
|
2021-02-04 18:22:12 +08:00
|
|
|
u64 logical,
|
2024-05-08 19:14:47 +08:00
|
|
|
struct btrfs_io_geometry *io_geom)
|
2017-03-15 04:33:58 +08:00
|
|
|
{
|
|
|
|
u64 srcdev_devid = dev_replace->srcdev->devid;
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
/*
|
|
|
|
* At this stage, num_stripes is still the real number of stripes,
|
|
|
|
* excluding the duplicated stripes.
|
|
|
|
*/
|
2024-05-08 19:14:47 +08:00
|
|
|
int num_stripes = io_geom->num_stripes;
|
|
|
|
int max_errors = io_geom->max_errors;
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
int nr_extra_stripes = 0;
|
2017-03-15 04:33:58 +08:00
|
|
|
int i;
|
|
|
|
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
/*
|
|
|
|
* A block group which has "to_copy" set will eventually be copied by
|
|
|
|
* the dev-replace process. We can avoid cloning IO here.
|
|
|
|
*/
|
|
|
|
if (is_block_group_to_copy(dev_replace->srcdev->fs_info, logical))
|
|
|
|
return;
|
2017-03-15 04:33:58 +08:00
|
|
|
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
/*
|
|
|
|
* Duplicate the write operations while the dev-replace procedure is
|
|
|
|
* running. Since the copying of the old disk to the new disk takes
|
|
|
|
* place at run time while the filesystem is mounted writable, the
|
|
|
|
* regular write operations to the old disk have to be duplicated to go
|
|
|
|
* to the new disk as well.
|
|
|
|
*
|
|
|
|
* Note that device->missing is handled by the caller, and that the
|
|
|
|
* write to the old disk is already set up in the stripes array.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
struct btrfs_io_stripe *old = &bioc->stripes[i];
|
|
|
|
struct btrfs_io_stripe *new = &bioc->stripes[num_stripes + nr_extra_stripes];
|
2021-02-04 18:22:12 +08:00
|
|
|
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
if (old->dev->devid != srcdev_devid)
|
|
|
|
continue;
|
2017-03-15 04:33:58 +08:00
|
|
|
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
new->physical = old->physical;
|
|
|
|
new->dev = dev_replace->tgtdev;
|
|
|
|
if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
|
|
|
|
bioc->replace_stripe_src = i;
|
|
|
|
nr_extra_stripes++;
|
|
|
|
}
|
2017-03-15 04:33:58 +08:00
|
|
|
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
/* We can only have at most 2 extra nr_stripes (for DUP). */
|
|
|
|
ASSERT(nr_extra_stripes <= 2);
|
|
|
|
/*
|
|
|
|
* For GET_READ_MIRRORS, we can only return at most 1 extra stripe for
|
|
|
|
* replace.
|
|
|
|
* If we have 2 extra stripes, only choose the one with smaller physical.
|
|
|
|
*/
|
2024-05-08 19:14:47 +08:00
|
|
|
if (io_geom->op == BTRFS_MAP_GET_READ_MIRRORS && nr_extra_stripes == 2) {
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
struct btrfs_io_stripe *first = &bioc->stripes[num_stripes];
|
|
|
|
struct btrfs_io_stripe *second = &bioc->stripes[num_stripes + 1];
|
2017-03-15 04:33:58 +08:00
|
|
|
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
/* Only DUP can have two extra stripes. */
|
|
|
|
ASSERT(bioc->map_type & BTRFS_BLOCK_GROUP_DUP);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Swap the last stripe stripes and reduce @nr_extra_stripes.
|
|
|
|
* The extra stripe would still be there, but won't be accessed.
|
|
|
|
*/
|
|
|
|
if (first->physical > second->physical) {
|
|
|
|
swap(second->physical, first->physical);
|
|
|
|
swap(second->dev, first->dev);
|
|
|
|
nr_extra_stripes--;
|
2017-03-15 04:33:58 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2024-05-08 19:14:47 +08:00
|
|
|
io_geom->num_stripes = num_stripes + nr_extra_stripes;
|
|
|
|
io_geom->max_errors = max_errors + nr_extra_stripes;
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
bioc->replace_nr_stripes = nr_extra_stripes;
|
2017-03-15 04:33:58 +08:00
|
|
|
}
|
|
|
|
|
2023-12-13 22:43:08 +08:00
|
|
|
static u64 btrfs_max_io_len(struct btrfs_chunk_map *map, u64 offset,
|
|
|
|
struct btrfs_io_geometry *io_geom)
|
2019-06-03 17:05:03 +08:00
|
|
|
{
|
2022-04-12 17:32:51 +08:00
|
|
|
/*
|
2023-01-21 14:50:25 +08:00
|
|
|
* Stripe_nr is the stripe where this block falls. stripe_offset is
|
|
|
|
* the offset of this block in its stripe.
|
2022-04-12 17:32:51 +08:00
|
|
|
*/
|
2023-12-13 22:43:08 +08:00
|
|
|
io_geom->stripe_offset = offset & BTRFS_STRIPE_LEN_MASK;
|
|
|
|
io_geom->stripe_nr = offset >> BTRFS_STRIPE_LEN_SHIFT;
|
|
|
|
ASSERT(io_geom->stripe_offset < U32_MAX);
|
2019-06-03 17:05:03 +08:00
|
|
|
|
2023-01-21 14:50:25 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
|
2023-06-22 14:42:40 +08:00
|
|
|
unsigned long full_stripe_len =
|
|
|
|
btrfs_stripe_nr_to_offset(nr_data_stripes(map));
|
2019-06-03 17:05:03 +08:00
|
|
|
|
2023-02-17 13:36:58 +08:00
|
|
|
/*
|
|
|
|
* For full stripe start, we use previously calculated
|
|
|
|
* @stripe_nr. Align it to nr_data_stripes, then multiply with
|
|
|
|
* STRIPE_LEN.
|
|
|
|
*
|
|
|
|
* By this we can avoid u64 division completely. And we have
|
|
|
|
* to go rounddown(), not round_down(), as nr_data_stripes is
|
|
|
|
* not ensured to be power of 2.
|
|
|
|
*/
|
2023-12-13 22:43:08 +08:00
|
|
|
io_geom->raid56_full_stripe_start = btrfs_stripe_nr_to_offset(
|
|
|
|
rounddown(io_geom->stripe_nr, nr_data_stripes(map)));
|
2019-06-03 17:05:03 +08:00
|
|
|
|
2023-12-13 22:43:08 +08:00
|
|
|
ASSERT(io_geom->raid56_full_stripe_start + full_stripe_len > offset);
|
|
|
|
ASSERT(io_geom->raid56_full_stripe_start <= offset);
|
2019-06-03 17:05:03 +08:00
|
|
|
/*
|
2023-01-21 14:50:25 +08:00
|
|
|
* For writes to RAID56, allow to write a full stripe set, but
|
|
|
|
* no straddling of stripe sets.
|
2019-06-03 17:05:03 +08:00
|
|
|
*/
|
2023-12-13 22:43:08 +08:00
|
|
|
if (io_geom->op == BTRFS_MAP_WRITE)
|
|
|
|
return full_stripe_len - (offset - io_geom->raid56_full_stripe_start);
|
2019-06-03 17:05:03 +08:00
|
|
|
}
|
|
|
|
|
2023-01-21 14:50:25 +08:00
|
|
|
/*
|
|
|
|
* For other RAID types and for RAID56 reads, allow a single stripe (on
|
|
|
|
* a single disk).
|
|
|
|
*/
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_STRIPE_MASK)
|
2023-12-13 22:43:08 +08:00
|
|
|
return BTRFS_STRIPE_LEN - io_geom->stripe_offset;
|
2023-01-21 14:50:25 +08:00
|
|
|
return U64_MAX;
|
2019-06-03 17:05:03 +08:00
|
|
|
}
|
|
|
|
|
2023-12-13 22:43:07 +08:00
|
|
|
static int set_io_stripe(struct btrfs_fs_info *fs_info, u64 logical,
|
|
|
|
u64 *length, struct btrfs_io_stripe *dst,
|
|
|
|
struct btrfs_chunk_map *map,
|
|
|
|
struct btrfs_io_geometry *io_geom)
|
2022-08-06 16:03:29 +08:00
|
|
|
{
|
2023-12-13 22:43:07 +08:00
|
|
|
dst->dev = map->stripes[io_geom->stripe_index].dev;
|
2023-09-15 00:07:00 +08:00
|
|
|
|
2023-12-13 22:43:07 +08:00
|
|
|
if (io_geom->op == BTRFS_MAP_READ &&
|
|
|
|
btrfs_need_stripe_tree_update(fs_info, map->type))
|
2023-09-15 00:07:00 +08:00
|
|
|
return btrfs_get_raid_extent_offset(fs_info, logical, length,
|
2023-12-13 22:43:07 +08:00
|
|
|
map->type,
|
|
|
|
io_geom->stripe_index, dst);
|
2023-09-15 00:07:00 +08:00
|
|
|
|
2023-12-13 22:43:07 +08:00
|
|
|
dst->physical = map->stripes[io_geom->stripe_index].physical +
|
|
|
|
io_geom->stripe_offset +
|
|
|
|
btrfs_stripe_nr_to_offset(io_geom->stripe_nr);
|
2023-09-15 00:07:00 +08:00
|
|
|
return 0;
|
2022-08-06 16:03:29 +08:00
|
|
|
}
|
|
|
|
|
2023-12-13 22:42:56 +08:00
|
|
|
static bool is_single_device_io(struct btrfs_fs_info *fs_info,
|
|
|
|
const struct btrfs_io_stripe *smap,
|
|
|
|
const struct btrfs_chunk_map *map,
|
|
|
|
int num_alloc_stripes,
|
|
|
|
enum btrfs_map_op op, int mirror_num)
|
|
|
|
{
|
|
|
|
if (!smap)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (num_alloc_stripes != 1)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (btrfs_need_stripe_tree_update(fs_info, map->type) && op != BTRFS_MAP_READ)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if ((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2023-12-13 22:42:58 +08:00
|
|
|
static void map_blocks_raid0(const struct btrfs_chunk_map *map,
|
|
|
|
struct btrfs_io_geometry *io_geom)
|
|
|
|
{
|
|
|
|
io_geom->stripe_index = io_geom->stripe_nr % map->num_stripes;
|
|
|
|
io_geom->stripe_nr /= map->num_stripes;
|
|
|
|
if (io_geom->op == BTRFS_MAP_READ)
|
|
|
|
io_geom->mirror_num = 1;
|
|
|
|
}
|
|
|
|
|
2023-12-13 22:42:59 +08:00
|
|
|
static void map_blocks_raid1(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_chunk_map *map,
|
|
|
|
struct btrfs_io_geometry *io_geom,
|
|
|
|
bool dev_replace_is_ongoing)
|
|
|
|
{
|
|
|
|
if (io_geom->op != BTRFS_MAP_READ) {
|
|
|
|
io_geom->num_stripes = map->num_stripes;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (io_geom->mirror_num) {
|
|
|
|
io_geom->stripe_index = io_geom->mirror_num - 1;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
io_geom->stripe_index = find_live_mirror(fs_info, map, 0,
|
|
|
|
dev_replace_is_ongoing);
|
|
|
|
io_geom->mirror_num = io_geom->stripe_index + 1;
|
|
|
|
}
|
|
|
|
|
2023-12-13 22:43:00 +08:00
|
|
|
static void map_blocks_dup(const struct btrfs_chunk_map *map,
|
|
|
|
struct btrfs_io_geometry *io_geom)
|
|
|
|
{
|
|
|
|
if (io_geom->op != BTRFS_MAP_READ) {
|
|
|
|
io_geom->num_stripes = map->num_stripes;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (io_geom->mirror_num) {
|
|
|
|
io_geom->stripe_index = io_geom->mirror_num - 1;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
io_geom->mirror_num = 1;
|
|
|
|
}
|
|
|
|
|
2023-12-13 22:43:01 +08:00
|
|
|
static void map_blocks_raid10(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_chunk_map *map,
|
|
|
|
struct btrfs_io_geometry *io_geom,
|
|
|
|
bool dev_replace_is_ongoing)
|
|
|
|
{
|
|
|
|
u32 factor = map->num_stripes / map->sub_stripes;
|
|
|
|
int old_stripe_index;
|
|
|
|
|
|
|
|
io_geom->stripe_index = (io_geom->stripe_nr % factor) * map->sub_stripes;
|
|
|
|
io_geom->stripe_nr /= factor;
|
|
|
|
|
|
|
|
if (io_geom->op != BTRFS_MAP_READ) {
|
|
|
|
io_geom->num_stripes = map->sub_stripes;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (io_geom->mirror_num) {
|
|
|
|
io_geom->stripe_index += io_geom->mirror_num - 1;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
old_stripe_index = io_geom->stripe_index;
|
|
|
|
io_geom->stripe_index = find_live_mirror(fs_info, map,
|
|
|
|
io_geom->stripe_index,
|
|
|
|
dev_replace_is_ongoing);
|
|
|
|
io_geom->mirror_num = io_geom->stripe_index - old_stripe_index + 1;
|
|
|
|
}
|
|
|
|
|
2023-12-13 22:43:03 +08:00
|
|
|
static void map_blocks_raid56_write(struct btrfs_chunk_map *map,
|
|
|
|
struct btrfs_io_geometry *io_geom,
|
|
|
|
u64 logical, u64 *length)
|
|
|
|
{
|
|
|
|
int data_stripes = nr_data_stripes(map);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Needs full stripe mapping.
|
|
|
|
*
|
|
|
|
* Push stripe_nr back to the start of the full stripe For those cases
|
|
|
|
* needing a full stripe, @stripe_nr is the full stripe number.
|
|
|
|
*
|
|
|
|
* Originally we go raid56_full_stripe_start / full_stripe_len, but
|
|
|
|
* that can be expensive. Here we just divide @stripe_nr with
|
|
|
|
* @data_stripes.
|
|
|
|
*/
|
|
|
|
io_geom->stripe_nr /= data_stripes;
|
|
|
|
|
|
|
|
/* RAID[56] write or recovery. Return all stripes */
|
|
|
|
io_geom->num_stripes = map->num_stripes;
|
|
|
|
io_geom->max_errors = btrfs_chunk_max_errors(map);
|
|
|
|
|
|
|
|
/* Return the length to the full stripe end. */
|
|
|
|
*length = min(logical + *length,
|
|
|
|
io_geom->raid56_full_stripe_start + map->start +
|
|
|
|
btrfs_stripe_nr_to_offset(data_stripes)) -
|
|
|
|
logical;
|
|
|
|
io_geom->stripe_index = 0;
|
|
|
|
io_geom->stripe_offset = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void map_blocks_raid56_read(struct btrfs_chunk_map *map,
|
|
|
|
struct btrfs_io_geometry *io_geom)
|
|
|
|
{
|
|
|
|
int data_stripes = nr_data_stripes(map);
|
|
|
|
|
|
|
|
ASSERT(io_geom->mirror_num <= 1);
|
|
|
|
/* Just grab the data stripe directly. */
|
|
|
|
io_geom->stripe_index = io_geom->stripe_nr % data_stripes;
|
|
|
|
io_geom->stripe_nr /= data_stripes;
|
|
|
|
|
|
|
|
/* We distribute the parity blocks across stripes. */
|
|
|
|
io_geom->stripe_index =
|
|
|
|
(io_geom->stripe_nr + io_geom->stripe_index) % map->num_stripes;
|
|
|
|
|
|
|
|
if (io_geom->op == BTRFS_MAP_READ && io_geom->mirror_num < 1)
|
|
|
|
io_geom->mirror_num = 1;
|
|
|
|
}
|
|
|
|
|
2023-12-13 22:43:04 +08:00
|
|
|
static void map_blocks_single(const struct btrfs_chunk_map *map,
|
|
|
|
struct btrfs_io_geometry *io_geom)
|
|
|
|
{
|
|
|
|
io_geom->stripe_index = io_geom->stripe_nr % map->num_stripes;
|
|
|
|
io_geom->stripe_nr /= map->num_stripes;
|
|
|
|
io_geom->mirror_num = io_geom->stripe_index + 1;
|
|
|
|
}
|
|
|
|
|
2023-06-27 15:34:31 +08:00
|
|
|
/*
|
|
|
|
* Map one logical range to one or more physical ranges.
|
|
|
|
*
|
|
|
|
* @length: (Mandatory) mapped length of this run.
|
|
|
|
* One logical range can be split into different segments
|
|
|
|
* due to factors like zones and RAID0/5/6/10 stripe
|
|
|
|
* boundaries.
|
|
|
|
*
|
|
|
|
* @bioc_ret: (Mandatory) returned btrfs_io_context structure.
|
|
|
|
* which has one or more physical ranges (btrfs_io_stripe)
|
|
|
|
* recorded inside.
|
|
|
|
* Caller should call btrfs_put_bioc() to free it after use.
|
|
|
|
*
|
|
|
|
* @smap: (Optional) single physical range optimization.
|
|
|
|
* If the map request can be fulfilled by one single
|
|
|
|
* physical range, and this is parameter is not NULL,
|
|
|
|
* then @bioc_ret would be NULL, and @smap would be
|
|
|
|
* updated.
|
|
|
|
*
|
|
|
|
* @mirror_num_ret: (Mandatory) returned mirror number if the original
|
|
|
|
* value is 0.
|
|
|
|
*
|
|
|
|
* Mirror number 0 means to choose any live mirrors.
|
|
|
|
*
|
|
|
|
* For non-RAID56 profiles, non-zero mirror_num means
|
|
|
|
* the Nth mirror. (e.g. mirror_num 1 means the first
|
|
|
|
* copy).
|
|
|
|
*
|
|
|
|
* For RAID56 profile, mirror 1 means rebuild from P and
|
|
|
|
* the remaining data stripes.
|
|
|
|
*
|
|
|
|
* For RAID6 profile, mirror > 2 means mark another
|
|
|
|
* data/P stripe error and rebuild from the remaining
|
|
|
|
* stripes..
|
|
|
|
*/
|
2023-05-31 12:17:37 +08:00
|
|
|
int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
|
|
|
|
u64 logical, u64 *length,
|
|
|
|
struct btrfs_io_context **bioc_ret,
|
2023-09-17 18:06:21 +08:00
|
|
|
struct btrfs_io_stripe *smap, int *mirror_num_ret)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2023-12-13 22:42:57 +08:00
|
|
|
struct btrfs_io_geometry io_geom = { 0 };
|
2023-01-21 14:50:25 +08:00
|
|
|
u64 map_offset;
|
2011-12-01 12:55:47 +08:00
|
|
|
int ret = 0;
|
btrfs: do not use replace target device as an extra mirror
[BUG]
Currently btrfs can use dev-replace device as an extra mirror for
read-repair. But it can lead to NODATASUM corruption in the following
case:
There is a RAID1 data chunk, and dev-replace is running from
dev2 to dev0.
|//| = Replaced data
X X+1MB X+2MB
Dev 2: | | | <- Source dev
Dev 0: |///////| | <- Target dev
Then a read on dev 2 X+2MB happens.
And something wrong happened inside devid 2, causing an -EIO.
In that case, read-repair would try the next mirror, and since we can
use target device as an extra mirror, we will use that mirror instead.
But unfortunately since the read is beyond the current replace cursor,
we should not trust it at all, what we get would be just uninitialized
garbage.
But if this read is for NODATASUM range, then we just trust them and
cause data corruption.
[CAUSE]
We used to have some checks to make sure we only return such extra
mirror when the range is before our left cursor.
The first commit introducing this behavior is ad6d620e2a57 ("Btrfs:
allow repair code to include target disk when searching mirrors").
But later a fix, 22ab04e814f4 ("Btrfs: fix race between device replace
and chunk allocation") changed the behavior, to always let
btrfs_map_block() include the extra mirror to address a race in
dev-replace which can cause missing writes to target device.
This means, we lose the tracking of cursor for the extra mirror, thus
can lead to above corruption.
[FIX]
The extra mirror is never a reliable one, at the beginning of
dev-replace, the reliability is zero, while only at the end of the
replace it's a fully reliable mirror.
We either do the complex tracking, or never trust it.
IMHO it's much easier to maintain if we don't trust it at all, and the
extra mirror can only benefit for a limited period of time (during
replace).
Thus this patch would completely remove the ability to use target device
as an extra mirror.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-22 15:47:40 +08:00
|
|
|
int num_copies;
|
2021-09-15 15:17:16 +08:00
|
|
|
struct btrfs_io_context *bioc = NULL;
|
2012-11-06 21:43:46 +08:00
|
|
|
struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
|
|
|
|
int dev_replace_is_ongoing = 0;
|
2023-02-07 12:26:13 +08:00
|
|
|
u16 num_alloc_stripes;
|
2023-01-21 14:50:25 +08:00
|
|
|
u64 max_len;
|
2019-06-03 17:05:05 +08:00
|
|
|
|
2021-09-15 15:17:16 +08:00
|
|
|
ASSERT(bioc_ret);
|
2017-03-15 04:33:56 +08:00
|
|
|
|
2023-12-13 22:42:57 +08:00
|
|
|
io_geom.mirror_num = (mirror_num_ret ? *mirror_num_ret : 0);
|
|
|
|
io_geom.num_stripes = 1;
|
|
|
|
io_geom.stripe_index = 0;
|
|
|
|
io_geom.op = op;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_get_chunk_map(fs_info, logical, *length);
|
|
|
|
if (IS_ERR(map))
|
|
|
|
return PTR_ERR(map);
|
2021-01-27 21:57:27 +08:00
|
|
|
|
2024-08-13 19:36:40 +08:00
|
|
|
num_copies = btrfs_chunk_map_num_copies(map);
|
|
|
|
if (io_geom.mirror_num > num_copies)
|
|
|
|
return -EINVAL;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map_offset = logical - map->start;
|
2023-12-13 22:42:57 +08:00
|
|
|
io_geom.raid56_full_stripe_start = (u64)-1;
|
2023-12-13 22:43:08 +08:00
|
|
|
max_len = btrfs_max_io_len(map, map_offset, &io_geom);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
*length = min_t(u64, map->chunk_len - map_offset, max_len);
|
2008-03-26 04:50:33 +08:00
|
|
|
|
2018-09-07 22:11:23 +08:00
|
|
|
down_read(&dev_replace->rwsem);
|
2012-11-06 21:43:46 +08:00
|
|
|
dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
|
2018-04-05 07:41:06 +08:00
|
|
|
/*
|
|
|
|
* Hold the semaphore for read during the whole operation, write is
|
|
|
|
* requested at commit time but must wait.
|
|
|
|
*/
|
2012-11-06 21:43:46 +08:00
|
|
|
if (!dev_replace_is_ongoing)
|
2018-09-07 22:11:23 +08:00
|
|
|
up_read(&dev_replace->rwsem);
|
2012-11-06 21:43:46 +08:00
|
|
|
|
2023-12-13 22:43:05 +08:00
|
|
|
switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID0:
|
2023-12-13 22:42:58 +08:00
|
|
|
map_blocks_raid0(map, &io_geom);
|
2023-12-13 22:43:05 +08:00
|
|
|
break;
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID1:
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID1C3:
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID1C4:
|
2023-12-13 22:42:59 +08:00
|
|
|
map_blocks_raid1(fs_info, map, &io_geom, dev_replace_is_ongoing);
|
2023-12-13 22:43:05 +08:00
|
|
|
break;
|
|
|
|
case BTRFS_BLOCK_GROUP_DUP:
|
2023-12-13 22:43:00 +08:00
|
|
|
map_blocks_dup(map, &io_geom);
|
2023-12-13 22:43:05 +08:00
|
|
|
break;
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID10:
|
2023-12-13 22:43:01 +08:00
|
|
|
map_blocks_raid10(fs_info, map, &io_geom, dev_replace_is_ongoing);
|
2023-12-13 22:43:05 +08:00
|
|
|
break;
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID5:
|
|
|
|
case BTRFS_BLOCK_GROUP_RAID6:
|
2023-12-13 22:43:03 +08:00
|
|
|
if (op != BTRFS_MAP_READ || io_geom.mirror_num > 1)
|
|
|
|
map_blocks_raid56_write(map, &io_geom, logical, length);
|
|
|
|
else
|
|
|
|
map_blocks_raid56_read(map, &io_geom);
|
2023-12-13 22:43:05 +08:00
|
|
|
break;
|
|
|
|
default:
|
2008-04-04 04:29:03 +08:00
|
|
|
/*
|
2023-02-17 13:36:59 +08:00
|
|
|
* After this, stripe_nr is the number of stripes on this
|
2015-02-21 01:43:47 +08:00
|
|
|
* device we have to walk to find the data, and stripe_index is
|
|
|
|
* the number of our device in the stripe array
|
2008-04-04 04:29:03 +08:00
|
|
|
*/
|
2023-12-13 22:43:04 +08:00
|
|
|
map_blocks_single(map, &io_geom);
|
2023-12-13 22:43:05 +08:00
|
|
|
break;
|
2008-04-04 04:29:03 +08:00
|
|
|
}
|
2023-12-13 22:42:57 +08:00
|
|
|
if (io_geom.stripe_index >= map->num_stripes) {
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_crit(fs_info,
|
|
|
|
"stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
|
2023-12-13 22:42:57 +08:00
|
|
|
io_geom.stripe_index, map->num_stripes);
|
2016-04-13 00:54:40 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
2008-04-10 04:28:12 +08:00
|
|
|
|
2023-12-13 22:42:57 +08:00
|
|
|
num_alloc_stripes = io_geom.num_stripes;
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
|
|
|
|
op != BTRFS_MAP_READ)
|
|
|
|
/*
|
|
|
|
* For replace case, we need to add extra stripes for extra
|
|
|
|
* duplicated stripes.
|
|
|
|
*
|
|
|
|
* For both WRITE and GET_READ_MIRRORS, we may have at most
|
|
|
|
* 2 more stripes (DUP types, otherwise 1).
|
|
|
|
*/
|
|
|
|
num_alloc_stripes += 2;
|
2014-11-14 16:06:25 +08:00
|
|
|
|
2022-08-06 16:03:29 +08:00
|
|
|
/*
|
|
|
|
* If this I/O maps to a single device, try to return the device and
|
|
|
|
* physical block information on the stack instead of allocating an
|
|
|
|
* I/O context structure.
|
|
|
|
*/
|
2023-12-13 22:42:56 +08:00
|
|
|
if (is_single_device_io(fs_info, smap, map, num_alloc_stripes, op,
|
2023-12-13 22:42:57 +08:00
|
|
|
io_geom.mirror_num)) {
|
2023-12-13 22:43:07 +08:00
|
|
|
ret = set_io_stripe(fs_info, logical, length, smap, map, &io_geom);
|
2023-06-27 14:13:23 +08:00
|
|
|
if (mirror_num_ret)
|
2023-12-13 22:42:57 +08:00
|
|
|
*mirror_num_ret = io_geom.mirror_num;
|
2022-08-06 16:03:29 +08:00
|
|
|
*bioc_ret = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2023-09-15 00:06:58 +08:00
|
|
|
bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes);
|
2021-09-15 15:17:16 +08:00
|
|
|
if (!bioc) {
|
2011-12-01 12:55:47 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-07 12:26:14 +08:00
|
|
|
bioc->map_type = map->type;
|
2020-07-02 21:46:41 +08:00
|
|
|
|
2023-02-17 13:37:03 +08:00
|
|
|
/*
|
|
|
|
* For RAID56 full map, we need to make sure the stripes[] follows the
|
|
|
|
* rule that data stripes are all ordered, then followed with P and Q
|
|
|
|
* (if we have).
|
|
|
|
*
|
|
|
|
* It's still mostly the same as other profiles, just with extra rotation.
|
|
|
|
*/
|
2023-09-17 18:06:21 +08:00
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK &&
|
2023-12-13 22:42:57 +08:00
|
|
|
(op != BTRFS_MAP_READ || io_geom.mirror_num > 1)) {
|
2023-02-17 13:37:03 +08:00
|
|
|
/*
|
|
|
|
* For RAID56 @stripe_nr is already the number of full stripes
|
|
|
|
* before us, which is also the rotation value (needs to modulo
|
|
|
|
* with num_stripes).
|
|
|
|
*
|
|
|
|
* In this case, we just add @stripe_nr with @i, then do the
|
|
|
|
* modulo, to reduce one modulo call.
|
|
|
|
*/
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
bioc->full_stripe_logical = map->start +
|
2023-12-13 22:43:02 +08:00
|
|
|
btrfs_stripe_nr_to_offset(io_geom.stripe_nr *
|
|
|
|
nr_data_stripes(map));
|
2023-12-13 22:42:57 +08:00
|
|
|
for (int i = 0; i < io_geom.num_stripes; i++) {
|
2023-12-13 22:43:06 +08:00
|
|
|
struct btrfs_io_stripe *dst = &bioc->stripes[i];
|
|
|
|
u32 stripe_index;
|
|
|
|
|
|
|
|
stripe_index = (i + io_geom.stripe_nr) % io_geom.num_stripes;
|
|
|
|
dst->dev = map->stripes[stripe_index].dev;
|
|
|
|
dst->physical =
|
|
|
|
map->stripes[stripe_index].physical +
|
|
|
|
io_geom.stripe_offset +
|
|
|
|
btrfs_stripe_nr_to_offset(io_geom.stripe_nr);
|
2023-09-15 00:07:00 +08:00
|
|
|
}
|
2023-02-17 13:37:03 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* For all other non-RAID56 profiles, just copy the target
|
|
|
|
* stripe into the bioc.
|
|
|
|
*/
|
2024-05-21 01:46:44 +08:00
|
|
|
for (int i = 0; i < io_geom.num_stripes; i++) {
|
2023-12-13 22:43:07 +08:00
|
|
|
ret = set_io_stripe(fs_info, logical, length,
|
|
|
|
&bioc->stripes[i], map, &io_geom);
|
2023-09-15 00:07:00 +08:00
|
|
|
if (ret < 0)
|
|
|
|
break;
|
2023-12-13 22:42:57 +08:00
|
|
|
io_geom.stripe_index++;
|
2023-02-17 13:37:03 +08:00
|
|
|
}
|
2008-03-26 04:50:33 +08:00
|
|
|
}
|
2011-12-01 12:55:47 +08:00
|
|
|
|
2023-09-15 00:07:00 +08:00
|
|
|
if (ret) {
|
|
|
|
*bioc_ret = NULL;
|
|
|
|
btrfs_put_bioc(bioc);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2023-05-31 12:17:39 +08:00
|
|
|
if (op != BTRFS_MAP_READ)
|
2023-12-13 22:42:57 +08:00
|
|
|
io_geom.max_errors = btrfs_chunk_max_errors(map);
|
2011-12-01 12:55:47 +08:00
|
|
|
|
2017-03-15 04:33:58 +08:00
|
|
|
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
|
2023-05-31 12:17:39 +08:00
|
|
|
op != BTRFS_MAP_READ) {
|
2024-05-08 19:14:47 +08:00
|
|
|
handle_ops_on_dev_replace(bioc, dev_replace, logical, &io_geom);
|
2012-11-06 21:43:46 +08:00
|
|
|
}
|
|
|
|
|
2021-09-15 15:17:16 +08:00
|
|
|
*bioc_ret = bioc;
|
2023-12-13 22:42:57 +08:00
|
|
|
bioc->num_stripes = io_geom.num_stripes;
|
|
|
|
bioc->max_errors = io_geom.max_errors;
|
|
|
|
bioc->mirror_num = io_geom.mirror_num;
|
2012-11-06 22:06:47 +08:00
|
|
|
|
2008-04-10 04:28:12 +08:00
|
|
|
out:
|
Btrfs: fix lockdep deadlock warning due to dev_replace
Xfstests btrfs/011 complains about a deadlock warning,
[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955] CPU0 CPU1
[ 1226.652955] ---- ----
[ 1226.652955] lock(&fs_info->dev_replace.lock);
[ 1226.652955] local_irq_disable();
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955] lock(&found->groups_sem);
[ 1226.652955] <Interrupt>
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***
Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.
The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock
However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to
[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.
To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.
With this, btrfs/011 no more produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-07-17 16:49:19 +08:00
|
|
|
if (dev_replace_is_ongoing) {
|
2018-04-05 07:41:06 +08:00
|
|
|
lockdep_assert_held(&dev_replace->rwsem);
|
|
|
|
/* Unlock and let waiting writers proceed */
|
2018-09-07 22:11:23 +08:00
|
|
|
up_read(&dev_replace->rwsem);
|
Btrfs: fix lockdep deadlock warning due to dev_replace
Xfstests btrfs/011 complains about a deadlock warning,
[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955] CPU0 CPU1
[ 1226.652955] ---- ----
[ 1226.652955] lock(&fs_info->dev_replace.lock);
[ 1226.652955] local_irq_disable();
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955] lock(&found->groups_sem);
[ 1226.652955] <Interrupt>
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***
Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.
The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock
However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to
[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.
To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.
With this, btrfs/011 no more produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-07-17 16:49:19 +08:00
|
|
|
}
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2011-12-01 12:55:47 +08:00
|
|
|
return ret;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
2021-10-06 04:12:42 +08:00
|
|
|
static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
|
|
|
|
const struct btrfs_fs_devices *fs_devices)
|
|
|
|
{
|
|
|
|
if (args->fsid == NULL)
|
|
|
|
return true;
|
|
|
|
if (memcmp(fs_devices->metadata_uuid, args->fsid, BTRFS_FSID_SIZE) == 0)
|
|
|
|
return true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool dev_args_match_device(const struct btrfs_dev_lookup_args *args,
|
|
|
|
const struct btrfs_device *device)
|
|
|
|
{
|
2022-11-03 16:33:01 +08:00
|
|
|
if (args->missing) {
|
|
|
|
if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state) &&
|
|
|
|
!device->bdev)
|
|
|
|
return true;
|
|
|
|
return false;
|
|
|
|
}
|
2021-10-06 04:12:42 +08:00
|
|
|
|
2022-11-03 16:33:01 +08:00
|
|
|
if (device->devid != args->devid)
|
2021-10-06 04:12:42 +08:00
|
|
|
return false;
|
|
|
|
if (args->uuid && memcmp(device->uuid, args->uuid, BTRFS_UUID_SIZE) != 0)
|
|
|
|
return false;
|
2022-11-03 16:33:01 +08:00
|
|
|
return true;
|
2021-10-06 04:12:42 +08:00
|
|
|
}
|
|
|
|
|
2019-01-19 14:48:55 +08:00
|
|
|
/*
|
|
|
|
* Find a device specified by @devid or @uuid in the list of @fs_devices, or
|
|
|
|
* return NULL.
|
|
|
|
*
|
|
|
|
* If devid and uuid are both specified, the match must be exact, otherwise
|
|
|
|
* only devid is used.
|
|
|
|
*/
|
2021-10-06 04:12:42 +08:00
|
|
|
struct btrfs_device *btrfs_find_device(const struct btrfs_fs_devices *fs_devices,
|
|
|
|
const struct btrfs_dev_lookup_args *args)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2008-11-18 10:11:30 +08:00
|
|
|
struct btrfs_device *device;
|
2020-07-16 15:25:33 +08:00
|
|
|
struct btrfs_fs_devices *seed_devs;
|
|
|
|
|
2021-10-06 04:12:42 +08:00
|
|
|
if (dev_args_match_fs_devices(args, fs_devices)) {
|
2020-07-16 15:25:33 +08:00
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list) {
|
2021-10-06 04:12:42 +08:00
|
|
|
if (dev_args_match_device(args, device))
|
2020-07-16 15:25:33 +08:00
|
|
|
return device;
|
|
|
|
}
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2020-07-16 15:25:33 +08:00
|
|
|
list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list) {
|
2021-10-06 04:12:42 +08:00
|
|
|
if (!dev_args_match_fs_devices(args, seed_devs))
|
|
|
|
continue;
|
|
|
|
list_for_each_entry(device, &seed_devs->devices, dev_list) {
|
|
|
|
if (dev_args_match_device(args, device))
|
|
|
|
return device;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
}
|
2020-07-16 15:25:33 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
return NULL;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static struct btrfs_device *add_missing_dev(struct btrfs_fs_devices *fs_devices,
|
2008-05-14 01:46:40 +08:00
|
|
|
u64 devid, u8 *dev_uuid)
|
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
2020-08-31 22:52:42 +08:00
|
|
|
unsigned int nofs_flag;
|
2008-05-14 01:46:40 +08:00
|
|
|
|
2020-08-31 22:52:42 +08:00
|
|
|
/*
|
|
|
|
* We call this under the chunk_mutex, so we want to use NOFS for this
|
|
|
|
* allocation, however we don't want to change btrfs_alloc_device() to
|
|
|
|
* always do NOFS because we use it in a lot of other GFP_KERNEL safe
|
|
|
|
* places.
|
|
|
|
*/
|
2022-11-07 23:07:17 +08:00
|
|
|
|
2020-08-31 22:52:42 +08:00
|
|
|
nofs_flag = memalloc_nofs_save();
|
2022-11-07 23:07:17 +08:00
|
|
|
device = btrfs_alloc_device(NULL, &devid, dev_uuid, NULL);
|
2020-08-31 22:52:42 +08:00
|
|
|
memalloc_nofs_restore(nofs_flag);
|
2013-08-23 18:20:17 +08:00
|
|
|
if (IS_ERR(device))
|
2017-10-11 12:46:18 +08:00
|
|
|
return device;
|
2013-08-23 18:20:17 +08:00
|
|
|
|
|
|
|
list_add(&device->dev_list, &fs_devices->devices);
|
2008-12-12 23:03:26 +08:00
|
|
|
device->fs_devices = fs_devices;
|
2008-05-14 01:46:40 +08:00
|
|
|
fs_devices->num_devices++;
|
2013-08-23 18:20:17 +08:00
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
|
2010-12-14 03:56:23 +08:00
|
|
|
fs_devices->missing_devices++;
|
2013-08-23 18:20:17 +08:00
|
|
|
|
2008-05-14 01:46:40 +08:00
|
|
|
return device;
|
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* Allocate new device struct, set up devid and UUID.
|
|
|
|
*
|
2013-08-23 18:20:17 +08:00
|
|
|
* @fs_info: used only for generating a new devid, can be NULL if
|
|
|
|
* devid is provided (i.e. @devid != NULL).
|
|
|
|
* @devid: a pointer to devid for this device. If NULL a new devid
|
|
|
|
* is generated.
|
|
|
|
* @uuid: a pointer to UUID for this device. If NULL a new UUID
|
|
|
|
* is generated.
|
2022-11-07 23:07:17 +08:00
|
|
|
* @path: a pointer to device path if available, NULL otherwise.
|
2013-08-23 18:20:17 +08:00
|
|
|
*
|
|
|
|
* Return: a pointer to a new &struct btrfs_device on success; ERR_PTR()
|
2017-10-31 01:10:25 +08:00
|
|
|
* on error. Returned struct is not linked onto any lists and must be
|
2018-03-20 22:47:33 +08:00
|
|
|
* destroyed with btrfs_free_device.
|
2013-08-23 18:20:17 +08:00
|
|
|
*/
|
|
|
|
struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
|
2022-11-07 23:07:17 +08:00
|
|
|
const u64 *devid, const u8 *uuid,
|
|
|
|
const char *path)
|
2013-08-23 18:20:17 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *dev;
|
|
|
|
u64 tmp;
|
|
|
|
|
2013-10-31 13:00:08 +08:00
|
|
|
if (WARN_ON(!devid && !fs_info))
|
2013-08-23 18:20:17 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
2021-07-26 20:15:21 +08:00
|
|
|
dev = kzalloc(sizeof(*dev), GFP_KERNEL);
|
|
|
|
if (!dev)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&dev->dev_list);
|
|
|
|
INIT_LIST_HEAD(&dev->dev_alloc_list);
|
|
|
|
INIT_LIST_HEAD(&dev->post_commit_list);
|
|
|
|
|
|
|
|
atomic_set(&dev->dev_stats_ccnt, 0);
|
|
|
|
btrfs_device_data_ordered_init(dev);
|
2022-10-28 08:47:06 +08:00
|
|
|
extent_io_tree_init(fs_info, &dev->alloc_state, IO_TREE_DEVICE_ALLOC_STATE);
|
2013-08-23 18:20:17 +08:00
|
|
|
|
|
|
|
if (devid)
|
|
|
|
tmp = *devid;
|
|
|
|
else {
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = find_next_devid(fs_info, &tmp);
|
|
|
|
if (ret) {
|
2018-03-20 22:47:33 +08:00
|
|
|
btrfs_free_device(dev);
|
2013-08-23 18:20:17 +08:00
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
dev->devid = tmp;
|
|
|
|
|
|
|
|
if (uuid)
|
|
|
|
memcpy(dev->uuid, uuid, BTRFS_UUID_SIZE);
|
|
|
|
else
|
|
|
|
generate_random_uuid(dev->uuid);
|
|
|
|
|
2022-11-07 23:07:17 +08:00
|
|
|
if (path) {
|
|
|
|
struct rcu_string *name;
|
|
|
|
|
|
|
|
name = rcu_string_strdup(path, GFP_KERNEL);
|
|
|
|
if (!name) {
|
|
|
|
btrfs_free_device(dev);
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
rcu_assign_pointer(dev->name, name);
|
|
|
|
}
|
|
|
|
|
2013-08-23 18:20:17 +08:00
|
|
|
return dev;
|
|
|
|
}
|
|
|
|
|
2017-10-09 11:07:45 +08:00
|
|
|
static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
|
2017-10-09 11:07:46 +08:00
|
|
|
u64 devid, u8 *uuid, bool error)
|
2017-10-09 11:07:45 +08:00
|
|
|
{
|
2017-10-09 11:07:46 +08:00
|
|
|
if (error)
|
|
|
|
btrfs_err_rl(fs_info, "devid %llu uuid %pU is missing",
|
|
|
|
devid, uuid);
|
|
|
|
else
|
|
|
|
btrfs_warn_rl(fs_info, "devid %llu uuid %pU is missing",
|
|
|
|
devid, uuid);
|
2017-10-09 11:07:45 +08:00
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map)
|
2019-03-25 20:31:25 +08:00
|
|
|
{
|
2022-05-13 16:34:28 +08:00
|
|
|
const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
|
2019-05-14 07:59:54 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
return div_u64(map->chunk_len, data_stripes);
|
2019-03-25 20:31:25 +08:00
|
|
|
}
|
|
|
|
|
2021-02-25 09:18:14 +08:00
|
|
|
#if BITS_PER_LONG == 32
|
|
|
|
/*
|
|
|
|
* Due to page cache limit, metadata beyond BTRFS_32BIT_MAX_FILE_SIZE
|
|
|
|
* can't be accessed on 32bit systems.
|
|
|
|
*
|
|
|
|
* This function do mount time check to reject the fs if it already has
|
|
|
|
* metadata chunk beyond that limit.
|
|
|
|
*/
|
|
|
|
static int check_32bit_meta_chunk(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 length, u64 type)
|
|
|
|
{
|
|
|
|
if (!(type & BTRFS_BLOCK_GROUP_METADATA))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (logical + length < MAX_LFS_FILESIZE)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
btrfs_err_32bit_limit(fs_info);
|
|
|
|
return -EOVERFLOW;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is to give early warning for any metadata chunk reaching
|
|
|
|
* BTRFS_32BIT_EARLY_WARN_THRESHOLD.
|
|
|
|
* Although we can still access the metadata, it's not going to be possible
|
|
|
|
* once the limit is reached.
|
|
|
|
*/
|
|
|
|
static void warn_32bit_meta_chunk(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 logical, u64 length, u64 type)
|
|
|
|
{
|
|
|
|
if (!(type & BTRFS_BLOCK_GROUP_METADATA))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (logical + length < BTRFS_32BIT_EARLY_WARN_THRESHOLD)
|
|
|
|
return;
|
|
|
|
|
|
|
|
btrfs_warn_32bit_limit(fs_info);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2022-01-12 00:00:26 +08:00
|
|
|
static struct btrfs_device *handle_missing_device(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 devid, u8 *uuid)
|
|
|
|
{
|
|
|
|
struct btrfs_device *dev;
|
|
|
|
|
|
|
|
if (!btrfs_test_opt(fs_info, DEGRADED)) {
|
|
|
|
btrfs_report_missing_device(fs_info, devid, uuid, true);
|
|
|
|
return ERR_PTR(-ENOENT);
|
|
|
|
}
|
|
|
|
|
|
|
|
dev = add_missing_dev(fs_info->fs_devices, devid, uuid);
|
|
|
|
if (IS_ERR(dev)) {
|
|
|
|
btrfs_err(fs_info, "failed to init missing device %llu: %ld",
|
|
|
|
devid, PTR_ERR(dev));
|
|
|
|
return dev;
|
|
|
|
}
|
|
|
|
btrfs_report_missing_device(fs_info, devid, uuid, false);
|
|
|
|
|
|
|
|
return dev;
|
|
|
|
}
|
|
|
|
|
2019-03-20 23:43:07 +08:00
|
|
|
static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
|
2016-06-04 03:05:15 +08:00
|
|
|
struct btrfs_chunk *chunk)
|
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
BTRFS_DEV_LOOKUP_ARGS(args);
|
2019-03-20 23:43:07 +08:00
|
|
|
struct btrfs_fs_info *fs_info = leaf->fs_info;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2016-06-04 03:05:15 +08:00
|
|
|
u64 logical;
|
|
|
|
u64 length;
|
|
|
|
u64 devid;
|
2021-02-25 09:18:14 +08:00
|
|
|
u64 type;
|
2016-06-04 03:05:15 +08:00
|
|
|
u8 uuid[BTRFS_UUID_SIZE];
|
2022-10-21 08:43:45 +08:00
|
|
|
int index;
|
2016-06-04 03:05:15 +08:00
|
|
|
int num_stripes;
|
|
|
|
int ret;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
logical = key->offset;
|
|
|
|
length = btrfs_chunk_length(leaf, chunk);
|
2021-02-25 09:18:14 +08:00
|
|
|
type = btrfs_chunk_type(leaf, chunk);
|
2022-10-21 08:43:45 +08:00
|
|
|
index = btrfs_bg_flags_to_raid_index(type);
|
2016-06-04 03:05:15 +08:00
|
|
|
num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
|
|
|
|
|
2021-02-25 09:18:14 +08:00
|
|
|
#if BITS_PER_LONG == 32
|
|
|
|
ret = check_32bit_meta_chunk(fs_info, logical, length, type);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
warn_32bit_meta_chunk(fs_info, logical, length, type);
|
|
|
|
#endif
|
|
|
|
|
2019-03-20 13:42:33 +08:00
|
|
|
/*
|
|
|
|
* Only need to verify chunk item if we're reading from sys chunk array,
|
|
|
|
* as chunk item in tree block is already verified by tree-checker.
|
|
|
|
*/
|
|
|
|
if (leaf->start == BTRFS_SUPER_INFO_OFFSET) {
|
2019-03-20 23:40:48 +08:00
|
|
|
ret = btrfs_check_chunk_valid(leaf, chunk, logical);
|
2019-03-20 13:42:33 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
2008-05-07 23:43:44 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_find_chunk_map(fs_info, logical, 1);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
/* already mapped? */
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (map && map->start <= logical && map->start + map->chunk_len > logical) {
|
|
|
|
btrfs_free_chunk_map(map);
|
2008-03-25 03:01:56 +08:00
|
|
|
return 0;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
} else if (map) {
|
|
|
|
btrfs_free_chunk_map(map);
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_alloc_chunk_map(num_stripes, GFP_NOFS);
|
|
|
|
if (!map)
|
2008-03-25 03:01:56 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map->start = logical;
|
|
|
|
map->chunk_len = length;
|
2008-03-26 04:50:33 +08:00
|
|
|
map->num_stripes = num_stripes;
|
|
|
|
map->io_width = btrfs_chunk_io_width(leaf, chunk);
|
|
|
|
map->io_align = btrfs_chunk_io_align(leaf, chunk);
|
2021-02-25 09:18:14 +08:00
|
|
|
map->type = type;
|
2022-10-21 08:43:45 +08:00
|
|
|
/*
|
|
|
|
* We can't use the sub_stripes value, as for profiles other than
|
|
|
|
* RAID10, they may have 0 as sub_stripes for filesystems created by
|
|
|
|
* older mkfs (<v5.4).
|
|
|
|
* In that case, it can cause divide-by-zero errors later.
|
|
|
|
* Since currently sub_stripes is fixed for each profile, let's
|
|
|
|
* use the trusted value instead.
|
|
|
|
*/
|
|
|
|
map->sub_stripes = btrfs_raid_array[index].sub_stripes;
|
2018-08-01 10:37:19 +08:00
|
|
|
map->verified_stripes = 0;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map->stripe_size = btrfs_calc_stripe_length(map);
|
2008-03-26 04:50:33 +08:00
|
|
|
for (i = 0; i < num_stripes; i++) {
|
|
|
|
map->stripes[i].physical =
|
|
|
|
btrfs_stripe_offset_nr(leaf, chunk, i);
|
|
|
|
devid = btrfs_stripe_devid_nr(leaf, chunk, i);
|
2021-10-06 04:12:42 +08:00
|
|
|
args.devid = devid;
|
2008-04-18 22:29:38 +08:00
|
|
|
read_extent_buffer(leaf, uuid, (unsigned long)
|
|
|
|
btrfs_stripe_dev_uuid_nr(chunk, i),
|
|
|
|
BTRFS_UUID_SIZE);
|
2021-10-06 04:12:42 +08:00
|
|
|
args.uuid = uuid;
|
|
|
|
map->stripes[i].dev = btrfs_find_device(fs_info->fs_devices, &args);
|
2008-05-14 01:46:40 +08:00
|
|
|
if (!map->stripes[i].dev) {
|
2022-01-12 00:00:26 +08:00
|
|
|
map->stripes[i].dev = handle_missing_device(fs_info,
|
|
|
|
devid, uuid);
|
2017-10-11 12:46:18 +08:00
|
|
|
if (IS_ERR(map->stripes[i].dev)) {
|
2022-11-23 22:39:45 +08:00
|
|
|
ret = PTR_ERR(map->stripes[i].dev);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2022-11-23 22:39:45 +08:00
|
|
|
return ret;
|
2008-05-14 01:46:40 +08:00
|
|
|
}
|
|
|
|
}
|
2022-01-12 00:00:26 +08:00
|
|
|
|
2017-12-04 12:54:53 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
|
|
|
|
&(map->stripes[i].dev->dev_state));
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
ret = btrfs_add_chunk_map(fs_info, map);
|
2018-08-01 10:37:20 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"failed to add chunk map, start=%llu len=%llu: %d",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map->start, map->chunk_len, ret);
|
2018-08-01 10:37:20 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2018-08-01 10:37:20 +08:00
|
|
|
return ret;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
2012-03-01 21:56:26 +08:00
|
|
|
static void fill_device_from_item(struct extent_buffer *leaf,
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_dev_item *dev_item,
|
|
|
|
struct btrfs_device *device)
|
|
|
|
{
|
|
|
|
unsigned long ptr;
|
|
|
|
|
|
|
|
device->devid = btrfs_device_id(leaf, dev_item);
|
2009-04-27 19:29:03 +08:00
|
|
|
device->disk_total_bytes = btrfs_device_total_bytes(leaf, dev_item);
|
|
|
|
device->total_bytes = device->disk_total_bytes;
|
2014-09-03 21:35:33 +08:00
|
|
|
device->commit_total_bytes = device->disk_total_bytes;
|
2008-03-25 03:01:56 +08:00
|
|
|
device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
|
2014-09-03 21:35:34 +08:00
|
|
|
device->commit_bytes_used = device->bytes_used;
|
2008-03-25 03:01:56 +08:00
|
|
|
device->type = btrfs_device_type(leaf, dev_item);
|
|
|
|
device->io_align = btrfs_device_io_align(leaf, dev_item);
|
|
|
|
device->io_width = btrfs_device_io_width(leaf, dev_item);
|
|
|
|
device->sector_size = btrfs_device_sector_size(leaf, dev_item);
|
2012-11-06 20:15:27 +08:00
|
|
|
WARN_ON(device->devid == BTRFS_DEV_REPLACE_DEVID);
|
2017-12-04 12:54:55 +08:00
|
|
|
clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2013-08-20 19:20:11 +08:00
|
|
|
ptr = btrfs_device_uuid(dev_item);
|
2008-04-16 03:41:47 +08:00
|
|
|
read_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static struct btrfs_fs_devices *open_seed_devices(struct btrfs_fs_info *fs_info,
|
2014-09-03 21:35:46 +08:00
|
|
|
u8 *fsid)
|
2008-11-18 10:11:30 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices;
|
|
|
|
int ret;
|
|
|
|
|
2018-03-16 09:21:22 +08:00
|
|
|
lockdep_assert_held(&uuid_mutex);
|
2017-06-14 08:48:07 +08:00
|
|
|
ASSERT(fsid);
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2020-08-12 22:04:36 +08:00
|
|
|
/* This will match only for multi-device seed fs */
|
2020-07-16 15:25:33 +08:00
|
|
|
list_for_each_entry(fs_devices, &fs_info->fs_devices->seed_list, seed_list)
|
2017-07-29 17:50:09 +08:00
|
|
|
if (!memcmp(fs_devices->fsid, fsid, BTRFS_FSID_SIZE))
|
2014-09-03 21:35:46 +08:00
|
|
|
return fs_devices;
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2018-10-30 22:43:23 +08:00
|
|
|
fs_devices = find_fsid(fsid, NULL);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (!fs_devices) {
|
2016-06-23 06:54:23 +08:00
|
|
|
if (!btrfs_test_opt(fs_info, DEGRADED))
|
2014-09-03 21:35:46 +08:00
|
|
|
return ERR_PTR(-ENOENT);
|
|
|
|
|
2023-08-23 22:52:13 +08:00
|
|
|
fs_devices = alloc_fs_devices(fsid);
|
2014-09-03 21:35:46 +08:00
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return fs_devices;
|
|
|
|
|
2019-11-13 18:27:27 +08:00
|
|
|
fs_devices->seeding = true;
|
2014-09-03 21:35:46 +08:00
|
|
|
fs_devices->opened = 1;
|
|
|
|
return fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2008-12-12 23:03:26 +08:00
|
|
|
|
2020-08-12 22:04:36 +08:00
|
|
|
/*
|
|
|
|
* Upon first call for a seed fs fsid, just create a private copy of the
|
|
|
|
* respective fs_devices and anchor it at fs_info->fs_devices->seed_list
|
|
|
|
*/
|
2008-12-12 23:03:26 +08:00
|
|
|
fs_devices = clone_fs_devices(fs_devices);
|
2014-09-03 21:35:46 +08:00
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2023-06-08 19:02:55 +08:00
|
|
|
ret = open_fs_devices(fs_devices, BLK_OPEN_READ, fs_info->bdev_holder);
|
2012-04-14 17:24:33 +08:00
|
|
|
if (ret) {
|
|
|
|
free_fs_devices(fs_devices);
|
2020-09-05 01:34:35 +08:00
|
|
|
return ERR_PTR(ret);
|
2012-04-14 17:24:33 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
|
|
|
if (!fs_devices->seeding) {
|
2018-04-12 10:29:27 +08:00
|
|
|
close_fs_devices(fs_devices);
|
2008-12-12 23:03:26 +08:00
|
|
|
free_fs_devices(fs_devices);
|
2020-09-05 01:34:35 +08:00
|
|
|
return ERR_PTR(-EINVAL);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2020-07-16 15:25:33 +08:00
|
|
|
list_add(&fs_devices->seed_list, &fs_info->fs_devices->seed_list);
|
2020-09-05 01:34:35 +08:00
|
|
|
|
2014-09-03 21:35:46 +08:00
|
|
|
return fs_devices;
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2019-03-20 23:45:15 +08:00
|
|
|
static int read_one_dev(struct extent_buffer *leaf,
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_dev_item *dev_item)
|
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
BTRFS_DEV_LOOKUP_ARGS(args);
|
2019-03-20 23:45:15 +08:00
|
|
|
struct btrfs_fs_info *fs_info = leaf->fs_info;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_device *device;
|
|
|
|
u64 devid;
|
|
|
|
int ret;
|
2017-07-29 17:50:09 +08:00
|
|
|
u8 fs_uuid[BTRFS_FSID_SIZE];
|
2008-04-18 22:29:38 +08:00
|
|
|
u8 dev_uuid[BTRFS_UUID_SIZE];
|
|
|
|
|
2022-06-22 00:40:48 +08:00
|
|
|
devid = btrfs_device_id(leaf, dev_item);
|
|
|
|
args.devid = devid;
|
2013-08-20 19:20:11 +08:00
|
|
|
read_extent_buffer(leaf, dev_uuid, btrfs_device_uuid(dev_item),
|
2008-04-18 22:29:38 +08:00
|
|
|
BTRFS_UUID_SIZE);
|
2013-08-20 19:20:12 +08:00
|
|
|
read_extent_buffer(leaf, fs_uuid, btrfs_device_fsid(dev_item),
|
2017-07-29 17:50:09 +08:00
|
|
|
BTRFS_FSID_SIZE);
|
2021-10-06 04:12:42 +08:00
|
|
|
args.uuid = dev_uuid;
|
|
|
|
args.fsid = fs_uuid;
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2018-10-30 22:43:24 +08:00
|
|
|
if (memcmp(fs_uuid, fs_devices->metadata_uuid, BTRFS_FSID_SIZE)) {
|
2016-06-23 06:54:24 +08:00
|
|
|
fs_devices = open_seed_devices(fs_info, fs_uuid);
|
2014-09-03 21:35:46 +08:00
|
|
|
if (IS_ERR(fs_devices))
|
|
|
|
return PTR_ERR(fs_devices);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2021-10-06 04:12:42 +08:00
|
|
|
device = btrfs_find_device(fs_info->fs_devices, &args);
|
2014-09-03 21:35:46 +08:00
|
|
|
if (!device) {
|
2017-03-09 09:34:42 +08:00
|
|
|
if (!btrfs_test_opt(fs_info, DEGRADED)) {
|
2017-10-09 11:07:46 +08:00
|
|
|
btrfs_report_missing_device(fs_info, devid,
|
|
|
|
dev_uuid, true);
|
2017-10-09 11:07:44 +08:00
|
|
|
return -ENOENT;
|
2017-03-09 09:34:42 +08:00
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
device = add_missing_dev(fs_devices, devid, dev_uuid);
|
2017-10-11 12:46:18 +08:00
|
|
|
if (IS_ERR(device)) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"failed to add missing dev %llu: %ld",
|
|
|
|
devid, PTR_ERR(device));
|
|
|
|
return PTR_ERR(device);
|
|
|
|
}
|
2017-10-09 11:07:46 +08:00
|
|
|
btrfs_report_missing_device(fs_info, devid, dev_uuid, false);
|
2014-09-03 21:35:46 +08:00
|
|
|
} else {
|
2017-03-09 09:34:42 +08:00
|
|
|
if (!device->bdev) {
|
2017-10-09 11:07:46 +08:00
|
|
|
if (!btrfs_test_opt(fs_info, DEGRADED)) {
|
|
|
|
btrfs_report_missing_device(fs_info,
|
|
|
|
devid, dev_uuid, true);
|
2017-10-09 11:07:44 +08:00
|
|
|
return -ENOENT;
|
2017-10-09 11:07:46 +08:00
|
|
|
}
|
|
|
|
btrfs_report_missing_device(fs_info, devid,
|
|
|
|
dev_uuid, false);
|
2017-03-09 09:34:42 +08:00
|
|
|
}
|
2014-09-03 21:35:46 +08:00
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
if (!device->bdev &&
|
|
|
|
!test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
|
2010-12-14 03:56:23 +08:00
|
|
|
/*
|
|
|
|
* this happens when a device that was properly setup
|
|
|
|
* in the device info lists suddenly goes bad.
|
|
|
|
* device->bdev is NULL, and so we have to set
|
|
|
|
* device->missing to one here
|
|
|
|
*/
|
2014-09-03 21:35:46 +08:00
|
|
|
device->fs_devices->missing_devices++;
|
2017-12-04 12:54:54 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
2014-09-03 21:35:46 +08:00
|
|
|
|
|
|
|
/* Move the device to its own fs_devices */
|
|
|
|
if (device->fs_devices != fs_devices) {
|
2017-12-04 12:54:54 +08:00
|
|
|
ASSERT(test_bit(BTRFS_DEV_STATE_MISSING,
|
|
|
|
&device->dev_state));
|
2014-09-03 21:35:46 +08:00
|
|
|
|
|
|
|
list_move(&device->dev_list, &fs_devices->devices);
|
|
|
|
device->fs_devices->num_devices--;
|
|
|
|
fs_devices->num_devices++;
|
|
|
|
|
|
|
|
device->fs_devices->missing_devices--;
|
|
|
|
fs_devices->missing_devices++;
|
|
|
|
|
|
|
|
device->fs_devices = fs_devices;
|
|
|
|
}
|
2008-11-18 10:11:30 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
if (device->fs_devices != fs_info->fs_devices) {
|
2017-12-04 12:54:52 +08:00
|
|
|
BUG_ON(test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state));
|
2008-11-18 10:11:30 +08:00
|
|
|
if (device->generation !=
|
|
|
|
btrfs_device_generation(leaf, dev_item))
|
|
|
|
return -EINVAL;
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
fill_device_from_item(leaf, dev_item, device);
|
2020-11-03 13:49:42 +08:00
|
|
|
if (device->bdev) {
|
2021-10-18 18:11:12 +08:00
|
|
|
u64 max_total_bytes = bdev_nr_bytes(device->bdev);
|
2020-11-03 13:49:42 +08:00
|
|
|
|
|
|
|
if (device->total_bytes > max_total_bytes) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"device total_bytes should be at most %llu but found %llu",
|
|
|
|
max_total_bytes, device->total_bytes);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
}
|
2017-12-04 12:54:53 +08:00
|
|
|
set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
|
2017-12-04 12:54:52 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
|
2017-12-04 12:54:55 +08:00
|
|
|
!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
|
2008-11-18 10:11:30 +08:00
|
|
|
device->fs_devices->total_rw_bytes += device->total_bytes;
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_add(device->total_bytes - device->bytes_used,
|
|
|
|
&fs_info->free_chunk_space);
|
2011-09-27 05:12:22 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = 0;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-22 09:16:51 +08:00
|
|
|
int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2016-09-20 22:05:02 +08:00
|
|
|
struct btrfs_super_block *super_copy = fs_info->super_copy;
|
2008-05-07 23:43:44 +08:00
|
|
|
struct extent_buffer *sb;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_disk_key *disk_key;
|
|
|
|
struct btrfs_chunk *chunk;
|
2014-11-01 02:02:42 +08:00
|
|
|
u8 *array_ptr;
|
|
|
|
unsigned long sb_array_offset;
|
2008-04-25 21:04:37 +08:00
|
|
|
int ret = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
u32 num_stripes;
|
|
|
|
u32 array_size;
|
|
|
|
u32 len = 0;
|
2014-11-01 02:02:42 +08:00
|
|
|
u32 cur_offset;
|
2016-06-04 03:05:15 +08:00
|
|
|
u64 type;
|
2008-04-25 21:04:37 +08:00
|
|
|
struct btrfs_key key;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ASSERT(BTRFS_SUPER_INFO_SIZE <= fs_info->nodesize);
|
2022-01-13 13:22:08 +08:00
|
|
|
|
2014-06-15 08:39:54 +08:00
|
|
|
/*
|
2022-01-13 13:22:08 +08:00
|
|
|
* We allocated a dummy extent, just to use extent buffer accessors.
|
|
|
|
* There will be unused space after BTRFS_SUPER_INFO_SIZE, but
|
|
|
|
* that's fine, we will not go beyond system chunk array anyway.
|
2014-06-15 08:39:54 +08:00
|
|
|
*/
|
2022-01-13 13:22:08 +08:00
|
|
|
sb = alloc_dummy_extent_buffer(fs_info, BTRFS_SUPER_INFO_OFFSET);
|
|
|
|
if (!sb)
|
|
|
|
return -ENOMEM;
|
2015-12-03 20:06:46 +08:00
|
|
|
set_extent_buffer_uptodate(sb);
|
2009-02-13 03:09:45 +08:00
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
write_extent_buffer(sb, super_copy, 0, BTRFS_SUPER_INFO_SIZE);
|
2008-03-25 03:01:56 +08:00
|
|
|
array_size = btrfs_super_sys_array_size(super_copy);
|
|
|
|
|
2014-11-01 02:02:42 +08:00
|
|
|
array_ptr = super_copy->sys_chunk_array;
|
|
|
|
sb_array_offset = offsetof(struct btrfs_super_block, sys_chunk_array);
|
|
|
|
cur_offset = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2014-11-01 02:02:42 +08:00
|
|
|
while (cur_offset < array_size) {
|
|
|
|
disk_key = (struct btrfs_disk_key *)array_ptr;
|
2014-11-05 22:24:51 +08:00
|
|
|
len = sizeof(*disk_key);
|
|
|
|
if (cur_offset + len > array_size)
|
|
|
|
goto out_short_read;
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_disk_key_to_cpu(&key, disk_key);
|
|
|
|
|
2014-11-01 02:02:42 +08:00
|
|
|
array_ptr += len;
|
|
|
|
sb_array_offset += len;
|
|
|
|
cur_offset += len;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2019-10-18 17:58:23 +08:00
|
|
|
if (key.type != BTRFS_CHUNK_ITEM_KEY) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"unexpected item type %u in sys_array at offset %u",
|
|
|
|
(u32)key.type, cur_offset);
|
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
2015-12-01 00:27:06 +08:00
|
|
|
|
2019-10-18 17:58:23 +08:00
|
|
|
chunk = (struct btrfs_chunk *)sb_array_offset;
|
|
|
|
/*
|
|
|
|
* At least one btrfs_chunk with one stripe must be present,
|
|
|
|
* exact stripe count check comes afterwards
|
|
|
|
*/
|
|
|
|
len = btrfs_chunk_item_size(1);
|
|
|
|
if (cur_offset + len > array_size)
|
|
|
|
goto out_short_read;
|
2016-06-04 03:05:15 +08:00
|
|
|
|
2019-10-18 17:58:23 +08:00
|
|
|
num_stripes = btrfs_chunk_num_stripes(sb, chunk);
|
|
|
|
if (!num_stripes) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"invalid number of stripes %u in sys_array at offset %u",
|
|
|
|
num_stripes, cur_offset);
|
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
2014-11-05 22:24:51 +08:00
|
|
|
|
2019-10-18 17:58:23 +08:00
|
|
|
type = btrfs_chunk_type(sb, chunk);
|
|
|
|
if ((type & BTRFS_BLOCK_GROUP_SYSTEM) == 0) {
|
2016-09-20 22:05:02 +08:00
|
|
|
btrfs_err(fs_info,
|
2019-10-18 17:58:23 +08:00
|
|
|
"invalid chunk type %llu in sys_array at offset %u",
|
|
|
|
type, cur_offset);
|
2008-04-25 21:04:37 +08:00
|
|
|
ret = -EIO;
|
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
2019-10-18 17:58:23 +08:00
|
|
|
|
|
|
|
len = btrfs_chunk_item_size(num_stripes);
|
|
|
|
if (cur_offset + len > array_size)
|
|
|
|
goto out_short_read;
|
|
|
|
|
|
|
|
ret = read_one_chunk(&key, sb, chunk);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
|
2014-11-01 02:02:42 +08:00
|
|
|
array_ptr += len;
|
|
|
|
sb_array_offset += len;
|
|
|
|
cur_offset += len;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
2016-06-04 08:41:42 +08:00
|
|
|
clear_extent_buffer_uptodate(sb);
|
2016-05-14 08:06:59 +08:00
|
|
|
free_extent_buffer_stale(sb);
|
2008-04-25 21:04:37 +08:00
|
|
|
return ret;
|
2014-11-05 22:24:51 +08:00
|
|
|
|
|
|
|
out_short_read:
|
2016-09-20 22:05:02 +08:00
|
|
|
btrfs_err(fs_info, "sys_array too short to read %u bytes at offset %u",
|
2014-11-05 22:24:51 +08:00
|
|
|
len, cur_offset);
|
2016-06-04 08:41:42 +08:00
|
|
|
clear_extent_buffer_uptodate(sb);
|
2016-05-14 08:06:59 +08:00
|
|
|
free_extent_buffer_stale(sb);
|
2014-11-05 22:24:51 +08:00
|
|
|
return -EIO;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
|
|
|
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
/*
|
|
|
|
* Check if all chunks in the fs are OK for read-write degraded mount
|
|
|
|
*
|
2017-12-18 17:08:59 +08:00
|
|
|
* If the @failing_dev is specified, it's accounted as missing.
|
|
|
|
*
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
* Return true if all chunks meet the minimal RW mount requirements.
|
|
|
|
* Return false if any chunk doesn't meet the minimal RW mount requirements.
|
|
|
|
*/
|
2017-12-18 17:08:59 +08:00
|
|
|
bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_device *failing_dev)
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
u64 next_start;
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
bool ret = true;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_find_chunk_map(fs_info, 0, U64_MAX);
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
/* No chunk at all? Return false anyway */
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (!map) {
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
ret = false;
|
|
|
|
goto out;
|
|
|
|
}
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
while (map) {
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
int missing = 0;
|
|
|
|
int max_tolerated;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
max_tolerated =
|
|
|
|
btrfs_get_num_tolerated_disk_barrier_failures(
|
|
|
|
map->type);
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
struct btrfs_device *dev = map->stripes[i].dev;
|
|
|
|
|
2017-12-04 12:54:54 +08:00
|
|
|
if (!dev || !dev->bdev ||
|
|
|
|
test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) ||
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
dev->last_flush_error)
|
|
|
|
missing++;
|
2017-12-18 17:08:59 +08:00
|
|
|
else if (failing_dev && failing_dev == dev)
|
|
|
|
missing++;
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
}
|
|
|
|
if (missing > max_tolerated) {
|
2017-12-18 17:08:59 +08:00
|
|
|
if (!failing_dev)
|
|
|
|
btrfs_warn(fs_info,
|
2018-11-28 19:05:13 +08:00
|
|
|
"chunk %llu missing %d devices, max tolerance is %d for writable mount",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map->start, missing, max_tolerated);
|
|
|
|
btrfs_free_chunk_map(map);
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
ret = false;
|
|
|
|
goto out;
|
|
|
|
}
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
next_start = map->start + map->chunk_len;
|
|
|
|
btrfs_free_chunk_map(map);
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_find_chunk_map(fs_info, next_start, U64_MAX - next_start);
|
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-03-09 09:34:36 +08:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-07-09 04:55:14 +08:00
|
|
|
static void readahead_tree_node_children(struct extent_buffer *node)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
const int nr_items = btrfs_header_nritems(node);
|
|
|
|
|
2020-11-05 23:45:09 +08:00
|
|
|
for (i = 0; i < nr_items; i++)
|
|
|
|
btrfs_readahead_node_child(node, i);
|
2020-07-09 04:55:14 +08:00
|
|
|
}
|
|
|
|
|
2016-06-21 22:40:19 +08:00
|
|
|
int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2016-06-21 22:40:19 +08:00
|
|
|
struct btrfs_root *root = fs_info->chunk_root;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
int ret;
|
|
|
|
int slot;
|
2022-03-09 21:50:50 +08:00
|
|
|
int iter_ret = 0;
|
2016-06-04 03:05:14 +08:00
|
|
|
u64 total_dev = 0;
|
2020-07-09 04:55:14 +08:00
|
|
|
u64 last_ra_node = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2018-04-12 10:29:32 +08:00
|
|
|
/*
|
|
|
|
* uuid_mutex is needed only if we are mounting a sprout FS
|
|
|
|
* otherwise we don't need it.
|
|
|
|
*/
|
2011-12-07 11:38:24 +08:00
|
|
|
mutex_lock(&uuid_mutex);
|
|
|
|
|
btrfs: fix mount failure caused by race with umount
It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.
Here is the sequence laid out in greater detail:
CPU0 CPU1
down_write sb->s_umount
btrfs_kill_super
kill_anon_super(sb)
generic_shutdown_super(sb);
shrink_dcache_for_umount(sb);
sync_filesystem(sb);
evict_inodes(sb); // SLOW
btrfs_mount_root
btrfs_scan_one_device
fs_devices = device->fs_devices
fs_info->fs_devices = fs_devices
// fs_devices-opened makes this a no-op
btrfs_open_devices(fs_devices, mode, fs_type)
s = sget(fs_type, test, set, flags, fs_info);
find sb in s_instances
grab_super(sb);
down_write(&s->s_umount); // blocks
sop->put_super(sb)
// sb->fs_devices->opened == 2; no-op
spin_lock(&sb_lock);
hlist_del_init(&sb->s_instances);
spin_unlock(&sb_lock);
up_write(&sb->s_umount);
return 0;
retry lookup
don't find sb in s_instances (deleted by CPU0)
s = alloc_super
return s;
btrfs_fill_super(s, fs_devices, data)
open_ctree // fs_devices total_rw_bytes improperly set!
btrfs_read_chunk_tree
read_one_dev // increment total_rw_bytes again!!
super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.
To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.
for i in $(seq 0 500)
do
dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo
does the trick for me.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-17 04:29:46 +08:00
|
|
|
/*
|
|
|
|
* It is possible for mount and umount to race in such a way that
|
|
|
|
* we execute this code path, but open_fs_devices failed to clear
|
|
|
|
* total_rw_bytes. We certainly want it cleared before reading the
|
|
|
|
* device items, so clear it here.
|
|
|
|
*/
|
|
|
|
fs_info->fs_devices->total_rw_bytes = 0;
|
|
|
|
|
btrfs: silence lockdep when reading chunk tree during mount
Often some test cases like btrfs/161 trigger lockdep splats that complain
about possible unsafe lock scenario due to the fact that during mount,
when reading the chunk tree we end up calling blkdev_get_by_path() while
holding a read lock on a leaf of the chunk tree. That produces a lockdep
splat like the following:
[ 3653.683975] ======================================================
[ 3653.685148] WARNING: possible circular locking dependency detected
[ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted
[ 3653.687239] ------------------------------------------------------
[ 3653.688400] mount/447465 is trying to acquire lock:
[ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.691054]
but task is already holding lock:
[ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.693978]
which lock already depends on the new lock.
[ 3653.695510]
the existing dependency chain (in reverse order) is:
[ 3653.696915]
-> #3 (btrfs-chunk-00){++++}-{3:3}:
[ 3653.698053] down_read_nested+0x4b/0x140
[ 3653.698893] __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.699988] btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 3653.701205] btrfs_search_slot+0x537/0xc00 [btrfs]
[ 3653.702234] btrfs_insert_empty_items+0x32/0x70 [btrfs]
[ 3653.703332] btrfs_init_new_device+0x563/0x15b0 [btrfs]
[ 3653.704439] btrfs_ioctl+0x2110/0x3530 [btrfs]
[ 3653.705405] __x64_sys_ioctl+0x83/0xb0
[ 3653.706215] do_syscall_64+0x3b/0xc0
[ 3653.706990] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.708040]
-> #2 (sb_internal#2){.+.+}-{0:0}:
[ 3653.708994] lock_release+0x13d/0x4a0
[ 3653.709533] up_write+0x18/0x160
[ 3653.710017] btrfs_sync_file+0x3f3/0x5b0 [btrfs]
[ 3653.710699] __loop_update_dio+0xbd/0x170 [loop]
[ 3653.711360] lo_ioctl+0x3b1/0x8a0 [loop]
[ 3653.711929] block_ioctl+0x48/0x50
[ 3653.712442] __x64_sys_ioctl+0x83/0xb0
[ 3653.712991] do_syscall_64+0x3b/0xc0
[ 3653.713519] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.714233]
-> #1 (&lo->lo_mutex){+.+.}-{3:3}:
[ 3653.715026] __mutex_lock+0x92/0x900
[ 3653.715648] lo_open+0x28/0x60 [loop]
[ 3653.716275] blkdev_get_whole+0x28/0x90
[ 3653.716867] blkdev_get_by_dev.part.0+0x142/0x320
[ 3653.717537] blkdev_open+0x5e/0xa0
[ 3653.718043] do_dentry_open+0x163/0x390
[ 3653.718604] path_openat+0x3f0/0xa80
[ 3653.719128] do_filp_open+0xa9/0x150
[ 3653.719652] do_sys_openat2+0x97/0x160
[ 3653.720197] __x64_sys_openat+0x54/0x90
[ 3653.720766] do_syscall_64+0x3b/0xc0
[ 3653.721285] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.721986]
-> #0 (&disk->open_mutex){+.+.}-{3:3}:
[ 3653.722775] __lock_acquire+0x130e/0x2210
[ 3653.723348] lock_acquire+0xd7/0x310
[ 3653.723867] __mutex_lock+0x92/0x900
[ 3653.724394] blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.725041] blkdev_get_by_path+0xb8/0xd0
[ 3653.725614] btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
[ 3653.726332] open_fs_devices+0xd7/0x2c0 [btrfs]
[ 3653.726999] btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
[ 3653.727739] open_ctree+0xb8e/0x17bf [btrfs]
[ 3653.728384] btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 3653.729130] legacy_get_tree+0x30/0x50
[ 3653.729676] vfs_get_tree+0x28/0xc0
[ 3653.730192] vfs_kern_mount.part.0+0x71/0xb0
[ 3653.730800] btrfs_mount+0x11d/0x3a0 [btrfs]
[ 3653.731427] legacy_get_tree+0x30/0x50
[ 3653.731970] vfs_get_tree+0x28/0xc0
[ 3653.732486] path_mount+0x2d4/0xbe0
[ 3653.732997] __x64_sys_mount+0x103/0x140
[ 3653.733560] do_syscall_64+0x3b/0xc0
[ 3653.734080] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.734782]
other info that might help us debug this:
[ 3653.735784] Chain exists of:
&disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00
[ 3653.737123] Possible unsafe locking scenario:
[ 3653.737865] CPU0 CPU1
[ 3653.738435] ---- ----
[ 3653.739007] lock(btrfs-chunk-00);
[ 3653.739449] lock(sb_internal#2);
[ 3653.740193] lock(btrfs-chunk-00);
[ 3653.740955] lock(&disk->open_mutex);
[ 3653.741431]
*** DEADLOCK ***
[ 3653.742176] 3 locks held by mount/447465:
[ 3653.742739] #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0
[ 3653.744114] #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs]
[ 3653.745563] #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.747066]
stack backtrace:
[ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1
[ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 3653.750592] Call Trace:
[ 3653.750967] dump_stack_lvl+0x57/0x72
[ 3653.751526] check_noncircular+0xf3/0x110
[ 3653.752136] ? stack_trace_save+0x4b/0x70
[ 3653.752748] __lock_acquire+0x130e/0x2210
[ 3653.753356] lock_acquire+0xd7/0x310
[ 3653.753898] ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.754596] ? lock_is_held_type+0xe8/0x140
[ 3653.755125] ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.755729] ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.756338] __mutex_lock+0x92/0x900
[ 3653.756794] ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.757400] ? do_raw_spin_unlock+0x4b/0xa0
[ 3653.757930] ? _raw_spin_unlock+0x29/0x40
[ 3653.758437] ? bd_prepare_to_claim+0x129/0x150
[ 3653.758999] ? trace_module_get+0x2b/0xd0
[ 3653.759508] ? try_module_get.part.0+0x50/0x80
[ 3653.760072] blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.760661] ? devcgroup_check_permission+0xc1/0x1f0
[ 3653.761288] blkdev_get_by_path+0xb8/0xd0
[ 3653.761797] btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
[ 3653.762454] open_fs_devices+0xd7/0x2c0 [btrfs]
[ 3653.763055] ? clone_fs_devices+0x8f/0x170 [btrfs]
[ 3653.763689] btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
[ 3653.764370] ? kvm_sched_clock_read+0x14/0x40
[ 3653.764922] open_ctree+0xb8e/0x17bf [btrfs]
[ 3653.765493] ? super_setup_bdi_name+0x79/0xd0
[ 3653.766043] btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 3653.766780] ? rcu_read_lock_sched_held+0x3f/0x80
[ 3653.767488] ? kfree+0x1f2/0x3c0
[ 3653.767979] legacy_get_tree+0x30/0x50
[ 3653.768548] vfs_get_tree+0x28/0xc0
[ 3653.769076] vfs_kern_mount.part.0+0x71/0xb0
[ 3653.769718] btrfs_mount+0x11d/0x3a0 [btrfs]
[ 3653.770381] ? rcu_read_lock_sched_held+0x3f/0x80
[ 3653.771086] ? kfree+0x1f2/0x3c0
[ 3653.771574] legacy_get_tree+0x30/0x50
[ 3653.772136] vfs_get_tree+0x28/0xc0
[ 3653.772673] path_mount+0x2d4/0xbe0
[ 3653.773201] __x64_sys_mount+0x103/0x140
[ 3653.773793] do_syscall_64+0x3b/0xc0
[ 3653.774333] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.775094] RIP: 0033:0x7f648bc45aaa
This happens because through btrfs_read_chunk_tree(), which is called only
during mount, ends up acquiring the mutex open_mutex of a block device
while holding a read lock on a leaf of the chunk tree while other paths
need to acquire other locks before locking extent buffers of the chunk
tree.
Since at mount time when we call btrfs_read_chunk_tree() we know that
we don't have other tasks running in parallel and modifying the chunk
tree, we can simply skip locking of chunk tree extent buffers. So do
that and move the assertion that checks the fs is not yet mounted to the
top block of btrfs_read_chunk_tree(), with a comment before doing it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-04 20:43:08 +08:00
|
|
|
/*
|
|
|
|
* Lockdep complains about possible circular locking dependency between
|
|
|
|
* a disk's open_mutex (struct gendisk.open_mutex), the rw semaphores
|
|
|
|
* used for freeze procection of a fs (struct super_block.s_writers),
|
|
|
|
* which we take when starting a transaction, and extent buffers of the
|
|
|
|
* chunk tree if we call read_one_dev() while holding a lock on an
|
|
|
|
* extent buffer of the chunk tree. Since we are mounting the filesystem
|
|
|
|
* and at this point there can't be any concurrent task modifying the
|
|
|
|
* chunk tree, to keep it simple, just skip locking on the chunk tree.
|
|
|
|
*/
|
|
|
|
ASSERT(!test_bit(BTRFS_FS_OPEN, &fs_info->flags));
|
|
|
|
path->skip_locking = 1;
|
|
|
|
|
2013-07-30 19:03:04 +08:00
|
|
|
/*
|
|
|
|
* Read all device items, and then all the chunk items. All
|
|
|
|
* device items are found before any chunk item (their object id
|
|
|
|
* is smaller than the lowest possible object id for a chunk
|
|
|
|
* item - BTRFS_FIRST_CHUNK_TREE_OBJECTID).
|
2008-03-25 03:01:56 +08:00
|
|
|
*/
|
|
|
|
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
|
|
|
|
key.offset = 0;
|
|
|
|
key.type = 0;
|
2022-03-09 21:50:50 +08:00
|
|
|
btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) {
|
|
|
|
struct extent_buffer *node = path->nodes[1];
|
2020-07-09 04:55:14 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
2022-03-09 21:50:50 +08:00
|
|
|
|
2020-07-09 04:55:14 +08:00
|
|
|
if (node) {
|
|
|
|
if (last_ra_node != node->start) {
|
|
|
|
readahead_tree_node_children(node);
|
|
|
|
last_ra_node = node->start;
|
|
|
|
}
|
|
|
|
}
|
2013-07-30 19:03:04 +08:00
|
|
|
if (found_key.type == BTRFS_DEV_ITEM_KEY) {
|
|
|
|
struct btrfs_dev_item *dev_item;
|
|
|
|
dev_item = btrfs_item_ptr(leaf, slot,
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_dev_item);
|
2019-03-20 23:45:15 +08:00
|
|
|
ret = read_one_dev(leaf, dev_item);
|
2013-07-30 19:03:04 +08:00
|
|
|
if (ret)
|
|
|
|
goto error;
|
2016-06-04 03:05:14 +08:00
|
|
|
total_dev++;
|
2008-03-25 03:01:56 +08:00
|
|
|
} else if (found_key.type == BTRFS_CHUNK_ITEM_KEY) {
|
|
|
|
struct btrfs_chunk *chunk;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 21:43:06 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We are only called at mount time, so no need to take
|
|
|
|
* fs_info->chunk_mutex. Plus, to avoid lockdep warnings,
|
|
|
|
* we always lock first fs_info->chunk_mutex before
|
|
|
|
* acquiring any locks on the chunk tree. This is a
|
|
|
|
* requirement for chunk allocation, see the comment on
|
|
|
|
* top of btrfs_chunk_alloc() for details.
|
|
|
|
*/
|
2008-03-25 03:01:56 +08:00
|
|
|
chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
|
2019-03-20 23:43:07 +08:00
|
|
|
ret = read_one_chunk(&found_key, leaf, chunk);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (ret)
|
|
|
|
goto error;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
2022-03-09 21:50:50 +08:00
|
|
|
}
|
|
|
|
/* Catch error found during iteration */
|
|
|
|
if (iter_ret < 0) {
|
|
|
|
ret = iter_ret;
|
|
|
|
goto error;
|
2008-03-25 03:01:56 +08:00
|
|
|
}
|
2016-06-04 03:05:14 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* After loading chunk tree, we've got all device information,
|
|
|
|
* do another round of validation checks.
|
|
|
|
*/
|
2016-06-23 06:54:23 +08:00
|
|
|
if (total_dev != fs_info->fs_devices->total_devices) {
|
btrfs: repair super block num_devices automatically
[BUG]
There is a report that a btrfs has a bad super block num devices.
This makes btrfs to reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus it will
only be written back to disk in next transaction.
So after the transaction X in btrfs_rm_dev_item() committed, but before
transaction X+1 (which can be minutes away), a power loss happen, then
we got the super num mismatch.
This has been fixed by commit bbac58698a55 ("btrfs: remove device item
and update super block in the same transaction").
[FIX]
Make the super_num_devices check less strict, converting it from a hard
error to a warning, and reset the value to a correct one for the current
or next transaction commit.
As the number of device items is the critical information where the
super block num_devices is only a cached value (and also useful for
cross checking), it's safe to automatically update it. Other device
related problems like missing device are handled after that and may
require other means to resolve, like degraded mount. With this fix,
potentially affected filesystems won't fail mount and require the manual
repair by btrfs check.
Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-28 15:05:53 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"super block num_devices %llu mismatch with DEV_ITEM count %llu, will be repaired on next transaction commit",
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_super_num_devices(fs_info->super_copy),
|
2016-06-04 03:05:14 +08:00
|
|
|
total_dev);
|
btrfs: repair super block num_devices automatically
[BUG]
There is a report that a btrfs has a bad super block num devices.
This makes btrfs to reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus it will
only be written back to disk in next transaction.
So after the transaction X in btrfs_rm_dev_item() committed, but before
transaction X+1 (which can be minutes away), a power loss happen, then
we got the super num mismatch.
This has been fixed by commit bbac58698a55 ("btrfs: remove device item
and update super block in the same transaction").
[FIX]
Make the super_num_devices check less strict, converting it from a hard
error to a warning, and reset the value to a correct one for the current
or next transaction commit.
As the number of device items is the critical information where the
super block num_devices is only a cached value (and also useful for
cross checking), it's safe to automatically update it. Other device
related problems like missing device are handled after that and may
require other means to resolve, like degraded mount. With this fix,
potentially affected filesystems won't fail mount and require the manual
repair by btrfs check.
Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-28 15:05:53 +08:00
|
|
|
fs_info->fs_devices->total_devices = total_dev;
|
|
|
|
btrfs_set_super_num_devices(fs_info->super_copy, total_dev);
|
2016-06-04 03:05:14 +08:00
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
if (btrfs_super_total_bytes(fs_info->super_copy) <
|
|
|
|
fs_info->fs_devices->total_rw_bytes) {
|
|
|
|
btrfs_err(fs_info,
|
2016-06-04 03:05:14 +08:00
|
|
|
"super_total_bytes %llu mismatch with fs_devices total_rw_bytes %llu",
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_super_total_bytes(fs_info->super_copy),
|
|
|
|
fs_info->fs_devices->total_rw_bytes);
|
2016-06-04 03:05:14 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto error;
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = 0;
|
|
|
|
error:
|
2011-12-07 11:38:24 +08:00
|
|
|
mutex_unlock(&uuid_mutex);
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
btrfs_free_path(path);
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2012-05-25 22:06:08 +08:00
|
|
|
|
2022-11-04 22:12:34 +08:00
|
|
|
int btrfs_init_devices_late(struct btrfs_fs_info *fs_info)
|
2013-05-15 15:48:19 +08:00
|
|
|
{
|
2020-07-16 15:25:33 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices, *seed_devs;
|
2013-05-15 15:48:19 +08:00
|
|
|
struct btrfs_device *device;
|
2022-11-04 22:12:34 +08:00
|
|
|
int ret = 0;
|
2013-05-15 15:48:19 +08:00
|
|
|
|
2020-07-16 15:25:33 +08:00
|
|
|
fs_devices->fs_info = fs_info;
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list)
|
|
|
|
device->fs_info = fs_info;
|
|
|
|
|
|
|
|
list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list) {
|
2022-11-04 22:12:34 +08:00
|
|
|
list_for_each_entry(device, &seed_devs->devices, dev_list) {
|
2016-06-23 06:54:56 +08:00
|
|
|
device->fs_info = fs_info;
|
2022-11-04 22:12:34 +08:00
|
|
|
ret = btrfs_get_dev_zone_info(device, false);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
}
|
2014-05-11 23:14:59 +08:00
|
|
|
|
2020-07-16 15:25:33 +08:00
|
|
|
seed_devs->fs_info = fs_info;
|
2014-05-11 23:14:59 +08:00
|
|
|
}
|
2020-09-05 01:34:31 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
2022-11-04 22:12:34 +08:00
|
|
|
|
|
|
|
return ret;
|
2013-05-15 15:48:19 +08:00
|
|
|
}
|
|
|
|
|
2019-08-22 02:05:32 +08:00
|
|
|
static u64 btrfs_dev_stats_value(const struct extent_buffer *eb,
|
|
|
|
const struct btrfs_dev_stats_item *ptr,
|
|
|
|
int index)
|
|
|
|
{
|
|
|
|
u64 val;
|
|
|
|
|
|
|
|
read_extent_buffer(eb, &val,
|
|
|
|
offsetof(struct btrfs_dev_stats_item, values) +
|
|
|
|
((unsigned long)ptr) + (index * sizeof(u64)),
|
|
|
|
sizeof(val));
|
|
|
|
return val;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void btrfs_set_dev_stats_value(struct extent_buffer *eb,
|
|
|
|
struct btrfs_dev_stats_item *ptr,
|
|
|
|
int index, u64 val)
|
|
|
|
{
|
|
|
|
write_extent_buffer(eb, &val,
|
|
|
|
offsetof(struct btrfs_dev_stats_item, values) +
|
|
|
|
((unsigned long)ptr) + (index * sizeof(u64)),
|
|
|
|
sizeof(val));
|
|
|
|
}
|
|
|
|
|
2020-09-19 04:44:33 +08:00
|
|
|
static int btrfs_device_init_dev_stats(struct btrfs_device *device,
|
|
|
|
struct btrfs_path *path)
|
2012-05-25 22:06:10 +08:00
|
|
|
{
|
2020-09-19 04:44:32 +08:00
|
|
|
struct btrfs_dev_stats_item *ptr;
|
2012-05-25 22:06:10 +08:00
|
|
|
struct extent_buffer *eb;
|
2020-09-19 04:44:32 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
int item_size;
|
|
|
|
int i, ret, slot;
|
|
|
|
|
2021-03-12 00:23:15 +08:00
|
|
|
if (!device->fs_info->dev_root)
|
|
|
|
return 0;
|
|
|
|
|
2020-09-19 04:44:32 +08:00
|
|
|
key.objectid = BTRFS_DEV_STATS_OBJECTID;
|
|
|
|
key.type = BTRFS_PERSISTENT_ITEM_KEY;
|
|
|
|
key.offset = device->devid;
|
|
|
|
ret = btrfs_search_slot(NULL, device->fs_info->dev_root, &key, path, 0, 0);
|
|
|
|
if (ret) {
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
|
|
|
|
btrfs_dev_stat_set(device, i, 0);
|
|
|
|
device->dev_stats_valid = 1;
|
|
|
|
btrfs_release_path(path);
|
2020-09-19 04:44:33 +08:00
|
|
|
return ret < 0 ? ret : 0;
|
2020-09-19 04:44:32 +08:00
|
|
|
}
|
|
|
|
slot = path->slots[0];
|
|
|
|
eb = path->nodes[0];
|
2021-10-22 02:58:35 +08:00
|
|
|
item_size = btrfs_item_size(eb, slot);
|
2020-09-19 04:44:32 +08:00
|
|
|
|
|
|
|
ptr = btrfs_item_ptr(eb, slot, struct btrfs_dev_stats_item);
|
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++) {
|
|
|
|
if (item_size >= (1 + i) * sizeof(__le64))
|
|
|
|
btrfs_dev_stat_set(device, i,
|
|
|
|
btrfs_dev_stats_value(eb, ptr, i));
|
|
|
|
else
|
|
|
|
btrfs_dev_stat_set(device, i, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
device->dev_stats_valid = 1;
|
|
|
|
btrfs_dev_stat_print_on_load(device);
|
|
|
|
btrfs_release_path(path);
|
2020-09-19 04:44:33 +08:00
|
|
|
|
|
|
|
return 0;
|
2020-09-19 04:44:32 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices, *seed_devs;
|
2012-05-25 22:06:10 +08:00
|
|
|
struct btrfs_device *device;
|
|
|
|
struct btrfs_path *path = NULL;
|
2020-09-19 04:44:33 +08:00
|
|
|
int ret = 0;
|
2012-05-25 22:06:10 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
2019-08-21 17:26:32 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2012-05-25 22:06:10 +08:00
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2020-09-19 04:44:33 +08:00
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list) {
|
|
|
|
ret = btrfs_device_init_dev_stats(device, path);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
2020-09-19 04:44:32 +08:00
|
|
|
list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list) {
|
2020-09-19 04:44:33 +08:00
|
|
|
list_for_each_entry(device, &seed_devs->devices, dev_list) {
|
|
|
|
ret = btrfs_device_init_dev_stats(device, path);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-05-25 22:06:10 +08:00
|
|
|
}
|
2020-09-19 04:44:33 +08:00
|
|
|
out:
|
2012-05-25 22:06:10 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
btrfs_free_path(path);
|
2020-09-19 04:44:33 +08:00
|
|
|
return ret;
|
2012-05-25 22:06:10 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int update_dev_stat_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_device *device)
|
|
|
|
{
|
2018-07-21 00:37:49 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2016-06-22 09:16:51 +08:00
|
|
|
struct btrfs_root *dev_root = fs_info->dev_root;
|
2012-05-25 22:06:10 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
struct btrfs_dev_stats_item *ptr;
|
|
|
|
int ret;
|
|
|
|
int i;
|
|
|
|
|
2016-01-26 00:51:31 +08:00
|
|
|
key.objectid = BTRFS_DEV_STATS_OBJECTID;
|
|
|
|
key.type = BTRFS_PERSISTENT_ITEM_KEY;
|
2012-05-25 22:06:10 +08:00
|
|
|
key.offset = device->devid;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
2017-02-15 16:35:01 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2012-05-25 22:06:10 +08:00
|
|
|
ret = btrfs_search_slot(trans, dev_root, &key, path, -1, 1);
|
|
|
|
if (ret < 0) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
2015-10-08 15:01:03 +08:00
|
|
|
"error %d while searching for dev_stats item for device %s",
|
2022-11-13 09:32:07 +08:00
|
|
|
ret, btrfs_dev_name(device));
|
2012-05-25 22:06:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ret == 0 &&
|
2021-10-22 02:58:35 +08:00
|
|
|
btrfs_item_size(path->nodes[0], path->slots[0]) < sizeof(*ptr)) {
|
2012-05-25 22:06:10 +08:00
|
|
|
/* need to delete old one and insert a new one */
|
|
|
|
ret = btrfs_del_item(trans, dev_root, path);
|
|
|
|
if (ret != 0) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
2015-10-08 15:01:03 +08:00
|
|
|
"delete too small dev_stats item for device %s failed %d",
|
2022-11-13 09:32:07 +08:00
|
|
|
btrfs_dev_name(device), ret);
|
2012-05-25 22:06:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
ret = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ret == 1) {
|
|
|
|
/* need to insert a new item */
|
|
|
|
btrfs_release_path(path);
|
|
|
|
ret = btrfs_insert_empty_item(trans, dev_root, path,
|
|
|
|
&key, sizeof(*ptr));
|
|
|
|
if (ret < 0) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
2015-10-08 15:01:03 +08:00
|
|
|
"insert dev_stats item for device %s failed %d",
|
2022-11-13 09:32:07 +08:00
|
|
|
btrfs_dev_name(device), ret);
|
2012-05-25 22:06:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
eb = path->nodes[0];
|
|
|
|
ptr = btrfs_item_ptr(eb, path->slots[0], struct btrfs_dev_stats_item);
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
|
|
|
|
btrfs_set_dev_stats_value(eb, ptr, i,
|
|
|
|
btrfs_dev_stat_read(device, i));
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_mark_buffer_dirty(trans, eb);
|
2012-05-25 22:06:10 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* called from commit_transaction. Writes all changed device stats to disk.
|
|
|
|
*/
|
2019-03-20 23:50:38 +08:00
|
|
|
int btrfs_run_dev_stats(struct btrfs_trans_handle *trans)
|
2012-05-25 22:06:10 +08:00
|
|
|
{
|
2019-03-20 23:50:38 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2012-05-25 22:06:10 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
|
|
|
struct btrfs_device *device;
|
2014-07-24 11:37:11 +08:00
|
|
|
int stats_cnt;
|
2012-05-25 22:06:10 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
|
|
|
list_for_each_entry(device, &fs_devices->devices, dev_list) {
|
2017-10-24 18:47:37 +08:00
|
|
|
stats_cnt = atomic_read(&device->dev_stats_ccnt);
|
|
|
|
if (!device->dev_stats_valid || stats_cnt == 0)
|
2012-05-25 22:06:10 +08:00
|
|
|
continue;
|
|
|
|
|
2017-10-24 18:47:37 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* There is a LOAD-LOAD control dependency between the value of
|
|
|
|
* dev_stats_ccnt and updating the on-disk values which requires
|
|
|
|
* reading the in-memory counters. Such control dependencies
|
|
|
|
* require explicit read memory barriers.
|
|
|
|
*
|
|
|
|
* This memory barriers pairs with smp_mb__before_atomic in
|
|
|
|
* btrfs_dev_stat_inc/btrfs_dev_stat_set and with the full
|
|
|
|
* barrier implied by atomic_xchg in
|
|
|
|
* btrfs_dev_stats_read_and_reset
|
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
|
2018-07-21 00:37:49 +08:00
|
|
|
ret = update_dev_stat_item(trans, device);
|
2012-05-25 22:06:10 +08:00
|
|
|
if (!ret)
|
2014-07-24 11:37:11 +08:00
|
|
|
atomic_sub(stats_cnt, &device->dev_stats_ccnt);
|
2012-05-25 22:06:10 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-05-25 22:06:08 +08:00
|
|
|
void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)
|
|
|
|
{
|
|
|
|
btrfs_dev_stat_inc(dev, index);
|
|
|
|
|
2012-05-25 22:06:10 +08:00
|
|
|
if (!dev->dev_stats_valid)
|
|
|
|
return;
|
2016-06-23 06:54:56 +08:00
|
|
|
btrfs_err_rl_in_rcu(dev->fs_info,
|
2015-10-08 16:43:10 +08:00
|
|
|
"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
|
2022-11-13 09:32:07 +08:00
|
|
|
btrfs_dev_name(dev),
|
2012-05-25 22:06:08 +08:00
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
|
2013-12-21 00:37:06 +08:00
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_GENERATION_ERRS));
|
2012-05-25 22:06:08 +08:00
|
|
|
}
|
2012-05-25 22:06:09 +08:00
|
|
|
|
2012-05-25 22:06:10 +08:00
|
|
|
static void btrfs_dev_stat_print_on_load(struct btrfs_device *dev)
|
|
|
|
{
|
2012-07-17 23:02:11 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
|
|
|
|
if (btrfs_dev_stat_read(dev, i) != 0)
|
|
|
|
break;
|
|
|
|
if (i == BTRFS_DEV_STAT_VALUES_MAX)
|
|
|
|
return; /* all values == 0, suppress message */
|
|
|
|
|
2016-06-23 06:54:56 +08:00
|
|
|
btrfs_info_in_rcu(dev->fs_info,
|
2015-10-08 15:01:03 +08:00
|
|
|
"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
|
2022-11-13 09:32:07 +08:00
|
|
|
btrfs_dev_name(dev),
|
2012-05-25 22:06:10 +08:00
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS),
|
|
|
|
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_GENERATION_ERRS));
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
|
2012-06-22 20:30:39 +08:00
|
|
|
struct btrfs_ioctl_get_dev_stats *stats)
|
2012-05-25 22:06:09 +08:00
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
BTRFS_DEV_LOOKUP_ARGS(args);
|
2012-05-25 22:06:09 +08:00
|
|
|
struct btrfs_device *dev;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2012-05-25 22:06:09 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
mutex_lock(&fs_devices->device_list_mutex);
|
2021-10-06 04:12:42 +08:00
|
|
|
args.devid = stats->devid;
|
|
|
|
dev = btrfs_find_device(fs_info->fs_devices, &args);
|
2012-05-25 22:06:09 +08:00
|
|
|
mutex_unlock(&fs_devices->device_list_mutex);
|
|
|
|
|
|
|
|
if (!dev) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info, "get dev_stats failed, device not found");
|
2012-05-25 22:06:09 +08:00
|
|
|
return -ENODEV;
|
2012-05-25 22:06:10 +08:00
|
|
|
} else if (!dev->dev_stats_valid) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info, "get dev_stats failed, not yet valid");
|
2012-05-25 22:06:10 +08:00
|
|
|
return -ENODEV;
|
2012-06-22 20:30:39 +08:00
|
|
|
} else if (stats->flags & BTRFS_DEV_STATS_RESET) {
|
2012-05-25 22:06:09 +08:00
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++) {
|
|
|
|
if (stats->nr_items > i)
|
|
|
|
stats->values[i] =
|
|
|
|
btrfs_dev_stat_read_and_reset(dev, i);
|
|
|
|
else
|
2019-08-07 16:21:19 +08:00
|
|
|
btrfs_dev_stat_set(dev, i, 0);
|
2012-05-25 22:06:09 +08:00
|
|
|
}
|
2020-01-10 12:26:34 +08:00
|
|
|
btrfs_info(fs_info, "device stats zeroed by %s (%d)",
|
|
|
|
current->comm, task_pid_nr(current));
|
2012-05-25 22:06:09 +08:00
|
|
|
} else {
|
|
|
|
for (i = 0; i < BTRFS_DEV_STAT_VALUES_MAX; i++)
|
|
|
|
if (stats->nr_items > i)
|
|
|
|
stats->values[i] = btrfs_dev_stat_read(dev, i);
|
|
|
|
}
|
|
|
|
if (stats->nr_items > BTRFS_DEV_STAT_VALUES_MAX)
|
|
|
|
stats->nr_items = BTRFS_DEV_STAT_VALUES_MAX;
|
|
|
|
return 0;
|
|
|
|
}
|
2012-11-05 22:50:14 +08:00
|
|
|
|
2014-09-03 21:35:33 +08:00
|
|
|
/*
|
2019-03-25 20:31:22 +08:00
|
|
|
* Update the size and bytes used for each device where it changed. This is
|
|
|
|
* delayed since we would otherwise get errors while writing out the
|
|
|
|
* superblocks.
|
|
|
|
*
|
|
|
|
* Must be invoked during transaction commit.
|
2014-09-03 21:35:33 +08:00
|
|
|
*/
|
2019-03-25 20:31:22 +08:00
|
|
|
void btrfs_commit_device_sizes(struct btrfs_transaction *trans)
|
2014-09-03 21:35:33 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *curr, *next;
|
|
|
|
|
2019-03-25 20:31:22 +08:00
|
|
|
ASSERT(trans->state == TRANS_STATE_COMMIT_DOING);
|
2014-09-03 21:35:34 +08:00
|
|
|
|
2019-03-25 20:31:22 +08:00
|
|
|
if (list_empty(&trans->dev_update_list))
|
2014-09-03 21:35:34 +08:00
|
|
|
return;
|
|
|
|
|
2019-03-25 20:31:22 +08:00
|
|
|
/*
|
|
|
|
* We don't need the device_list_mutex here. This list is owned by the
|
|
|
|
* transaction and the transaction must complete before the device is
|
|
|
|
* released.
|
|
|
|
*/
|
|
|
|
mutex_lock(&trans->fs_info->chunk_mutex);
|
|
|
|
list_for_each_entry_safe(curr, next, &trans->dev_update_list,
|
|
|
|
post_commit_list) {
|
|
|
|
list_del_init(&curr->post_commit_list);
|
|
|
|
curr->commit_total_bytes = curr->disk_total_bytes;
|
|
|
|
curr->commit_bytes_used = curr->bytes_used;
|
2014-09-03 21:35:34 +08:00
|
|
|
}
|
2019-03-25 20:31:22 +08:00
|
|
|
mutex_unlock(&trans->fs_info->chunk_mutex);
|
2014-09-03 21:35:34 +08:00
|
|
|
}
|
2015-03-10 06:38:31 +08:00
|
|
|
|
2018-07-14 02:46:30 +08:00
|
|
|
/*
|
|
|
|
* Multiplicity factor for simple profiles: DUP, RAID1-like and RAID10.
|
|
|
|
*/
|
|
|
|
int btrfs_bg_type_to_factor(u64 flags)
|
|
|
|
{
|
2019-05-17 17:43:31 +08:00
|
|
|
const int index = btrfs_bg_flags_to_raid_index(flags);
|
|
|
|
|
|
|
|
return btrfs_raid_array[index].ncopies;
|
2018-07-14 02:46:30 +08:00
|
|
|
}
|
2018-08-01 10:37:19 +08:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 chunk_offset, u64 devid,
|
|
|
|
u64 physical_offset, u64 physical_len)
|
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
struct btrfs_dev_lookup_args args = { .devid = devid };
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2018-10-05 17:45:55 +08:00
|
|
|
struct btrfs_device *dev;
|
2018-08-01 10:37:19 +08:00
|
|
|
u64 stripe_len;
|
|
|
|
bool found = false;
|
|
|
|
int ret = 0;
|
|
|
|
int i;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_find_chunk_map(fs_info, chunk_offset, 1);
|
|
|
|
if (!map) {
|
2018-08-01 10:37:19 +08:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent physical offset %llu on devid %llu doesn't have corresponding chunk",
|
|
|
|
physical_offset, devid);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
stripe_len = btrfs_calc_stripe_length(map);
|
2018-08-01 10:37:19 +08:00
|
|
|
if (physical_len != stripe_len) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent physical offset %llu on devid %llu length doesn't match chunk %llu, have %llu expect %llu",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
physical_offset, devid, map->start, physical_len,
|
2018-08-01 10:37:19 +08:00
|
|
|
stripe_len);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2022-06-13 15:06:35 +08:00
|
|
|
/*
|
|
|
|
* Very old mkfs.btrfs (before v4.1) will not respect the reserved
|
|
|
|
* space. Although kernel can handle it without problem, better to warn
|
|
|
|
* the users.
|
|
|
|
*/
|
|
|
|
if (physical_offset < BTRFS_DEVICE_RANGE_RESERVED)
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"devid %llu physical %llu len %llu inside the reserved space",
|
|
|
|
devid, physical_offset, physical_len);
|
|
|
|
|
2018-08-01 10:37:19 +08:00
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
if (map->stripes[i].dev->devid == devid &&
|
|
|
|
map->stripes[i].physical == physical_offset) {
|
|
|
|
found = true;
|
|
|
|
if (map->verified_stripes >= map->num_stripes) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"too many dev extents for chunk %llu found",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map->start);
|
2018-08-01 10:37:19 +08:00
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
map->verified_stripes++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!found) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent physical offset %llu devid %llu has no corresponding chunk",
|
|
|
|
physical_offset, devid);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
}
|
2018-10-05 17:45:55 +08:00
|
|
|
|
2021-05-21 23:42:23 +08:00
|
|
|
/* Make sure no dev extent is beyond device boundary */
|
2021-10-06 04:12:42 +08:00
|
|
|
dev = btrfs_find_device(fs_info->fs_devices, &args);
|
2018-10-05 17:45:55 +08:00
|
|
|
if (!dev) {
|
|
|
|
btrfs_err(fs_info, "failed to find devid %llu", devid);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
2019-01-08 14:08:18 +08:00
|
|
|
|
2018-10-05 17:45:55 +08:00
|
|
|
if (physical_offset + physical_len > dev->disk_total_bytes) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent devid %llu physical offset %llu len %llu is beyond device boundary %llu",
|
|
|
|
devid, physical_offset, physical_len,
|
|
|
|
dev->disk_total_bytes);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
2021-02-04 18:21:49 +08:00
|
|
|
|
|
|
|
if (dev->zone_info) {
|
|
|
|
u64 zone_size = dev->zone_info->zone_size;
|
|
|
|
|
|
|
|
if (!IS_ALIGNED(physical_offset, zone_size) ||
|
|
|
|
!IS_ALIGNED(physical_len, zone_size)) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"zoned: dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
|
|
|
|
devid, physical_offset, physical_len);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-08-01 10:37:19 +08:00
|
|
|
out:
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2018-08-01 10:37:19 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int verify_chunk_dev_extent_mapping(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
int ret = 0;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
read_lock(&fs_info->mapping_tree_lock);
|
|
|
|
for (node = rb_first_cached(&fs_info->mapping_tree); node; node = rb_next(node)) {
|
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
|
|
|
|
map = rb_entry(node, struct btrfs_chunk_map, rb_node);
|
|
|
|
if (map->num_stripes != map->verified_stripes) {
|
2018-08-01 10:37:19 +08:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"chunk %llu has missing dev extent, have %d expect %d",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map->start, map->verified_stripes, map->num_stripes);
|
2018-08-01 10:37:19 +08:00
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
out:
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
read_unlock(&fs_info->mapping_tree_lock);
|
2018-08-01 10:37:19 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure that all dev extents are mapped to correct chunk, otherwise
|
|
|
|
* later chunk allocation/free would cause unexpected behavior.
|
|
|
|
*
|
|
|
|
* NOTE: This will iterate through the whole device tree, which should be of
|
|
|
|
* the same size level as the chunk tree. This slightly increases mount time.
|
|
|
|
*/
|
|
|
|
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
|
|
|
struct btrfs_key key;
|
btrfs: volumes: Make sure there is no overlap of dev extents at mount time
Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.
Analysis from Hans:
"Imagine allocating a DATA|DUP chunk.
In the chunk allocator, we first set...
max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.
Then...
/* we don't want a chunk larger than 10% of writeable space */
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
max_chunk_size);
Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.
The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
stripe_size is now 789MiB
Next we do...
data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.
Next there's a check...
if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.
We go into the if code, and next is...
stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB
Next is a fun one...
/* bump the answer up to a 16MB boundary */
stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.
We're not done changing stripe_size yet...
/* But don't go higher than the limits we found while searching
* for free extents
*/
stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);
This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).
The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
The last shown line of code relies on a situation where there's twice
the value of stripe_size present as value for the variable stripe_size
when it's DUP. This was actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
"[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."
In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?
Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?
What is the main purpose of this round_up?
The most straightforward thing to do seems something like...
stripe_size = min(
div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
stripe_size
)
..just putting half of the max_avail into stripe_size."
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-05 17:45:54 +08:00
|
|
|
u64 prev_devid = 0;
|
|
|
|
u64 prev_dev_ext_end = 0;
|
2018-08-01 10:37:19 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2020-10-16 23:29:18 +08:00
|
|
|
/*
|
|
|
|
* We don't have a dev_root because we mounted with ignorebadroots and
|
|
|
|
* failed to load the root, so we want to skip the verification in this
|
|
|
|
* case for sure.
|
|
|
|
*
|
|
|
|
* However if the dev root is fine, but the tree itself is corrupted
|
|
|
|
* we'd still fail to mount. This verification is only to make sure
|
|
|
|
* writes can happen safely, so instead just bypass this check
|
|
|
|
* completely in the case of IGNOREBADROOTS.
|
|
|
|
*/
|
|
|
|
if (btrfs_test_opt(fs_info, IGNOREBADROOTS))
|
|
|
|
return 0;
|
|
|
|
|
2018-08-01 10:37:19 +08:00
|
|
|
key.objectid = 1;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
path->reada = READA_FORWARD;
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
|
2021-07-13 21:58:03 +08:00
|
|
|
ret = btrfs_next_leaf(root, path);
|
2018-08-01 10:37:19 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
/* No dev extents at all? Not good */
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
while (1) {
|
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
struct btrfs_dev_extent *dext;
|
|
|
|
int slot = path->slots[0];
|
|
|
|
u64 chunk_offset;
|
|
|
|
u64 physical_offset;
|
|
|
|
u64 physical_len;
|
|
|
|
u64 devid;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
if (key.type != BTRFS_DEV_EXTENT_KEY)
|
|
|
|
break;
|
|
|
|
devid = key.objectid;
|
|
|
|
physical_offset = key.offset;
|
|
|
|
|
|
|
|
dext = btrfs_item_ptr(leaf, slot, struct btrfs_dev_extent);
|
|
|
|
chunk_offset = btrfs_dev_extent_chunk_offset(leaf, dext);
|
|
|
|
physical_len = btrfs_dev_extent_length(leaf, dext);
|
|
|
|
|
btrfs: volumes: Make sure there is no overlap of dev extents at mount time
Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.
Analysis from Hans:
"Imagine allocating a DATA|DUP chunk.
In the chunk allocator, we first set...
max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.
Then...
/* we don't want a chunk larger than 10% of writeable space */
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
max_chunk_size);
Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.
The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
stripe_size is now 789MiB
Next we do...
data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.
Next there's a check...
if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.
We go into the if code, and next is...
stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB
Next is a fun one...
/* bump the answer up to a 16MB boundary */
stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.
We're not done changing stripe_size yet...
/* But don't go higher than the limits we found while searching
* for free extents
*/
stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);
This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).
The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
The last shown line of code relies on a situation where there's twice
the value of stripe_size present as value for the variable stripe_size
when it's DUP. This was actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
"[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."
In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?
Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?
What is the main purpose of this round_up?
The most straightforward thing to do seems something like...
stripe_size = min(
div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
stripe_size
)
..just putting half of the max_avail into stripe_size."
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-05 17:45:54 +08:00
|
|
|
/* Check if this dev extent overlaps with the previous one */
|
|
|
|
if (devid == prev_devid && physical_offset < prev_dev_ext_end) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"dev extent devid %llu physical offset %llu overlap with previous dev extent end %llu",
|
|
|
|
devid, physical_offset, prev_dev_ext_end);
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2018-08-01 10:37:19 +08:00
|
|
|
ret = verify_one_dev_extent(fs_info, chunk_offset, devid,
|
|
|
|
physical_offset, physical_len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
btrfs: volumes: Make sure there is no overlap of dev extents at mount time
Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.
Analysis from Hans:
"Imagine allocating a DATA|DUP chunk.
In the chunk allocator, we first set...
max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.
Then...
/* we don't want a chunk larger than 10% of writeable space */
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
max_chunk_size);
Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.
The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
stripe_size is now 789MiB
Next we do...
data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.
Next there's a check...
if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.
We go into the if code, and next is...
stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB
Next is a fun one...
/* bump the answer up to a 16MB boundary */
stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.
We're not done changing stripe_size yet...
/* But don't go higher than the limits we found while searching
* for free extents
*/
stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);
This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).
The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
The last shown line of code relies on a situation where there's twice
the value of stripe_size present as value for the variable stripe_size
when it's DUP. This was actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
"[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."
In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?
Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?
What is the main purpose of this round_up?
The most straightforward thing to do seems something like...
stripe_size = min(
div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
stripe_size
)
..just putting half of the max_avail into stripe_size."
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-05 17:45:54 +08:00
|
|
|
prev_devid = devid;
|
|
|
|
prev_dev_ext_end = physical_offset + physical_len;
|
|
|
|
|
2018-08-01 10:37:19 +08:00
|
|
|
ret = btrfs_next_item(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Ensure all chunks have corresponding dev extents */
|
|
|
|
ret = verify_chunk_dev_extent_mapping(fs_info);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check whether the given block group or device is pinned by any inode being
|
|
|
|
* used as a swapfile.
|
|
|
|
*/
|
|
|
|
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
|
|
|
|
{
|
|
|
|
struct btrfs_swapfile_pin *sp;
|
|
|
|
struct rb_node *node;
|
|
|
|
|
|
|
|
spin_lock(&fs_info->swapfile_pins_lock);
|
|
|
|
node = fs_info->swapfile_pins.rb_node;
|
|
|
|
while (node) {
|
|
|
|
sp = rb_entry(node, struct btrfs_swapfile_pin, node);
|
|
|
|
if (ptr < sp->ptr)
|
|
|
|
node = node->rb_left;
|
|
|
|
else if (ptr > sp->ptr)
|
|
|
|
node = node->rb_right;
|
|
|
|
else
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->swapfile_pins_lock);
|
|
|
|
return node != NULL;
|
|
|
|
}
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
|
|
|
|
static int relocating_repair_kthread(void *data)
|
|
|
|
{
|
2022-03-31 18:34:08 +08:00
|
|
|
struct btrfs_block_group *cache = data;
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
struct btrfs_fs_info *fs_info = cache->fs_info;
|
|
|
|
u64 target;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
target = cache->start;
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
|
2022-02-18 12:14:19 +08:00
|
|
|
sb_start_write(fs_info->sb);
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
|
|
|
|
btrfs_info(fs_info,
|
|
|
|
"zoned: skip relocating block group %llu to repair: EBUSY",
|
|
|
|
target);
|
2022-02-18 12:14:19 +08:00
|
|
|
sb_end_write(fs_info->sb);
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_lock(&fs_info->reclaim_bgs_lock);
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
|
|
|
|
/* Ensure block group still exists */
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, target);
|
|
|
|
if (!cache)
|
|
|
|
goto out;
|
|
|
|
|
2022-07-16 03:45:24 +08:00
|
|
|
if (!test_bit(BLOCK_GROUP_FLAG_RELOCATING_REPAIR, &cache->runtime_flags))
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = btrfs_may_alloc_data_chunk(fs_info, target);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
btrfs_info(fs_info,
|
|
|
|
"zoned: relocating block group %llu to repair IO failure",
|
|
|
|
target);
|
|
|
|
ret = btrfs_relocate_chunk(fs_info, target);
|
|
|
|
|
|
|
|
out:
|
|
|
|
if (cache)
|
|
|
|
btrfs_put_block_group(cache);
|
2021-04-19 15:41:01 +08:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
btrfs_exclop_finish(fs_info);
|
2022-02-18 12:14:19 +08:00
|
|
|
sb_end_write(fs_info->sb);
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-12-07 22:28:36 +08:00
|
|
|
bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical)
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
{
|
|
|
|
struct btrfs_block_group *cache;
|
|
|
|
|
2021-12-07 22:28:36 +08:00
|
|
|
if (!btrfs_is_zoned(fs_info))
|
|
|
|
return false;
|
|
|
|
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
/* Do not attempt to repair in degraded state */
|
|
|
|
if (btrfs_test_opt(fs_info, DEGRADED))
|
2021-12-07 22:28:36 +08:00
|
|
|
return true;
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, logical);
|
|
|
|
if (!cache)
|
2021-12-07 22:28:36 +08:00
|
|
|
return true;
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
|
2022-07-16 03:45:24 +08:00
|
|
|
if (test_and_set_bit(BLOCK_GROUP_FLAG_RELOCATING_REPAIR, &cache->runtime_flags)) {
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2021-12-07 22:28:36 +08:00
|
|
|
return true;
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
kthread_run(relocating_repair_kthread, cache,
|
|
|
|
"btrfs-relocating-repair");
|
|
|
|
|
2021-12-07 22:28:36 +08:00
|
|
|
return true;
|
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-04 18:22:16 +08:00
|
|
|
}
|
2023-03-20 10:12:49 +08:00
|
|
|
|
|
|
|
static void map_raid56_repair_block(struct btrfs_io_context *bioc,
|
|
|
|
struct btrfs_io_stripe *smap,
|
|
|
|
u64 logical)
|
|
|
|
{
|
|
|
|
int data_stripes = nr_bioc_data_stripes(bioc);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < data_stripes; i++) {
|
|
|
|
u64 stripe_start = bioc->full_stripe_logical +
|
2023-06-22 14:42:40 +08:00
|
|
|
btrfs_stripe_nr_to_offset(i);
|
2023-03-20 10:12:49 +08:00
|
|
|
|
|
|
|
if (logical >= stripe_start &&
|
|
|
|
logical < stripe_start + BTRFS_STRIPE_LEN)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ASSERT(i < data_stripes);
|
|
|
|
smap->dev = bioc->stripes[i].dev;
|
|
|
|
smap->physical = bioc->stripes[i].physical +
|
|
|
|
((logical - bioc->full_stripe_logical) &
|
|
|
|
BTRFS_STRIPE_LEN_MASK);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Map a repair write into a single device.
|
|
|
|
*
|
|
|
|
* A repair write is triggered by read time repair or scrub, which would only
|
|
|
|
* update the contents of a single device.
|
|
|
|
* Not update any other mirrors nor go through RMW path.
|
|
|
|
*
|
|
|
|
* Callers should ensure:
|
|
|
|
*
|
|
|
|
* - Call btrfs_bio_counter_inc_blocked() first
|
|
|
|
* - The range does not cross stripe boundary
|
|
|
|
* - Has a valid @mirror_num passed in.
|
|
|
|
*/
|
|
|
|
int btrfs_map_repair_block(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_io_stripe *smap, u64 logical,
|
|
|
|
u32 length, int mirror_num)
|
|
|
|
{
|
|
|
|
struct btrfs_io_context *bioc = NULL;
|
|
|
|
u64 map_length = length;
|
|
|
|
int mirror_ret = mirror_num;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ASSERT(mirror_num > 0);
|
|
|
|
|
2023-05-31 12:17:37 +08:00
|
|
|
ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical, &map_length,
|
2023-09-17 18:06:21 +08:00
|
|
|
&bioc, smap, &mirror_ret);
|
2023-03-20 10:12:49 +08:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* The map range should not cross stripe boundary. */
|
|
|
|
ASSERT(map_length >= length);
|
|
|
|
|
|
|
|
/* Already mapped to single stripe. */
|
|
|
|
if (!bioc)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Map the RAID56 multi-stripe writes to a single one. */
|
|
|
|
if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
|
|
|
|
map_raid56_repair_block(bioc, smap, logical);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(mirror_num <= bioc->num_stripes);
|
|
|
|
smap->dev = bioc->stripes[mirror_num - 1].dev;
|
|
|
|
smap->physical = bioc->stripes[mirror_num - 1].physical;
|
|
|
|
out:
|
|
|
|
btrfs_put_bioc(bioc);
|
|
|
|
ASSERT(smap->dev);
|
|
|
|
return 0;
|
|
|
|
}
|