2018-04-04 01:16:55 +08:00
/* SPDX-License-Identifier: GPL-2.0 */
2009-04-03 21:47:43 +08:00
/*
 * Copyright (C) 2009 Oracle. All rights reserved.
 */

2018-04-04 01:16:55 +08:00
#ifndef BTRFS_FREE_SPACE_CACHE_H
#define BTRFS_FREE_SPACE_CACHE_H
2009-04-03 21:47:43 +08:00

2019-12-14 08:22:12 +08:00
/*
 * This is the trim state of an extent or bitmap.
2019-12-14 08:22:13 +08:00
 *
 * BTRFS_TRIM_STATE_TRIMMING is special and used to maintain the state of a
 * bitmap as we may need several trims to fully trim a single bitmap entry.
 * This is reset should any free space other than trimmed space be added to the
 * bitmap.
2019-12-14 08:22:12 +08:00
 */
enum btrfs_trim_state {
	BTRFS_TRIM_STATE_UNTRIMMED,
	BTRFS_TRIM_STATE_TRIMMED,
2019-12-14 08:22:13 +08:00
	BTRFS_TRIM_STATE_TRIMMING,
2019-12-14 08:22:12 +08:00
};

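The reset rule in the trim state comment can be sketched as a small state function. This is only an illustration of the comment's wording, not code taken from btrfs: the helper name `merged_bitmap_trim_state` is hypothetical, and the assertion that newly added space is never in the TRIMMING state is an assumption drawn from the comment saying that state exists to track bitmaps.

```c
#include <assert.h>

enum btrfs_trim_state {
	BTRFS_TRIM_STATE_UNTRIMMED,
	BTRFS_TRIM_STATE_TRIMMED,
	BTRFS_TRIM_STATE_TRIMMING,
};

/*
 * Hypothetical helper: the state a bitmap entry carries after free space
 * with state `added` has been merged into it.
 */
static enum btrfs_trim_state
merged_bitmap_trim_state(enum btrfs_trim_state bitmap_state,
			 enum btrfs_trim_state added)
{
	/* Assumption: TRIMMING describes a bitmap, never the space being added. */
	assert(added != BTRFS_TRIM_STATE_TRIMMING);

	/* Anything other than trimmed space resets the bitmap to untrimmed. */
	if (added != BTRFS_TRIM_STATE_TRIMMED)
		return BTRFS_TRIM_STATE_UNTRIMMED;

	/* Adding trimmed space leaves the bitmap's state as it was. */
	return bitmap_state;
}
```

Under this reading, a bitmap that is partway through trimming keeps its TRIMMING state only as long as everything added to it is already trimmed.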
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unwieldy on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space that's less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
struct btrfs_free_space {
	struct rb_node offset_index;
btrfs: index free space entries on size
Currently we index free space on offset only, because usually we have a
hint from the allocator that we want to honor for locality reasons.
However if we fail to use this hint we have to go back to a brute force
search through the free space entries to find a large enough extent.
With sufficiently fragmented free space this becomes quite expensive, as
we have to linearly search all of the free space entries to find if we
have a part that's long enough.
To fix this add a cached rb tree to index based on free space entry
bytes. This will allow us to quickly look up the largest chunk in the
free space tree for this block group, and stop searching once we've
found an entry that is too small to satisfy our allocation. We simply
choose to use this tree if we're searching from the beginning of the
block group, as we know we do not care about locality at that point.
I wrote an allocator test that creates a 10TiB ram backed null block
device and then fallocates random files until the file system is full.
I then go through and delete all of the odd files. Then I spawn 8
threads that fallocate 64MiB files (1/2 our extent size cap) until the
file system is full again. I use bcc's funclatency to measure the
latency of find_free_extent. The baseline results are
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 10356 |**** |
512 -> 1023 : 58242 |************************* |
1024 -> 2047 : 74418 |******************************** |
2048 -> 4095 : 90393 |****************************************|
4096 -> 8191 : 79119 |*********************************** |
8192 -> 16383 : 35614 |*************** |
16384 -> 32767 : 13418 |***** |
32768 -> 65535 : 12811 |***** |
65536 -> 131071 : 17090 |******* |
131072 -> 262143 : 26465 |*********** |
262144 -> 524287 : 40179 |***************** |
524288 -> 1048575 : 55469 |************************ |
1048576 -> 2097151 : 48807 |********************* |
2097152 -> 4194303 : 26744 |*********** |
4194304 -> 8388607 : 35351 |*************** |
8388608 -> 16777215 : 13918 |****** |
16777216 -> 33554431 : 21 | |
avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690
And the patch results are
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 6883 |** |
512 -> 1023 : 54346 |********************* |
1024 -> 2047 : 79170 |******************************** |
2048 -> 4095 : 98890 |****************************************|
4096 -> 8191 : 81911 |********************************* |
8192 -> 16383 : 27075 |********** |
16384 -> 32767 : 14668 |***** |
32768 -> 65535 : 13251 |***** |
65536 -> 131071 : 15340 |****** |
131072 -> 262143 : 26715 |********** |
262144 -> 524287 : 43274 |***************** |
524288 -> 1048575 : 53870 |********************* |
1048576 -> 2097151 : 55368 |********************** |
2097152 -> 4194303 : 41036 |**************** |
4194304 -> 8388607 : 24927 |********** |
8388608 -> 16777215 : 33 | |
16777216 -> 33554431 : 9 | |
avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042
There's a little variation in the amount of calls done because of timing
of the threads with metadata requirements, but the avg, total, and
count are relatively consistent between runs (usually within 2-5% of
each other). As you can see here we have around a 30% decrease in
average latency with a 30% decrease in overall time spent in
find_free_extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-19 05:33:15 +08:00
	struct rb_node bytes_index;
2009-07-14 09:29:25 +08:00
	u64 offset;
	u64 bytes;
2015-10-03 04:09:42 +08:00
	u64 max_extent_size;
2009-07-14 09:29:25 +08:00
	unsigned long *bitmap;
	struct list_head list;
2019-12-14 08:22:12 +08:00
	enum btrfs_trim_state trim_state;
2019-12-14 08:22:20 +08:00
	s32 bitmap_extents;
2009-07-14 09:29:25 +08:00
};

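The hybrid extents+bitmap scheme described in the commit message for struct btrfs_free_space can be checked with some rough arithmetic. The sketch below is not the kernel's code: every function name is hypothetical, the 4 KiB page and sector sizes and the 64-byte entry size used in the example are assumptions, and `should_use_bitmap` follows a literal reading of the commit message rather than the current `use_bitmap` callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Disk bytes one bitmap can describe: one bit per sector, 8 bits per byte. */
static uint64_t bitmap_bytes_covered(uint64_t page_size, uint64_t sectorsize)
{
	return page_size * 8 * sectorsize;
}

/* Bitmaps needed to cover a whole block group, rounding up. */
static uint64_t bitmaps_for_block_group(uint64_t bg_size, uint64_t page_size,
					uint64_t sectorsize)
{
	uint64_t per_bitmap = bitmap_bytes_covered(page_size, sectorsize);

	assert(per_bitmap > 0);
	return (bg_size + per_bitmap - 1) / per_bitmap;
}

/* Extent entries that fit in a RAM budget (16k in the commit message). */
static uint64_t extent_threshold(uint64_t budget_bytes, uint64_t entry_size)
{
	assert(entry_size > 0);
	return budget_bytes / entry_size;
}

/*
 * Literal reading of the commit message: use a bitmap once the extent
 * threshold is reached, or when the free range is under 4 sectors.
 */
static bool should_use_bitmap(uint64_t bytes, uint64_t sectorsize,
			      uint64_t free_extents, uint64_t thresh)
{
	return free_extents >= thresh || bytes < 4 * sectorsize;
}
```

With 4 KiB pages and 4 KiB sectors, one bitmap describes 128 MiB, so eight bitmaps (32k of bitmap memory) would describe a 1 GiB block group. That is in line with the 32k figure quoted in the commit message, although the message does not say the number was derived this way.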
2019-12-14 08:22:12 +08:00
static inline bool btrfs_free_space_trimmed(struct btrfs_free_space *info)
{
	return (info->trim_state == BTRFS_TRIM_STATE_TRIMMED);
}

2019-12-14 08:22:13 +08:00
static inline bool btrfs_free_space_trimming_bitmap(
					    struct btrfs_free_space *info)
{
	return (info->trim_state == BTRFS_TRIM_STATE_TRIMMING);
}

2022-09-14 23:06:32 +08:00
/*
 * Deltas are an effective way to populate global statistics. Give macro names
 * to make it clear what we're doing. An example is discardable_extents in
 * btrfs_free_space_ctl.
 */
2022-10-27 03:08:16 +08:00
enum {
	BTRFS_STAT_CURR,
	BTRFS_STAT_PREV,
	BTRFS_STAT_NR_ENTRIES,
};
2022-09-14 23:06:32 +08:00

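The delta idea in the comment on the BTRFS_STAT_* indices can be sketched as follows. This is a minimal userspace illustration, not the kernel's implementation: `struct local_stat` and `push_delta` are hypothetical names, and a plain `int64_t` stands in for whatever global counter the real code updates.

```c
#include <assert.h>
#include <stdint.h>

enum {
	BTRFS_STAT_CURR,
	BTRFS_STAT_PREV,
	BTRFS_STAT_NR_ENTRIES,
};

/* Hypothetical stand-in for a per-block-group counter pair. */
struct local_stat {
	int32_t v[BTRFS_STAT_NR_ENTRIES];
};

/*
 * Fold only the change since the last update into the global total, then
 * remember the value that has now been accounted for.
 */
static void push_delta(struct local_stat *s, int64_t *global_total)
{
	int32_t delta;

	assert(s && global_total);
	delta = s->v[BTRFS_STAT_CURR] - s->v[BTRFS_STAT_PREV];
	if (delta) {
		*global_total += delta;
		s->v[BTRFS_STAT_PREV] = s->v[BTRFS_STAT_CURR];
	}
}
```

Keeping the previous value next to the current one means the global total never has to be recomputed from scratch; each update costs one subtraction and one addition.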
2011-03-29 13:46:06 +08:00
struct btrfs_free_space_ctl {
	spinlock_t tree_lock;
	struct rb_root free_space_offset;
2021-11-19 05:33:15 +08:00
	struct rb_root_cached free_space_bytes;
2011-03-29 13:46:06 +08:00
	u64 free_space;
	int extents_thresh;
	int free_extents;
	int total_bitmaps;
	int unit;
	u64 start;
2019-12-14 08:22:20 +08:00
	s32 discardable_extents[BTRFS_STAT_NR_ENTRIES];
2019-12-14 08:22:21 +08:00
	s64 discardable_bytes[BTRFS_STAT_NR_ENTRIES];
2015-11-19 18:42:28 +08:00
	const struct btrfs_free_space_op *op;
2021-11-23 20:44:22 +08:00
	struct btrfs_block_group *block_group;
Btrfs: fix race between writing free space cache and trimming
Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, performing the trim/discard
and then making the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that if we unmount (or crash/reboot)
after that transaction commit, and before another transaction starts/commits
after the discard finishes, then on the next mount we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:
*** fsck.btrfs output ***
checking extents
checking free space cache
There is no free space entry for 521764864-521781248
There is no free space entry for 521764864-1103101952
cache appears valid but isnt 29360128
Checking filesystem on /dev/sdc
UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
found 1235681286 bytes used err is -22
(...)
Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The latter case results in the following crash:
[55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
[55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
[55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
[55650.807166] RIP: 0010:[<ffffffffa037cf45>] [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.807514] RSP: 0018:ffff88007153bc30 EFLAGS: 00010246
[55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
[55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
[55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
[55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
[55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
[55650.808017] FS: 0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
[55650.808017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
[55650.808017] Stack:
[55650.808017] ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
[55650.808017] ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
[55650.808017] ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
[55650.808017] Call Trace:
[55650.808017] [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
[55650.808017] [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
[55650.808017] [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
[55650.808017] [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
[55650.808017] [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
[55650.808017] [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
[55650.808017] [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
[55650.808017] [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[55650.808017] [<ffffffff8105966b>] kthread+0xb7/0xbf
[55650.808017] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[55650.808017] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
[55650.808017] RIP [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.808017] RSP <ffff88007153bc30>
[55650.815725] ---[ end trace 1c032e96b149ff86 ]---
Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 01:04:09 +08:00
	struct mutex cache_writeout_mutex;
	struct list_head trimming_ranges;
2011-03-29 13:46:06 +08:00
};

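The size-indexed search that the free_space_bytes tree enables (largest entry first, stop at the first entry that is too small) can be sketched with a sorted array standing in for the cached rb tree. Everything here is illustrative: `struct free_entry`, `find_by_size`, and the alignment handling are assumptions made for the example, not the kernel's data structures or lookup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified free space entry. */
struct free_entry {
	uint64_t offset;
	uint64_t bytes;
};

/*
 * `by_size` must be sorted by bytes, largest first, which is the order a
 * walk from the leftmost node of a size-indexed tree would produce.
 * Returns the index of the first entry that can hold `want` bytes starting
 * at an `align`-aligned offset, or -1 if none can.
 */
static long find_by_size(const struct free_entry *by_size, size_t n,
			 uint64_t want, uint64_t align)
{
	assert(align > 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t pad;

		/* Sorted largest first: nothing after this entry can fit. */
		if (by_size[i].bytes < want)
			break;

		/* Bytes lost rounding the start up to the alignment. */
		pad = (align - by_size[i].offset % align) % align;
		if (pad <= by_size[i].bytes && by_size[i].bytes - pad >= want)
			return (long)i;
	}
	return -1;
}
```

The early `break` is the point of the index: an offset-ordered walk has to visit every entry before it can conclude that nothing is large enough, while a size-ordered walk can give up at the first entry smaller than the request.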
struct btrfs_free_space_op {
	bool (*use_bitmap)(struct btrfs_free_space_ctl *ctl,
			   struct btrfs_free_space *info);
};

2019-08-22 01:57:04 +08:00
struct btrfs_io_ctl {
	void *cur, *orig;
	struct page *page;
	struct page **pages;
	struct btrfs_fs_info *fs_info;
	struct inode *inode;
	unsigned long size;
	int index;
	int num_pages;
	int entries;
	int bitmaps;
};
2015-04-05 08:14:42 +08:00

2022-09-14 23:06:39 +08:00
int __init btrfs_free_space_init(void);
void __cold btrfs_free_space_exit(void);
2019-10-30 02:20:18 +08:00
struct inode *lookup_free_space_inode(struct btrfs_block_group *block_group,
2019-03-20 20:40:19 +08:00
		struct btrfs_path *path);
2019-03-20 20:42:57 +08:00
int create_free_space_inode(struct btrfs_trans_handle *trans,
2019-10-30 02:20:18 +08:00
			    struct btrfs_block_group *block_group,
2010-06-22 02:48:16 +08:00
			    struct btrfs_path *path);
2020-11-19 07:06:25 +08:00
int btrfs_remove_free_space_inode(struct btrfs_trans_handle *trans,
				  struct inode *inode,
				  struct btrfs_block_group *block_group);
2010-07-03 00:14:14 +08:00

2017-02-16 05:28:30 +08:00
int btrfs_truncate_free_space_cache(struct btrfs_trans_handle *trans,
2019-10-30 02:20:18 +08:00
				    struct btrfs_block_group *block_group,
2010-06-22 02:48:16 +08:00
				    struct inode *inode);
2019-10-30 02:20:18 +08:00
int load_free_space_cache(struct btrfs_block_group *block_group);
2016-09-10 00:09:35 +08:00
int btrfs_wait_cache_io(struct btrfs_trans_handle *trans,
2019-10-30 02:20:18 +08:00
			struct btrfs_block_group *block_group,
2016-09-10 00:09:35 +08:00
			struct btrfs_path *path);
2019-03-20 20:51:56 +08:00
int btrfs_write_out_cache(struct btrfs_trans_handle *trans,
2019-10-30 02:20:18 +08:00
			  struct btrfs_block_group *block_group,
2010-07-03 00:14:14 +08:00
			  struct btrfs_path *path);
2011-04-20 10:33:24 +08:00

2020-10-23 21:58:08 +08:00
void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group,
			       struct btrfs_free_space_ctl *ctl);
2021-11-23 20:44:21 +08:00
int __btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr,
			   u64 size, enum btrfs_trim_state trim_state);
2019-10-30 02:20:18 +08:00
int btrfs_add_free_space(struct btrfs_block_group *block_group,
2019-06-21 03:37:43 +08:00
			 u64 bytenr, u64 size);
2021-02-04 18:21:52 +08:00
int btrfs_add_free_space_unused(struct btrfs_block_group *block_group,
				u64 bytenr, u64 size);
2019-12-14 08:22:14 +08:00
int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
				       u64 bytenr, u64 size);
2019-10-30 02:20:18 +08:00
int btrfs_remove_free_space(struct btrfs_block_group *block_group,
2009-04-03 21:47:43 +08:00
			    u64 bytenr, u64 size);
2019-10-30 02:20:18 +08:00
void btrfs_remove_free_space_cache(struct btrfs_block_group *block_group);
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unused_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-14 08:22:15 +08:00
bool btrfs_is_free_space_trimmed(struct btrfs_block_group *block_group);
2019-10-30 02:20:18 +08:00
u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group,
2013-09-09 13:19:42 +08:00
			       u64 offset, u64 bytes, u64 empty_size,
			       u64 *max_extent_size);
2019-10-30 02:20:18 +08:00
void btrfs_dump_free_space(struct btrfs_block_group *block_group,
2009-04-03 21:47:43 +08:00
			   u64 bytes);
2019-10-30 02:20:18 +08:00
int btrfs_find_space_cluster(struct btrfs_block_group *block_group,
2009-04-03 21:47:43 +08:00
			     struct btrfs_free_cluster *cluster,
			     u64 offset, u64 bytes, u64 empty_size);
void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster);
2019-10-30 02:20:18 +08:00
u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
2009-04-03 21:47:43 +08:00
			     struct btrfs_free_cluster *cluster, u64 bytes,
2013-09-09 13:19:42 +08:00
			     u64 min_start, u64 *max_extent_size);
2020-06-03 18:10:18 +08:00
void btrfs_return_cluster_to_free_space(
2019-10-30 02:20:18 +08:00
			       struct btrfs_block_group *block_group,
2009-04-03 21:47:43 +08:00
			       struct btrfs_free_cluster *cluster);
2019-10-30 02:20:18 +08:00
int btrfs_trim_block_group(struct btrfs_block_group *block_group,
2011-03-24 18:24:28 +08:00
			   u64 *trimmed, u64 start, u64 end, u64 minlen);
2019-12-14 08:22:16 +08:00
int btrfs_trim_block_group_extents(struct btrfs_block_group *block_group,
				   u64 *trimmed, u64 start, u64 end, u64 minlen,
				   bool async);
int btrfs_trim_block_group_bitmaps(struct btrfs_block_group *block_group,
				   u64 *trimmed, u64 start, u64 end, u64 minlen,
2020-01-03 05:26:39 +08:00
				   u64 maxlen, bool async);
2013-03-15 21:47:08 +08:00

btrfs: keep sb cache_generation consistent with space_cache
When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.
In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.
We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).
Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-19 07:06:22 +08:00
bool btrfs_free_space_cache_v1_active(struct btrfs_fs_info *fs_info);
int btrfs_set_free_space_cache_v1_active(struct btrfs_fs_info *fs_info, bool active);
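The invariant stated in the cache_generation commit message (cache_generation > 0 iff space cache v1 is in use) implies a simple rule for when a mount has to commit a transaction. The sketch below only models that rule; `enum cache_mode`, `v1_active`, and `mount_needs_commit` are hypothetical names and do not correspond to the two functions declared above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three space cache modes in the commit message. */
enum cache_mode {
	CACHE_NONE,	/* nospace_cache */
	CACHE_V1,	/* space_cache=v1 */
	CACHE_V2,	/* space_cache=v2 */
};

/* The invariant: cache_generation > 0 iff space cache v1 is in use. */
static bool v1_active(uint64_t cache_generation)
{
	return cache_generation > 0;
}

/*
 * A mount needs a transaction commit only when it changes whether v1 is in
 * use: v1->nospace_cache, v1->v2, nospace_cache->v1 and v2->v1.
 */
static bool mount_needs_commit(uint64_t cache_generation,
			       enum cache_mode requested)
{
	return v1_active(cache_generation) != (requested == CACHE_V1);
}
```

Switching between nospace_cache and v2 leaves the v1 state untouched, so under this model neither direction forces a commit on its own.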
2016-05-20 09:18:45 +08:00
/* Support functions for running our sanity tests */
2013-08-15 03:05:12 +08:00
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
2019-10-30 02:20:18 +08:00
int test_add_free_space_entry(struct btrfs_block_group *cache,
2013-08-15 03:05:12 +08:00
			      u64 offset, u64 bytes, bool bitmap);
2019-10-30 02:20:18 +08:00
int test_check_exists(struct btrfs_block_group *cache, u64 offset, u64 bytes);
2013-08-15 03:05:12 +08:00
#endif
2013-03-15 21:47:08 +08:00

2009-04-03 21:47:43 +08:00
#endif