2018-04-04 01:23:33 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2007-06-12 21:07:21 +08:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2007 Oracle. All rights reserved.
|
|
|
|
*/
|
|
|
|
|
2008-02-21 01:07:25 +08:00
|
|
|
#include <linux/bio.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
2008-02-21 01:07:25 +08:00
|
|
|
#include <linux/pagemap.h>
|
|
|
|
#include <linux/highmem.h>
|
2019-04-01 16:29:58 +08:00
|
|
|
#include <linux/sched/mm.h>
|
2019-06-03 22:58:57 +08:00
|
|
|
#include <crypto/hash.h>
|
2022-10-19 22:50:49 +08:00
|
|
|
#include "messages.h"
|
2007-03-16 07:03:33 +08:00
|
|
|
#include "ctree.h"
|
2007-03-27 04:00:06 +08:00
|
|
|
#include "disk-io.h"
|
2007-03-21 02:38:32 +08:00
|
|
|
#include "transaction.h"
|
2022-11-15 17:44:05 +08:00
|
|
|
#include "bio.h"
|
2016-03-10 17:26:59 +08:00
|
|
|
#include "compression.h"
|
2022-10-19 22:50:47 +08:00
|
|
|
#include "fs.h"
|
2022-10-19 22:51:00 +08:00
|
|
|
#include "accessors.h"
|
2022-10-27 03:08:27 +08:00
|
|
|
#include "file-item.h"
|
2007-03-16 07:03:33 +08:00
|
|
|
|
2016-08-04 05:05:46 +08:00
|
|
|
#define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
|
|
|
|
sizeof(struct btrfs_item) * 2) / \
|
|
|
|
size) - 1))
|
2009-01-07 00:42:00 +08:00
|
|
|
|
2012-09-21 04:33:00 +08:00
|
|
|
#define MAX_CSUM_ITEMS(r, size) (min_t(u32, __MAX_CSUM_ITEMS(r, size), \
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
PAGE_SIZE))
|
2012-02-01 09:19:02 +08:00
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* Set inode's size according to filesystem options.
|
2021-01-22 17:57:54 +08:00
|
|
|
*
|
|
|
|
* @inode: inode we want to update the disk_i_size for
|
|
|
|
* @new_i_size: i_size we want to set to, 0 if we use i_size
|
2020-01-17 22:02:21 +08:00
|
|
|
*
|
|
|
|
* With NO_HOLES set this simply sets the disk_is_size to whatever i_size_read()
|
|
|
|
* returns as it is perfectly fine with a file that has holes without hole file
|
|
|
|
* extent items.
|
|
|
|
*
|
|
|
|
* However without NO_HOLES we need to only return the area that is contiguous
|
|
|
|
* from the 0 offset of the file. Otherwise we could end up adjust i_size up
|
|
|
|
* to an extent that has a gap in between.
|
|
|
|
*
|
|
|
|
* Finally new_i_size should only be set in the case of truncate where we're not
|
|
|
|
* ready to use i_size_read() as the limiter yet.
|
|
|
|
*/
|
2020-11-02 22:48:53 +08:00
|
|
|
void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_size)
|
2020-01-17 22:02:21 +08:00
|
|
|
{
|
|
|
|
u64 start, end, i_size;
|
|
|
|
int ret;
|
|
|
|
|
btrfs: fix encoded write i_size corruption with no-holes
We have observed a btrfs filesystem corruption on workloads using
no-holes and encoded writes via send stream v2. The symptom is that a
file appears to be truncated to the end of its last aligned extent, even
though the final unaligned extent and even the file extent and otherwise
correctly updated inode item have been written.
So if we were writing out a 1MiB+X file via 8 128K extents and one
extent of length X, i_size would be set to 1MiB, but the ninth extent,
nbyte, etc. would all appear correct otherwise.
The source of the race is a narrow (one line of code) window in which a
no-holes fs has read in an updated i_size, but has not yet set a shared
disk_i_size variable to write. Therefore, if two ordered extents run in
parallel (par for the course for receive workloads), the following
sequence can play out: (following "threads" a bit loosely, since there
are callbacks involved for endio but extra threads aren't needed to
cause the issue)
ENC-WR1 (second to last) ENC-WR2 (last)
------- -------
btrfs_do_encoded_write
set i_size = 1M
submit bio B1 ending at 1M
endio B1
btrfs_inode_safe_disk_i_size_write
local i_size = 1M
falls off a cliff for some reason
btrfs_do_encoded_write
set i_size = 1M+X
submit bio B2 ending at 1M+X
endio B2
btrfs_inode_safe_disk_i_size_write
local i_size = 1M+X
disk_i_size = 1M+X
disk_i_size = 1M
btrfs_delayed_update_inode
btrfs_delayed_update_inode
And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
writes respect i_size and present a corrupted file missing its last
extents.
Fix this by holding the inode lock in the no-holes case so that a thread
can't sneak in a write to disk_i_size that gets overwritten with an out
of date i_size.
Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-29 05:02:11 +08:00
|
|
|
spin_lock(&inode->lock);
|
2020-11-02 22:48:53 +08:00
|
|
|
i_size = new_i_size ?: i_size_read(&inode->vfs_inode);
|
2024-04-30 23:52:27 +08:00
|
|
|
if (!inode->file_extent_tree) {
|
2020-11-02 22:48:53 +08:00
|
|
|
inode->disk_i_size = i_size;
|
btrfs: fix encoded write i_size corruption with no-holes
We have observed a btrfs filesystem corruption on workloads using
no-holes and encoded writes via send stream v2. The symptom is that a
file appears to be truncated to the end of its last aligned extent, even
though the final unaligned extent and even the file extent and otherwise
correctly updated inode item have been written.
So if we were writing out a 1MiB+X file via 8 128K extents and one
extent of length X, i_size would be set to 1MiB, but the ninth extent,
nbyte, etc. would all appear correct otherwise.
The source of the race is a narrow (one line of code) window in which a
no-holes fs has read in an updated i_size, but has not yet set a shared
disk_i_size variable to write. Therefore, if two ordered extents run in
parallel (par for the course for receive workloads), the following
sequence can play out: (following "threads" a bit loosely, since there
are callbacks involved for endio but extra threads aren't needed to
cause the issue)
ENC-WR1 (second to last) ENC-WR2 (last)
------- -------
btrfs_do_encoded_write
set i_size = 1M
submit bio B1 ending at 1M
endio B1
btrfs_inode_safe_disk_i_size_write
local i_size = 1M
falls off a cliff for some reason
btrfs_do_encoded_write
set i_size = 1M+X
submit bio B2 ending at 1M+X
endio B2
btrfs_inode_safe_disk_i_size_write
local i_size = 1M+X
disk_i_size = 1M+X
disk_i_size = 1M
btrfs_delayed_update_inode
btrfs_delayed_update_inode
And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
writes respect i_size and present a corrupted file missing its last
extents.
Fix this by holding the inode lock in the no-holes case so that a thread
can't sneak in a write to disk_i_size that gets overwritten with an out
of date i_size.
Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-29 05:02:11 +08:00
|
|
|
goto out_unlock;
|
2020-01-17 22:02:21 +08:00
|
|
|
}
|
|
|
|
|
2023-12-01 06:42:01 +08:00
|
|
|
ret = find_contiguous_extent_bit(inode->file_extent_tree, 0, &start,
|
2020-11-02 22:48:53 +08:00
|
|
|
&end, EXTENT_DIRTY);
|
2020-01-17 22:02:21 +08:00
|
|
|
if (!ret && start == 0)
|
|
|
|
i_size = min(i_size, end + 1);
|
|
|
|
else
|
|
|
|
i_size = 0;
|
2020-11-02 22:48:53 +08:00
|
|
|
inode->disk_i_size = i_size;
|
btrfs: fix encoded write i_size corruption with no-holes
We have observed a btrfs filesystem corruption on workloads using
no-holes and encoded writes via send stream v2. The symptom is that a
file appears to be truncated to the end of its last aligned extent, even
though the final unaligned extent and even the file extent and otherwise
correctly updated inode item have been written.
So if we were writing out a 1MiB+X file via 8 128K extents and one
extent of length X, i_size would be set to 1MiB, but the ninth extent,
nbyte, etc. would all appear correct otherwise.
The source of the race is a narrow (one line of code) window in which a
no-holes fs has read in an updated i_size, but has not yet set a shared
disk_i_size variable to write. Therefore, if two ordered extents run in
parallel (par for the course for receive workloads), the following
sequence can play out: (following "threads" a bit loosely, since there
are callbacks involved for endio but extra threads aren't needed to
cause the issue)
ENC-WR1 (second to last) ENC-WR2 (last)
------- -------
btrfs_do_encoded_write
set i_size = 1M
submit bio B1 ending at 1M
endio B1
btrfs_inode_safe_disk_i_size_write
local i_size = 1M
falls off a cliff for some reason
btrfs_do_encoded_write
set i_size = 1M+X
submit bio B2 ending at 1M+X
endio B2
btrfs_inode_safe_disk_i_size_write
local i_size = 1M+X
disk_i_size = 1M+X
disk_i_size = 1M
btrfs_delayed_update_inode
btrfs_delayed_update_inode
And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
writes respect i_size and present a corrupted file missing its last
extents.
Fix this by holding the inode lock in the no-holes case so that a thread
can't sneak in a write to disk_i_size that gets overwritten with an out
of date i_size.
Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-29 05:02:11 +08:00
|
|
|
out_unlock:
|
2020-11-02 22:48:53 +08:00
|
|
|
spin_unlock(&inode->lock);
|
2020-01-17 22:02:21 +08:00
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* Mark range within a file as having a new extent inserted.
|
2021-01-22 17:57:54 +08:00
|
|
|
*
|
|
|
|
* @inode: inode being modified
|
|
|
|
* @start: start file offset of the file extent we've inserted
|
|
|
|
* @len: logical length of the file extent item
|
2020-01-17 22:02:21 +08:00
|
|
|
*
|
|
|
|
* Call when we are inserting a new file extent where there was none before.
|
|
|
|
* Does not need to call this in the case where we're replacing an existing file
|
|
|
|
* extent, however if not sure it's fine to call this multiple times.
|
|
|
|
*
|
|
|
|
* The start and len must match the file extent item, so thus must be sectorsize
|
|
|
|
* aligned.
|
|
|
|
*/
|
|
|
|
int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
|
|
|
|
u64 len)
|
|
|
|
{
|
2024-04-30 23:52:27 +08:00
|
|
|
if (!inode->file_extent_tree)
|
|
|
|
return 0;
|
|
|
|
|
2020-01-17 22:02:21 +08:00
|
|
|
if (len == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize));
|
|
|
|
|
2023-12-01 06:42:01 +08:00
|
|
|
return set_extent_bit(inode->file_extent_tree, start, start + len - 1,
|
2023-05-25 07:04:39 +08:00
|
|
|
EXTENT_DIRTY, NULL);
|
2020-01-17 22:02:21 +08:00
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* Mark an inode range as not having a backing extent.
|
2021-01-22 17:57:54 +08:00
|
|
|
*
|
|
|
|
* @inode: inode being modified
|
|
|
|
* @start: start file offset of the file extent we've inserted
|
|
|
|
* @len: logical length of the file extent item
|
2020-01-17 22:02:21 +08:00
|
|
|
*
|
|
|
|
* Called when we drop a file extent, for example when we truncate. Doesn't
|
|
|
|
* need to be called for cases where we're replacing a file extent, like when
|
|
|
|
* we've COWed a file extent.
|
|
|
|
*
|
|
|
|
* The start and len must match the file extent item, so thus must be sectorsize
|
|
|
|
* aligned.
|
|
|
|
*/
|
|
|
|
int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
|
|
|
|
u64 len)
|
|
|
|
{
|
2024-04-30 23:52:27 +08:00
|
|
|
if (!inode->file_extent_tree)
|
|
|
|
return 0;
|
|
|
|
|
2020-01-17 22:02:21 +08:00
|
|
|
if (len == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize) ||
|
|
|
|
len == (u64)-1);
|
|
|
|
|
2023-12-01 06:42:01 +08:00
|
|
|
return clear_extent_bit(inode->file_extent_tree, start,
|
2022-09-10 05:53:47 +08:00
|
|
|
start + len - 1, EXTENT_DIRTY, NULL);
|
2020-01-17 22:02:21 +08:00
|
|
|
}
|
|
|
|
|
2022-11-14 08:26:31 +08:00
|
|
|
static size_t bytes_to_csum_size(const struct btrfs_fs_info *fs_info, u32 bytes)
|
2019-05-22 16:19:01 +08:00
|
|
|
{
|
2022-11-14 08:26:31 +08:00
|
|
|
ASSERT(IS_ALIGNED(bytes, fs_info->sectorsize));
|
2019-05-22 16:19:01 +08:00
|
|
|
|
2022-11-14 08:26:31 +08:00
|
|
|
return (bytes >> fs_info->sectorsize_bits) * fs_info->csum_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
static size_t csum_size_to_bytes(const struct btrfs_fs_info *fs_info, u32 csum_size)
|
|
|
|
{
|
|
|
|
ASSERT(IS_ALIGNED(csum_size, fs_info->csum_size));
|
|
|
|
|
|
|
|
return (csum_size / fs_info->csum_size) << fs_info->sectorsize_bits;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline u32 max_ordered_sum_bytes(const struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
u32 max_csum_size = round_down(PAGE_SIZE - sizeof(struct btrfs_ordered_sum),
|
|
|
|
fs_info->csum_size);
|
|
|
|
|
|
|
|
return csum_size_to_bytes(fs_info, max_csum_size);
|
2019-05-22 16:19:01 +08:00
|
|
|
}
|
2009-01-07 00:42:00 +08:00
|
|
|
|
2022-09-15 07:04:47 +08:00
|
|
|
/*
|
|
|
|
* Calculate the total size needed to allocate for an ordered sum structure
|
|
|
|
* spanning @bytes in the file.
|
|
|
|
*/
|
|
|
|
static int btrfs_ordered_sum_size(struct btrfs_fs_info *fs_info, unsigned long bytes)
|
|
|
|
{
|
2022-11-14 08:26:31 +08:00
|
|
|
return sizeof(struct btrfs_ordered_sum) + bytes_to_csum_size(fs_info, bytes);
|
2022-09-15 07:04:47 +08:00
|
|
|
}
|
|
|
|
|
2022-07-24 06:25:29 +08:00
|
|
|
int btrfs_insert_hole_extent(struct btrfs_trans_handle *trans,
|
2008-05-03 02:43:14 +08:00
|
|
|
struct btrfs_root *root,
|
2022-07-24 06:25:29 +08:00
|
|
|
u64 objectid, u64 pos, u64 num_bytes)
|
2007-03-21 02:38:32 +08:00
|
|
|
{
|
2007-03-27 04:00:06 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_file_extent_item *item;
|
|
|
|
struct btrfs_key file_key;
|
2007-04-02 23:20:42 +08:00
|
|
|
struct btrfs_path *path;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *leaf;
|
2007-03-27 04:00:06 +08:00
|
|
|
|
2007-04-02 23:20:42 +08:00
|
|
|
path = btrfs_alloc_path();
|
2011-03-23 16:14:16 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2007-03-27 04:00:06 +08:00
|
|
|
file_key.objectid = objectid;
|
2007-04-18 01:26:50 +08:00
|
|
|
file_key.offset = pos;
|
2014-06-05 00:41:45 +08:00
|
|
|
file_key.type = BTRFS_EXTENT_DATA_KEY;
|
2007-03-27 04:00:06 +08:00
|
|
|
|
2007-04-02 23:20:42 +08:00
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &file_key,
|
2007-03-27 04:00:06 +08:00
|
|
|
sizeof(*item));
|
2007-06-23 02:16:25 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2007-10-16 04:14:19 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0],
|
2007-03-27 04:00:06 +08:00
|
|
|
struct btrfs_file_extent_item);
|
2022-07-24 06:25:29 +08:00
|
|
|
btrfs_set_file_extent_disk_bytenr(leaf, item, 0);
|
|
|
|
btrfs_set_file_extent_disk_num_bytes(leaf, item, 0);
|
|
|
|
btrfs_set_file_extent_offset(leaf, item, 0);
|
2007-10-16 04:15:53 +08:00
|
|
|
btrfs_set_file_extent_num_bytes(leaf, item, num_bytes);
|
2022-07-24 06:25:29 +08:00
|
|
|
btrfs_set_file_extent_ram_bytes(leaf, item, num_bytes);
|
2007-10-16 04:14:19 +08:00
|
|
|
btrfs_set_file_extent_generation(leaf, item, trans->transid);
|
|
|
|
btrfs_set_file_extent_type(leaf, item, BTRFS_FILE_EXTENT_REG);
|
2022-07-24 06:25:29 +08:00
|
|
|
btrfs_set_file_extent_compression(leaf, item, 0);
|
|
|
|
btrfs_set_file_extent_encryption(leaf, item, 0);
|
|
|
|
btrfs_set_file_extent_other_encoding(leaf, item, 0);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 02:49:59 +08:00
|
|
|
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_mark_buffer_dirty(trans, leaf);
|
2007-06-23 02:16:25 +08:00
|
|
|
out:
|
2007-04-02 23:20:42 +08:00
|
|
|
btrfs_free_path(path);
|
2007-06-23 02:16:25 +08:00
|
|
|
return ret;
|
2007-03-21 02:38:32 +08:00
|
|
|
}
|
2007-03-27 04:00:06 +08:00
|
|
|
|
2013-04-26 04:41:01 +08:00
|
|
|
static struct btrfs_csum_item *
|
|
|
|
btrfs_lookup_csum(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, int cow)
|
2007-04-16 21:22:45 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2007-04-16 21:22:45 +08:00
|
|
|
int ret;
|
|
|
|
struct btrfs_key file_key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct btrfs_csum_item *item;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *leaf;
|
2007-04-16 21:22:45 +08:00
|
|
|
u64 csum_offset = 0;
|
2020-07-02 17:27:30 +08:00
|
|
|
const u32 csum_size = fs_info->csum_size;
|
2007-04-19 04:15:28 +08:00
|
|
|
int csums_in_item;
|
2007-04-16 21:22:45 +08:00
|
|
|
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 05:58:54 +08:00
|
|
|
file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
|
|
|
|
file_key.offset = bytenr;
|
2014-06-05 00:41:45 +08:00
|
|
|
file_key.type = BTRFS_EXTENT_CSUM_KEY;
|
2007-04-18 01:26:50 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &file_key, path, 0, cow);
|
2007-04-16 21:22:45 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto fail;
|
2007-10-16 04:14:19 +08:00
|
|
|
leaf = path->nodes[0];
|
2007-04-16 21:22:45 +08:00
|
|
|
if (ret > 0) {
|
|
|
|
ret = 1;
|
2007-04-18 03:39:32 +08:00
|
|
|
if (path->slots[0] == 0)
|
2007-04-16 21:22:45 +08:00
|
|
|
goto fail;
|
|
|
|
path->slots[0]--;
|
2007-10-16 04:14:19 +08:00
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
2014-06-05 00:41:45 +08:00
|
|
|
if (found_key.type != BTRFS_EXTENT_CSUM_KEY)
|
2007-04-16 21:22:45 +08:00
|
|
|
goto fail;
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 05:58:54 +08:00
|
|
|
|
|
|
|
csum_offset = (bytenr - found_key.offset) >>
|
2020-07-02 03:19:09 +08:00
|
|
|
fs_info->sectorsize_bits;
|
2021-10-22 02:58:35 +08:00
|
|
|
csums_in_item = btrfs_item_size(leaf, path->slots[0]);
|
2008-12-02 20:17:45 +08:00
|
|
|
csums_in_item /= csum_size;
|
2007-04-19 04:15:28 +08:00
|
|
|
|
2013-03-28 16:12:15 +08:00
|
|
|
if (csum_offset == csums_in_item) {
|
2007-04-19 04:15:28 +08:00
|
|
|
ret = -EFBIG;
|
2007-04-16 21:22:45 +08:00
|
|
|
goto fail;
|
2013-03-28 16:12:15 +08:00
|
|
|
} else if (csum_offset > csums_in_item) {
|
|
|
|
goto fail;
|
2007-04-16 21:22:45 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_csum_item);
|
2007-05-11 00:36:17 +08:00
|
|
|
item = (struct btrfs_csum_item *)((unsigned char *)item +
|
2008-12-02 20:17:45 +08:00
|
|
|
csum_offset * csum_size);
|
2007-04-16 21:22:45 +08:00
|
|
|
return item;
|
|
|
|
fail:
|
|
|
|
if (ret > 0)
|
2007-04-18 01:26:50 +08:00
|
|
|
ret = -ENOENT;
|
2007-04-16 21:22:45 +08:00
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
|
|
|
|
2007-03-27 04:00:06 +08:00
|
|
|
int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid,
|
2007-03-27 23:26:26 +08:00
|
|
|
u64 offset, int mod)
|
2007-03-27 04:00:06 +08:00
|
|
|
{
|
|
|
|
struct btrfs_key file_key;
|
|
|
|
int ins_len = mod < 0 ? -1 : 0;
|
|
|
|
int cow = mod != 0;
|
|
|
|
|
|
|
|
file_key.objectid = objectid;
|
2007-04-18 03:39:32 +08:00
|
|
|
file_key.offset = offset;
|
2014-06-05 00:41:45 +08:00
|
|
|
file_key.type = BTRFS_EXTENT_DATA_KEY;
|
2021-07-27 02:51:07 +08:00
|
|
|
|
|
|
|
return btrfs_search_slot(trans, root, &file_key, path, ins_len, cow);
|
2007-03-27 04:00:06 +08:00
|
|
|
}
|
2007-03-30 03:15:27 +08:00
|
|
|
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
/*
|
|
|
|
* Find checksums for logical bytenr range [disk_bytenr, disk_bytenr + len) and
|
2022-10-27 20:21:42 +08:00
|
|
|
* store the result to @dst.
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
*
|
|
|
|
* Return >0 for the number of sectors we found.
|
|
|
|
* Return 0 for the range [disk_bytenr, disk_bytenr + sectorsize) has no csum
|
|
|
|
* for it. Caller may want to try next sector until one range is hit.
|
|
|
|
* Return <0 for fatal error.
|
|
|
|
*/
|
|
|
|
static int search_csum_tree(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_path *path, u64 disk_bytenr,
|
|
|
|
u64 len, u8 *dst)
|
|
|
|
{
|
2021-11-06 04:45:48 +08:00
|
|
|
struct btrfs_root *csum_root;
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
struct btrfs_csum_item *item = NULL;
|
|
|
|
struct btrfs_key key;
|
|
|
|
const u32 sectorsize = fs_info->sectorsize;
|
|
|
|
const u32 csum_size = fs_info->csum_size;
|
|
|
|
u32 itemsize;
|
|
|
|
int ret;
|
|
|
|
u64 csum_start;
|
|
|
|
u64 csum_len;
|
|
|
|
|
|
|
|
ASSERT(IS_ALIGNED(disk_bytenr, sectorsize) &&
|
|
|
|
IS_ALIGNED(len, sectorsize));
|
|
|
|
|
|
|
|
/* Check if the current csum item covers disk_bytenr */
|
|
|
|
if (path->nodes[0]) {
|
|
|
|
item = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_csum_item);
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
2021-10-22 02:58:35 +08:00
|
|
|
itemsize = btrfs_item_size(path->nodes[0], path->slots[0]);
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
|
|
|
|
csum_start = key.offset;
|
|
|
|
csum_len = (itemsize / csum_size) * sectorsize;
|
|
|
|
|
|
|
|
if (in_range(disk_bytenr, csum_start, csum_len))
|
|
|
|
goto found;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Current item doesn't contain the desired range, search again */
|
|
|
|
btrfs_release_path(path);
|
2021-11-06 04:45:48 +08:00
|
|
|
csum_root = btrfs_csum_root(fs_info, disk_bytenr);
|
|
|
|
item = btrfs_lookup_csum(NULL, csum_root, path, disk_bytenr, 0);
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
if (IS_ERR(item)) {
|
|
|
|
ret = PTR_ERR(item);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
2021-10-22 02:58:35 +08:00
|
|
|
itemsize = btrfs_item_size(path->nodes[0], path->slots[0]);
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
|
|
|
|
csum_start = key.offset;
|
|
|
|
csum_len = (itemsize / csum_size) * sectorsize;
|
|
|
|
ASSERT(in_range(disk_bytenr, csum_start, csum_len));
|
|
|
|
|
|
|
|
found:
|
|
|
|
ret = (min(csum_start + csum_len, disk_bytenr + len) -
|
|
|
|
disk_bytenr) >> fs_info->sectorsize_bits;
|
|
|
|
read_extent_buffer(path->nodes[0], dst, (unsigned long)item,
|
|
|
|
ret * csum_size);
|
|
|
|
out:
|
2022-02-18 23:03:22 +08:00
|
|
|
if (ret == -ENOENT || ret == -EFBIG)
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
ret = 0;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
* Lookup the checksum for the read bio in csum tree.
|
2020-12-02 14:48:05 +08:00
|
|
|
*
|
2019-12-03 09:34:17 +08:00
|
|
|
* Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
|
|
|
|
*/
|
2023-01-21 14:50:02 +08:00
|
|
|
blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
|
2008-08-01 03:42:53 +08:00
|
|
|
{
|
2023-01-21 14:50:02 +08:00
|
|
|
struct btrfs_inode *inode = bbio->inode;
|
|
|
|
struct btrfs_fs_info *fs_info = inode->root->fs_info;
|
|
|
|
struct bio *bio = &bbio->bio;
|
2013-07-25 19:22:34 +08:00
|
|
|
struct btrfs_path *path;
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
const u32 sectorsize = fs_info->sectorsize;
|
|
|
|
const u32 csum_size = fs_info->csum_size;
|
|
|
|
u32 orig_len = bio->bi_iter.bi_size;
|
|
|
|
u64 orig_disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
|
|
|
|
const unsigned int nblocks = orig_len >> fs_info->sectorsize_bits;
|
2022-02-18 23:03:23 +08:00
|
|
|
blk_status_t ret = BLK_STS_OK;
|
2023-02-23 01:07:02 +08:00
|
|
|
u32 bio_offset = 0;
|
2008-08-01 03:42:53 +08:00
|
|
|
|
2023-01-21 14:50:02 +08:00
|
|
|
if ((inode->flags & BTRFS_INODE_NODATASUM) ||
|
2021-11-06 04:45:47 +08:00
|
|
|
test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state))
|
2020-10-16 23:29:14 +08:00
|
|
|
return BLK_STS_OK;
|
|
|
|
|
2020-12-02 14:48:05 +08:00
|
|
|
/*
|
|
|
|
* This function is only called for read bio.
|
|
|
|
*
|
|
|
|
* This means two things:
|
|
|
|
* - All our csums should only be in csum tree
|
|
|
|
* No ordered extents csums, as ordered extents are only for write
|
|
|
|
* path.
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
* - No need to bother any other info from bvec
|
|
|
|
* Since we're looking up csums, the only important info is the
|
|
|
|
* disk_bytenr and the length, which can be extracted from bi_iter
|
|
|
|
* directly.
|
2020-12-02 14:48:05 +08:00
|
|
|
*/
|
|
|
|
ASSERT(bio_op(bio) == REQ_OP_READ);
|
2008-08-01 03:42:53 +08:00
|
|
|
path = btrfs_alloc_path();
|
2011-03-01 14:48:31 +08:00
|
|
|
if (!path)
|
2017-06-03 15:38:06 +08:00
|
|
|
return BLK_STS_RESOURCE;
|
2013-07-25 19:22:34 +08:00
|
|
|
|
2023-01-21 14:50:02 +08:00
|
|
|
if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
|
|
|
|
bbio->csum = kmalloc_array(nblocks, csum_size, GFP_NOFS);
|
|
|
|
if (!bbio->csum) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return BLK_STS_RESOURCE;
|
2013-07-25 19:22:34 +08:00
|
|
|
}
|
|
|
|
} else {
|
2023-01-21 14:50:02 +08:00
|
|
|
bbio->csum = bbio->csum_inline;
|
2013-07-25 19:22:34 +08:00
|
|
|
}
|
|
|
|
|
btrfs: use nodesize to determine if we need readahead in btrfs_lookup_bio_sums
In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
start readahead in the csum tree.
However the threshold is an immediate number, (PAGE_SIZE * 8), from the
initial btrfs merge.
The meaning of the value is pretty hard to guess, especially when the
immediate number is from the times when 4K sectorsize was the default
and only CRC32C was supported.
For the most common btrfs setup, CRC32 csum and 4K sectorsize,
it means just 32K read would kick readahead, while the csum itself is
only 32 bytes in size.
Now let's be more reasonable by taking both csum size and node size into
consideration.
If the csum size for the bio is larger than one leaf, then we kick the
readahead. This means for current default btrfs, the threshold will be
16M.
This change should not change performance observably, thus this is
mostly a readability enhancement.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-13 20:51:41 +08:00
|
|
|
/*
|
|
|
|
* If requested number of sectors is larger than one leaf can contain,
|
|
|
|
* kick the readahead for csum tree.
|
|
|
|
*/
|
|
|
|
if (nblocks > fs_info->csums_per_leaf)
|
2015-11-27 23:31:35 +08:00
|
|
|
path->reada = READA_FORWARD;
|
2008-08-01 03:42:53 +08:00
|
|
|
|
2011-07-27 03:35:09 +08:00
|
|
|
/*
|
|
|
|
* the free space stuff is only read when it hasn't been
|
|
|
|
* updated in the current transaction. So, we can safely
|
|
|
|
* read from the commit root and sidestep a nasty deadlock
|
|
|
|
* between reading the free space cache and updating the csum tree.
|
|
|
|
*/
|
2023-01-21 14:50:02 +08:00
|
|
|
if (btrfs_is_free_space_inode(inode)) {
|
2011-07-27 03:35:09 +08:00
|
|
|
path->search_commit_root = 1;
|
2011-09-11 22:52:24 +08:00
|
|
|
path->skip_locking = 1;
|
|
|
|
}
|
2011-07-27 03:35:09 +08:00
|
|
|
|
2023-02-23 01:07:02 +08:00
|
|
|
while (bio_offset < orig_len) {
|
|
|
|
int count;
|
|
|
|
u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
|
|
|
|
u8 *csum_dst = bbio->csum +
|
|
|
|
(bio_offset >> fs_info->sectorsize_bits) * csum_size;
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
|
|
|
|
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
|
2023-02-23 01:07:02 +08:00
|
|
|
orig_len - bio_offset, csum_dst);
|
2022-02-18 23:03:23 +08:00
|
|
|
if (count < 0) {
|
|
|
|
ret = errno_to_blk_status(count);
|
2023-01-21 14:50:02 +08:00
|
|
|
if (bbio->csum != bbio->csum_inline)
|
|
|
|
kfree(bbio->csum);
|
|
|
|
bbio->csum = NULL;
|
2022-02-18 23:03:23 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We didn't find a csum for this range. We need to make sure
|
|
|
|
* we complain loudly about this, because we are not NODATASUM.
|
|
|
|
*
|
|
|
|
* However for the DATA_RELOC inode we could potentially be
|
|
|
|
* relocating data extents for a NODATASUM inode, so the inode
|
|
|
|
* itself won't be marked with NODATASUM, but the extent we're
|
|
|
|
* copying is in fact NODATASUM. If we don't find a csum we
|
|
|
|
* assume this is the case.
|
|
|
|
*/
|
|
|
|
if (count == 0) {
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
memset(csum_dst, 0, csum_size);
|
|
|
|
count = 1;
|
|
|
|
|
2024-04-16 04:16:23 +08:00
|
|
|
if (btrfs_root_id(inode->root) == BTRFS_DATA_RELOC_TREE_OBJECTID) {
|
2023-02-23 01:07:02 +08:00
|
|
|
u64 file_offset = bbio->file_offset + bio_offset;
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
|
2023-05-25 07:04:32 +08:00
|
|
|
set_extent_bit(&inode->io_tree, file_offset,
|
|
|
|
file_offset + sectorsize - 1,
|
2023-05-25 07:04:39 +08:00
|
|
|
EXTENT_NODATASUM, NULL);
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
} else {
|
|
|
|
btrfs_warn_rl(fs_info,
|
|
|
|
"csum hole found for disk bytenr range [%llu, %llu)",
|
|
|
|
cur_disk_bytenr, cur_disk_bytenr + sectorsize);
|
|
|
|
}
|
2013-04-05 15:20:56 +08:00
|
|
|
}
|
2023-02-23 01:07:02 +08:00
|
|
|
bio_offset += count * sectorsize;
|
2008-08-01 03:42:53 +08:00
|
|
|
}
|
2016-03-21 21:59:09 +08:00
|
|
|
|
2008-08-01 03:42:53 +08:00
|
|
|
btrfs_free_path(path);
|
2022-02-18 23:03:23 +08:00
|
|
|
return ret;
|
2010-05-23 23:00:55 +08:00
|
|
|
}
|
|
|
|
|
2024-04-12 01:17:05 +08:00
|
|
|
/*
|
|
|
|
* Search for checksums for a given logical range.
|
|
|
|
*
|
|
|
|
* @root: The root where to look for checksums.
|
|
|
|
* @start: Logical address of target checksum range.
|
|
|
|
* @end: End offset (inclusive) of the target checksum range.
|
|
|
|
* @list: List for adding each checksum that was found.
|
btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-12 18:48:31 +08:00
|
|
|
* Can be NULL in case the caller only wants to check if
|
|
|
|
* there any checksums for the range.
|
2024-04-12 01:17:05 +08:00
|
|
|
* @nowait: Indicate if the search must be non-blocking or not.
|
|
|
|
*
|
btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-12 18:48:31 +08:00
|
|
|
* Return < 0 on error, 0 if no checksums were found, or 1 if checksums were
|
|
|
|
* found.
|
2024-04-12 01:17:05 +08:00
|
|
|
*/
|
2022-11-14 08:26:32 +08:00
|
|
|
int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
|
2024-04-12 01:26:56 +08:00
|
|
|
struct list_head *list, bool nowait)
|
2008-12-12 23:03:38 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2008-12-12 23:03:38 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_ordered_sum *sums;
|
|
|
|
struct btrfs_csum_item *item;
|
|
|
|
int ret;
|
btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-12 18:48:31 +08:00
|
|
|
bool found_csums = false;
|
2008-12-12 23:03:38 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
|
|
|
|
IS_ALIGNED(end + 1, fs_info->sectorsize));
|
2013-10-15 21:36:40 +08:00
|
|
|
|
2008-12-12 23:03:38 +08:00
|
|
|
path = btrfs_alloc_path();
|
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 01:38:47 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-12-12 23:03:38 +08:00
|
|
|
|
2022-09-13 03:27:43 +08:00
|
|
|
path->nowait = nowait;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2008-12-12 23:03:38 +08:00
|
|
|
key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
|
|
|
|
key.offset = start;
|
|
|
|
key.type = BTRFS_EXTENT_CSUM_KEY;
|
|
|
|
|
2009-01-07 00:42:00 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
2008-12-12 23:03:38 +08:00
|
|
|
if (ret < 0)
|
2024-04-12 01:39:51 +08:00
|
|
|
goto out;
|
2008-12-12 23:03:38 +08:00
|
|
|
if (ret > 0 && path->slots[0] > 0) {
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
|
2022-11-14 08:26:31 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* There are two cases we can hit here for the previous csum
|
|
|
|
* item:
|
|
|
|
*
|
|
|
|
* |<- search range ->|
|
|
|
|
* |<- csum item ->|
|
|
|
|
*
|
|
|
|
* Or
|
|
|
|
* |<- search range ->|
|
|
|
|
* |<- csum item ->|
|
|
|
|
*
|
|
|
|
* Check if the previous csum item covers the leading part of
|
|
|
|
* the search range. If so we have to start from previous csum
|
|
|
|
* item.
|
|
|
|
*/
|
2008-12-12 23:03:38 +08:00
|
|
|
if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
|
|
|
|
key.type == BTRFS_EXTENT_CSUM_KEY) {
|
2022-11-14 08:26:31 +08:00
|
|
|
if (bytes_to_csum_size(fs_info, start - key.offset) <
|
2021-10-22 02:58:35 +08:00
|
|
|
btrfs_item_size(leaf, path->slots[0] - 1))
|
2008-12-12 23:03:38 +08:00
|
|
|
path->slots[0]--;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
while (start <= end) {
|
2022-11-14 08:26:31 +08:00
|
|
|
u64 csum_end;
|
|
|
|
|
2008-12-12 23:03:38 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
if (path->slots[0] >= btrfs_header_nritems(leaf)) {
|
2009-01-07 00:42:00 +08:00
|
|
|
ret = btrfs_next_leaf(root, path);
|
2008-12-12 23:03:38 +08:00
|
|
|
if (ret < 0)
|
2024-04-12 01:39:51 +08:00
|
|
|
goto out;
|
2008-12-12 23:03:38 +08:00
|
|
|
if (ret > 0)
|
|
|
|
break;
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
|
2013-03-18 17:18:09 +08:00
|
|
|
key.type != BTRFS_EXTENT_CSUM_KEY ||
|
|
|
|
key.offset > end)
|
2008-12-12 23:03:38 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (key.offset > start)
|
|
|
|
start = key.offset;
|
|
|
|
|
2022-11-14 08:26:31 +08:00
|
|
|
csum_end = key.offset + csum_size_to_bytes(fs_info,
|
|
|
|
btrfs_item_size(leaf, path->slots[0]));
|
2008-12-17 23:21:48 +08:00
|
|
|
if (csum_end <= start) {
|
|
|
|
path->slots[0]++;
|
|
|
|
continue;
|
|
|
|
}
|
2008-12-12 23:03:38 +08:00
|
|
|
|
btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-12 18:48:31 +08:00
|
|
|
found_csums = true;
|
|
|
|
if (!list)
|
|
|
|
goto out;
|
|
|
|
|
2009-01-07 00:42:00 +08:00
|
|
|
csum_end = min(csum_end, end + 1);
|
2008-12-12 23:03:38 +08:00
|
|
|
item = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_csum_item);
|
2009-01-07 00:42:00 +08:00
|
|
|
while (start < csum_end) {
|
2022-11-14 08:26:31 +08:00
|
|
|
unsigned long offset;
|
|
|
|
size_t size;
|
|
|
|
|
2009-01-07 00:42:00 +08:00
|
|
|
size = min_t(size_t, csum_end - start,
|
2022-11-14 08:26:31 +08:00
|
|
|
max_ordered_sum_bytes(fs_info));
|
2016-06-23 06:54:23 +08:00
|
|
|
sums = kzalloc(btrfs_ordered_sum_size(fs_info, size),
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
GFP_NOFS);
|
2011-08-06 06:46:16 +08:00
|
|
|
if (!sums) {
|
|
|
|
ret = -ENOMEM;
|
2024-04-12 01:39:51 +08:00
|
|
|
goto out;
|
2011-08-06 06:46:16 +08:00
|
|
|
}
|
2008-12-12 23:03:38 +08:00
|
|
|
|
2023-05-24 23:03:07 +08:00
|
|
|
sums->logical = start;
|
2023-05-24 23:03:06 +08:00
|
|
|
sums->len = size;
|
2009-01-07 00:42:00 +08:00
|
|
|
|
2022-11-14 08:26:31 +08:00
|
|
|
offset = bytes_to_csum_size(fs_info, start - key.offset);
|
2009-01-07 00:42:00 +08:00
|
|
|
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
read_extent_buffer(path->nodes[0],
|
|
|
|
sums->sums,
|
|
|
|
((unsigned long)item) + offset,
|
2022-11-14 08:26:31 +08:00
|
|
|
bytes_to_csum_size(fs_info, size));
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
|
2022-11-14 08:26:31 +08:00
|
|
|
start += size;
|
2024-04-12 01:33:43 +08:00
|
|
|
list_add_tail(&sums->list, list);
|
2009-01-07 00:42:00 +08:00
|
|
|
}
|
2008-12-12 23:03:38 +08:00
|
|
|
path->slots[0]++;
|
|
|
|
}
|
2024-04-12 01:39:51 +08:00
|
|
|
out:
|
btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-12 18:48:31 +08:00
|
|
|
btrfs_free_path(path);
|
2024-04-12 01:39:51 +08:00
|
|
|
if (ret < 0) {
|
btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-12 18:48:31 +08:00
|
|
|
if (list) {
|
|
|
|
struct btrfs_ordered_sum *tmp_sums;
|
|
|
|
|
|
|
|
list_for_each_entry_safe(sums, tmp_sums, list, list)
|
|
|
|
kfree(sums);
|
|
|
|
}
|
2024-04-12 01:39:51 +08:00
|
|
|
|
btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-12 18:48:31 +08:00
|
|
|
return ret;
|
2011-08-06 06:46:16 +08:00
|
|
|
}
|
|
|
|
|
btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-12 18:48:31 +08:00
|
|
|
return found_csums ? 1 : 0;
|
2008-12-12 23:03:38 +08:00
|
|
|
}
|
|
|
|
|
2022-11-14 08:26:32 +08:00
|
|
|
/*
|
|
|
|
* Do the same work as btrfs_lookup_csums_list(), the difference is in how
|
|
|
|
* we return the result.
|
|
|
|
*
|
|
|
|
* This version will set the corresponding bits in @csum_bitmap to represent
|
|
|
|
* that there is a csum found.
|
|
|
|
* Each bit represents a sector. Thus caller should ensure @csum_buf passed
|
|
|
|
* in is large enough to contain all csums.
|
|
|
|
*/
|
2023-08-03 14:33:30 +08:00
|
|
|
int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
|
|
|
|
u64 start, u64 end, u8 *csum_buf,
|
|
|
|
unsigned long *csum_bitmap)
|
2022-11-14 08:26:32 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_csum_item *item;
|
|
|
|
const u64 orig_start = start;
|
2023-08-03 14:33:30 +08:00
|
|
|
bool free_path = false;
|
2022-11-14 08:26:32 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
|
|
|
|
IS_ALIGNED(end + 1, fs_info->sectorsize));
|
|
|
|
|
2023-08-03 14:33:30 +08:00
|
|
|
if (!path) {
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
free_path = true;
|
|
|
|
}
|
2022-11-14 08:26:32 +08:00
|
|
|
|
2023-08-03 14:33:30 +08:00
|
|
|
/* Check if we can reuse the previous path. */
|
|
|
|
if (path->nodes[0]) {
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
|
|
|
|
|
|
|
if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
|
|
|
|
key.type == BTRFS_EXTENT_CSUM_KEY &&
|
|
|
|
key.offset <= start)
|
|
|
|
goto search_forward;
|
|
|
|
btrfs_release_path(path);
|
2023-03-20 10:12:51 +08:00
|
|
|
}
|
|
|
|
|
2022-11-14 08:26:32 +08:00
|
|
|
key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
|
|
|
|
key.type = BTRFS_EXTENT_CSUM_KEY;
|
|
|
|
key.offset = start;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto fail;
|
|
|
|
if (ret > 0 && path->slots[0] > 0) {
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There are two cases we can hit here for the previous csum
|
|
|
|
* item:
|
|
|
|
*
|
|
|
|
* |<- search range ->|
|
|
|
|
* |<- csum item ->|
|
|
|
|
*
|
|
|
|
* Or
|
|
|
|
* |<- search range ->|
|
|
|
|
* |<- csum item ->|
|
|
|
|
*
|
|
|
|
* Check if the previous csum item covers the leading part of
|
|
|
|
* the search range. If so we have to start from previous csum
|
|
|
|
* item.
|
|
|
|
*/
|
|
|
|
if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
|
|
|
|
key.type == BTRFS_EXTENT_CSUM_KEY) {
|
|
|
|
if (bytes_to_csum_size(fs_info, start - key.offset) <
|
|
|
|
btrfs_item_size(leaf, path->slots[0] - 1))
|
|
|
|
path->slots[0]--;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-08-03 14:33:30 +08:00
|
|
|
search_forward:
|
2022-11-14 08:26:32 +08:00
|
|
|
while (start <= end) {
|
|
|
|
u64 csum_end;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
if (path->slots[0] >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto fail;
|
|
|
|
if (ret > 0)
|
|
|
|
break;
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
|
|
|
|
key.type != BTRFS_EXTENT_CSUM_KEY ||
|
|
|
|
key.offset > end)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (key.offset > start)
|
|
|
|
start = key.offset;
|
|
|
|
|
|
|
|
csum_end = key.offset + csum_size_to_bytes(fs_info,
|
|
|
|
btrfs_item_size(leaf, path->slots[0]));
|
|
|
|
if (csum_end <= start) {
|
|
|
|
path->slots[0]++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
csum_end = min(csum_end, end + 1);
|
|
|
|
item = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_csum_item);
|
|
|
|
while (start < csum_end) {
|
|
|
|
unsigned long offset;
|
|
|
|
size_t size;
|
|
|
|
u8 *csum_dest = csum_buf + bytes_to_csum_size(fs_info,
|
|
|
|
start - orig_start);
|
|
|
|
|
|
|
|
size = min_t(size_t, csum_end - start, end + 1 - start);
|
|
|
|
|
|
|
|
offset = bytes_to_csum_size(fs_info, start - key.offset);
|
|
|
|
|
|
|
|
read_extent_buffer(path->nodes[0], csum_dest,
|
|
|
|
((unsigned long)item) + offset,
|
|
|
|
bytes_to_csum_size(fs_info, size));
|
|
|
|
|
|
|
|
bitmap_set(csum_bitmap,
|
|
|
|
(start - orig_start) >> fs_info->sectorsize_bits,
|
|
|
|
size >> fs_info->sectorsize_bits);
|
|
|
|
|
|
|
|
start += size;
|
|
|
|
}
|
|
|
|
path->slots[0]++;
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
fail:
|
2023-08-03 14:33:30 +08:00
|
|
|
if (free_path)
|
|
|
|
btrfs_free_path(path);
|
2022-11-14 08:26:32 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-10-27 20:21:42 +08:00
|
|
|
/*
|
|
|
|
* Calculate checksums of the data contained inside a bio.
|
2019-04-22 21:07:31 +08:00
|
|
|
*/
|
2023-01-21 14:50:16 +08:00
|
|
|
blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio)
|
2008-04-16 23:15:20 +08:00
|
|
|
{
|
2023-05-31 15:54:03 +08:00
|
|
|
struct btrfs_ordered_extent *ordered = bbio->ordered;
|
2023-01-21 14:50:16 +08:00
|
|
|
struct btrfs_inode *inode = bbio->inode;
|
2020-06-03 13:55:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = inode->root->fs_info;
|
2019-06-03 22:58:57 +08:00
|
|
|
SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
|
2023-01-21 14:50:16 +08:00
|
|
|
struct bio *bio = &bbio->bio;
|
2008-07-18 00:53:50 +08:00
|
|
|
struct btrfs_ordered_sum *sums;
|
2008-04-16 23:15:20 +08:00
|
|
|
char *data;
|
2017-05-16 06:33:27 +08:00
|
|
|
struct bvec_iter iter;
|
|
|
|
struct bio_vec bvec;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
int index;
|
2019-11-07 07:38:43 +08:00
|
|
|
unsigned int blockcount;
|
2017-05-16 06:33:27 +08:00
|
|
|
int i;
|
2019-04-01 16:29:58 +08:00
|
|
|
unsigned nofs_flag;
|
|
|
|
|
|
|
|
nofs_flag = memalloc_nofs_save();
|
|
|
|
sums = kvzalloc(btrfs_ordered_sum_size(fs_info, bio->bi_iter.bi_size),
|
|
|
|
GFP_KERNEL);
|
|
|
|
memalloc_nofs_restore(nofs_flag);
|
2008-04-16 23:15:20 +08:00
|
|
|
|
|
|
|
if (!sums)
|
2017-06-03 15:38:06 +08:00
|
|
|
return BLK_STS_RESOURCE;
|
2008-07-18 18:17:13 +08:00
|
|
|
|
2013-10-12 06:44:27 +08:00
|
|
|
sums->len = bio->bi_iter.bi_size;
|
2008-07-18 00:53:50 +08:00
|
|
|
INIT_LIST_HEAD(&sums->list);
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 05:58:54 +08:00
|
|
|
|
2023-05-24 23:03:07 +08:00
|
|
|
sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
index = 0;
|
2008-04-16 23:15:20 +08:00
|
|
|
|
2019-06-03 22:58:57 +08:00
|
|
|
shash->tfm = fs_info->csum_shash;
|
|
|
|
|
2017-05-16 06:33:27 +08:00
|
|
|
bio_for_each_segment(bvec, bio, iter) {
|
2019-11-07 07:38:43 +08:00
|
|
|
blockcount = BTRFS_BYTES_TO_BLKS(fs_info,
|
2017-05-16 06:33:27 +08:00
|
|
|
bvec.bv_len + fs_info->sectorsize
|
2016-06-23 06:54:23 +08:00
|
|
|
- 1);
|
2016-01-21 18:25:54 +08:00
|
|
|
|
2019-11-07 07:38:43 +08:00
|
|
|
for (i = 0; i < blockcount; i++) {
|
2021-10-12 14:31:53 +08:00
|
|
|
data = bvec_kmap_local(&bvec);
|
|
|
|
crypto_shash_digest(shash,
|
|
|
|
data + (i * fs_info->sectorsize),
|
2020-05-01 14:51:59 +08:00
|
|
|
fs_info->sectorsize,
|
|
|
|
sums->sums + index);
|
2021-10-12 14:31:53 +08:00
|
|
|
kunmap_local(data);
|
2020-07-01 00:04:02 +08:00
|
|
|
index += fs_info->csum_size;
|
2008-07-18 18:17:13 +08:00
|
|
|
}
|
|
|
|
|
2008-04-16 23:15:20 +08:00
|
|
|
}
|
2023-05-24 23:03:08 +08:00
|
|
|
|
|
|
|
bbio->sums = sums;
|
2019-04-10 21:16:11 +08:00
|
|
|
btrfs_add_ordered_sum(ordered, sums);
|
2008-04-16 23:15:20 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-05-24 23:03:08 +08:00
|
|
|
/*
|
|
|
|
* Nodatasum I/O on zoned file systems still requires an btrfs_ordered_sum to
|
|
|
|
* record the updated logical address on Zone Append completion.
|
|
|
|
* Allocate just the structure with an empty sums array here for that case.
|
|
|
|
*/
|
|
|
|
blk_status_t btrfs_alloc_dummy_sum(struct btrfs_bio *bbio)
|
|
|
|
{
|
|
|
|
bbio->sums = kmalloc(sizeof(*bbio->sums), GFP_NOFS);
|
|
|
|
if (!bbio->sums)
|
|
|
|
return BLK_STS_RESOURCE;
|
|
|
|
bbio->sums->len = bbio->bio.bi_iter.bi_size;
|
|
|
|
bbio->sums->logical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
|
2023-05-31 15:54:02 +08:00
|
|
|
btrfs_add_ordered_sum(bbio->ordered, bbio->sums);
|
2023-05-24 23:03:08 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-12-10 22:10:46 +08:00
|
|
|
/*
|
2022-10-27 20:21:42 +08:00
|
|
|
* Remove one checksum overlapping a range.
|
|
|
|
*
|
|
|
|
* This expects the key to describe the csum pointed to by the path, and it
|
|
|
|
* expects the csum to overlap the range [bytenr, len]
|
2008-12-10 22:10:46 +08:00
|
|
|
*
|
2022-10-27 20:21:42 +08:00
|
|
|
* The csum should not be entirely contained in the range and the range should
|
|
|
|
* not be entirely contained in the csum.
|
2008-12-10 22:10:46 +08:00
|
|
|
*
|
2022-10-27 20:21:42 +08:00
|
|
|
* This calls btrfs_truncate_item with the correct args based on the overlap,
|
|
|
|
* and fixes up the key as required.
|
2008-12-10 22:10:46 +08:00
|
|
|
*/
|
2023-09-12 20:04:29 +08:00
|
|
|
static noinline void truncate_one_csum(struct btrfs_trans_handle *trans,
|
2012-03-01 21:56:26 +08:00
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
u64 bytenr, u64 len)
|
2008-12-10 22:10:46 +08:00
|
|
|
{
|
2023-09-12 20:04:29 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2008-12-10 22:10:46 +08:00
|
|
|
struct extent_buffer *leaf;
|
2020-07-02 17:27:30 +08:00
|
|
|
const u32 csum_size = fs_info->csum_size;
|
2008-12-10 22:10:46 +08:00
|
|
|
u64 csum_end;
|
|
|
|
u64 end_byte = bytenr + len;
|
2020-07-02 03:19:09 +08:00
|
|
|
u32 blocksize_bits = fs_info->sectorsize_bits;
|
2008-12-10 22:10:46 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
2021-10-22 02:58:35 +08:00
|
|
|
csum_end = btrfs_item_size(leaf, path->slots[0]) / csum_size;
|
2020-07-02 03:19:09 +08:00
|
|
|
csum_end <<= blocksize_bits;
|
2008-12-10 22:10:46 +08:00
|
|
|
csum_end += key->offset;
|
|
|
|
|
|
|
|
if (key->offset < bytenr && csum_end <= end_byte) {
|
|
|
|
/*
|
|
|
|
* [ bytenr - len ]
|
|
|
|
* [ ]
|
|
|
|
* [csum ]
|
|
|
|
* A simple truncate off the end of the item
|
|
|
|
*/
|
|
|
|
u32 new_size = (bytenr - key->offset) >> blocksize_bits;
|
|
|
|
new_size *= csum_size;
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_truncate_item(trans, path, new_size, 1);
|
2008-12-10 22:10:46 +08:00
|
|
|
} else if (key->offset >= bytenr && csum_end > end_byte &&
|
|
|
|
end_byte > key->offset) {
|
|
|
|
/*
|
|
|
|
* [ bytenr - len ]
|
|
|
|
* [ ]
|
|
|
|
* [csum ]
|
|
|
|
* we need to truncate from the beginning of the csum
|
|
|
|
*/
|
|
|
|
u32 new_size = (csum_end - end_byte) >> blocksize_bits;
|
|
|
|
new_size *= csum_size;
|
|
|
|
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_truncate_item(trans, path, new_size, 0);
|
2008-12-10 22:10:46 +08:00
|
|
|
|
|
|
|
key->offset = end_byte;
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_set_item_key_safe(trans, path, key);
|
2008-12-10 22:10:46 +08:00
|
|
|
} else {
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2022-10-27 20:21:42 +08:00
|
|
|
* Delete the csum items from the csum tree for a given range of bytes.
|
2008-12-10 22:10:46 +08:00
|
|
|
*/
|
|
|
|
int btrfs_del_csums(struct btrfs_trans_handle *trans,
|
Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-06 00:58:30 +08:00
|
|
|
struct btrfs_root *root, u64 bytenr, u64 len)
|
2008-12-10 22:10:46 +08:00
|
|
|
{
|
Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-06 00:58:30 +08:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2008-12-10 22:10:46 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
u64 end_byte = bytenr + len;
|
|
|
|
u64 csum_end;
|
|
|
|
struct extent_buffer *leaf;
|
2021-05-19 22:52:45 +08:00
|
|
|
int ret = 0;
|
2020-07-02 17:27:30 +08:00
|
|
|
const u32 csum_size = fs_info->csum_size;
|
2020-07-02 03:19:09 +08:00
|
|
|
u32 blocksize_bits = fs_info->sectorsize_bits;
|
2008-12-10 22:10:46 +08:00
|
|
|
|
2024-04-16 04:16:23 +08:00
|
|
|
ASSERT(btrfs_root_id(root) == BTRFS_CSUM_TREE_OBJECTID ||
|
|
|
|
btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID);
|
Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-06 00:58:30 +08:00
|
|
|
|
2008-12-10 22:10:46 +08:00
|
|
|
path = btrfs_alloc_path();
|
2011-01-26 14:22:08 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-12-10 22:10:46 +08:00
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2008-12-10 22:10:46 +08:00
|
|
|
key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
|
|
|
|
key.offset = end_byte - 1;
|
|
|
|
key.type = BTRFS_EXTENT_CSUM_KEY;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret > 0) {
|
2021-05-19 22:52:45 +08:00
|
|
|
ret = 0;
|
2008-12-10 22:10:46 +08:00
|
|
|
if (path->slots[0] == 0)
|
2011-05-19 12:37:44 +08:00
|
|
|
break;
|
2008-12-10 22:10:46 +08:00
|
|
|
path->slots[0]--;
|
2011-01-29 02:44:44 +08:00
|
|
|
} else if (ret < 0) {
|
2011-05-19 12:37:44 +08:00
|
|
|
break;
|
2008-12-10 22:10:46 +08:00
|
|
|
}
|
2011-01-29 02:44:44 +08:00
|
|
|
|
2008-12-10 22:10:46 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
|
|
|
|
if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
|
|
|
|
key.type != BTRFS_EXTENT_CSUM_KEY) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (key.offset >= end_byte)
|
|
|
|
break;
|
|
|
|
|
2021-10-22 02:58:35 +08:00
|
|
|
csum_end = btrfs_item_size(leaf, path->slots[0]) / csum_size;
|
2008-12-10 22:10:46 +08:00
|
|
|
csum_end <<= blocksize_bits;
|
|
|
|
csum_end += key.offset;
|
|
|
|
|
|
|
|
/* this csum ends before we start, we're done */
|
|
|
|
if (csum_end <= bytenr)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* delete the entire item, it is inside our range */
|
|
|
|
if (key.offset >= bytenr && csum_end <= end_byte) {
|
2017-01-28 09:47:56 +08:00
|
|
|
int del_nr = 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check how many csum items preceding this one in this
|
|
|
|
* leaf correspond to our range and then delete them all
|
|
|
|
* at once.
|
|
|
|
*/
|
|
|
|
if (key.offset > bytenr && path->slots[0] > 0) {
|
|
|
|
int slot = path->slots[0] - 1;
|
|
|
|
|
|
|
|
while (slot >= 0) {
|
|
|
|
struct btrfs_key pk;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &pk, slot);
|
|
|
|
if (pk.offset < bytenr ||
|
|
|
|
pk.type != BTRFS_EXTENT_CSUM_KEY ||
|
|
|
|
pk.objectid !=
|
|
|
|
BTRFS_EXTENT_CSUM_OBJECTID)
|
|
|
|
break;
|
|
|
|
path->slots[0] = slot;
|
|
|
|
del_nr++;
|
|
|
|
key.offset = pk.offset;
|
|
|
|
slot--;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ret = btrfs_del_items(trans, root, path,
|
|
|
|
path->slots[0], del_nr);
|
2011-05-19 12:37:44 +08:00
|
|
|
if (ret)
|
2021-05-19 22:52:45 +08:00
|
|
|
break;
|
2008-12-17 02:51:01 +08:00
|
|
|
if (key.offset == bytenr)
|
|
|
|
break;
|
2008-12-10 22:10:46 +08:00
|
|
|
} else if (key.offset < bytenr && csum_end > end_byte) {
|
|
|
|
unsigned long offset;
|
|
|
|
unsigned long shift_len;
|
|
|
|
unsigned long item_offset;
|
|
|
|
/*
|
|
|
|
* [ bytenr - len ]
|
|
|
|
* [csum ]
|
|
|
|
*
|
|
|
|
* Our bytes are in the middle of the csum,
|
|
|
|
* we need to split this item and insert a new one.
|
|
|
|
*
|
|
|
|
* But we can't drop the path because the
|
|
|
|
* csum could change, get removed, extended etc.
|
|
|
|
*
|
|
|
|
* The trick here is the max size of a csum item leaves
|
|
|
|
* enough room in the tree block for a single
|
|
|
|
* item header. So, we split the item in place,
|
|
|
|
* adding a new header pointing to the existing
|
|
|
|
* bytes. Then we loop around again and we have
|
|
|
|
* a nicely formed csum item that we can neatly
|
|
|
|
* truncate.
|
|
|
|
*/
|
|
|
|
offset = (bytenr - key.offset) >> blocksize_bits;
|
|
|
|
offset *= csum_size;
|
|
|
|
|
|
|
|
shift_len = (len >> blocksize_bits) * csum_size;
|
|
|
|
|
|
|
|
item_offset = btrfs_item_ptr_offset(leaf,
|
|
|
|
path->slots[0]);
|
|
|
|
|
2016-11-09 01:09:03 +08:00
|
|
|
memzero_extent_buffer(leaf, item_offset + offset,
|
2008-12-10 22:10:46 +08:00
|
|
|
shift_len);
|
|
|
|
key.offset = bytenr;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* btrfs_split_item returns -EAGAIN when the
|
|
|
|
* item changed size or key
|
|
|
|
*/
|
|
|
|
ret = btrfs_split_item(trans, root, path, &key, offset);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret && ret != -EAGAIN) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2021-05-19 22:52:45 +08:00
|
|
|
break;
|
2012-03-12 23:03:00 +08:00
|
|
|
}
|
2021-05-19 22:52:45 +08:00
|
|
|
ret = 0;
|
2008-12-10 22:10:46 +08:00
|
|
|
|
|
|
|
key.offset = end_byte - 1;
|
|
|
|
} else {
|
2023-09-12 20:04:29 +08:00
|
|
|
truncate_one_csum(trans, path, &key, bytenr, len);
|
2008-12-17 02:51:01 +08:00
|
|
|
if (key.offset < bytenr)
|
|
|
|
break;
|
2008-12-10 22:10:46 +08:00
|
|
|
}
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-12-10 22:10:46 +08:00
|
|
|
}
|
|
|
|
btrfs_free_path(path);
|
2011-05-19 12:37:44 +08:00
|
|
|
return ret;
|
2008-12-10 22:10:46 +08:00
|
|
|
}
|
|
|
|
|
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
a checksum items in a log tree. This causes an fsync to fail with -EIO and
abort the transaction, turning the filesystem to RO mode, when syncing the
log.
For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.
The following example reproduces the problem and explains how it happens:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with 2 preallocated extents. Leave a 1M hole
# between them to ensure that we get two file extent items that will
# never be merged into a single one. The extents are contiguous on disk,
# which will later result in the checksums for their data to be merged
# into a single checksum item in the csums btree.
#
$ xfs_io -f \
-c "falloc 0 1M" \
-c "falloc 3M 3M" \
/mnt/foobar
# Now write to the second extent and leave only 1M of it as unwritten,
# which corresponds to the file range [4M, 5M[.
#
# Then fsync the file to flush delalloc and to clear full sync flag from
# the inode, so that a future fsync will use the fast code path.
#
# After the writeback triggered by the fsync we have 3 file extent items
# that point to the second extent we previously allocated:
#
# 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [3M, 4M[
#
# 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
# the file range [4M, 5M[
#
# 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [5M, 6M[
#
# All these file extent items have a generation of 6, which is the ID of
# the transaction where they were created. The split of the original file
# extent item is done at btrfs_mark_extent_written() when ordered extents
# complete for the file ranges [3M, 4M[ and [5M, 6M[.
#
$ xfs_io -c "pwrite -S 0xab 3M 1M" \
-c "pwrite -S 0xef 5M 1M" \
-c "fsync" \
/mnt/foobar
# Commit the current transaction. This wipes out the log tree created by
# the previous fsync.
sync
# Now write to the unwritten range of the second extent we allocated,
# corresponding to the file range [4M, 5M[, and fsync the file, which
# triggers the fast fsync code path.
#
# The fast fsync code path sees that there is a new extent map covering
# the file range [4M, 5M[ and therefore it will log a checksum item
# covering the range [1M, 2M[ of the second extent we allocated.
#
# Also, after the fsync finishes we no longer have the 3 file extent
# items that pointed to 3 sections of the second extent we allocated.
# Instead we end up with a single file extent item pointing to the whole
# extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
# current transaction ID). This is due to the file extent item merging we
# do when completing ordered extents into ranges that point to unwritten
# (preallocated) extents. This merging is done at
# btrfs_mark_extent_written().
#
$ xfs_io -c "pwrite -S 0xcd 4M 1M" \
-c "fsync" \
/mnt/foobar
# Now do some write to our file outside the range of the second extent
# that we allocated with fallocate() and truncate the file size from 6M
# down to 5M.
#
# The truncate operation sets the full sync runtime flag on the inode,
# forcing the next fsync to use the slow code path. It also changes the
# length of the second file extent item so that it represents the file
# range [3M, 5M[ and not the range [3M, 6M[ anymore.
#
# Finally fsync the file. Since this is a fsync that triggers the slow
# code path, it will remove all items associated to the inode from the
# log tree and then it will scan for file extent items in the
# fs/subvolume tree that have a generation matching the current
# transaction ID, which is 7. This means it will log 2 file extent
# items:
#
# 1) One for the first extent we allocated, covering the file range
# [0, 1M[
#
# 2) Another for the first 2M of the second extent we allocated,
# covering the file range [3M, 5M[
#
# When logging the first file extent item we log a single checksum item
# that has all the checksums for the entire extent.
#
# When logging the second file extent item, we also lookup for the
# checksums that are associated with the range [0, 2M[ of the second
# extent we allocated (file range [3M, 5M[), and then we log them with
# btrfs_csum_file_blocks(). However that results in ending up with a log
# that has two checksum items with ranges that overlap:
#
# 1) One for the range [1M, 2M[ of the second extent we allocated,
# corresponding to the file range [4M, 5M[, which we logged in the
# previous fsync that used the fast code path;
#
# 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
# extents, respectively, corresponding to the files ranges [0, 1M[
# and [3M, 5M[. This one was added during this last fsync that uses
# the slow code path and overlaps with the previous one logged by
# the previous fast fsync.
#
# This happens because when logging the checksums for the second
# extent, we notice they start at an offset that matches the end of the
# checksums item that we logged for the first extent, and because both
# extents are contiguous on disk, btrfs_csum_file_blocks() decides to
# extend that existing checksums item and append the checksums for the
# second extent to this item. The end result is we end up with two
# checksum items in the log tree that have overlapping ranges, as
# listed before, resulting in the fsync to fail with -EIO and aborting
# the transaction, turning the filesystem into RO mode.
#
$ xfs_io -c "pwrite -S 0xff 0 1M" \
-c "truncate 5M" \
-c "fsync" \
/mnt/foobar
fsync: Input/output error
After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:
$ dmesg
(...)
[756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
[756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
[756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
[756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160
[756289.564649] inode generation 6 size 5242880 mode 100600
[756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16
[756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53
[756289.567725] extent data disk bytenr 13631488 nr 1048576
[756289.568697] extent data offset 0 nr 1048576 ram 1048576
[756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53
[756289.570682] extent data disk bytenr 0 nr 0
[756289.571363] extent data offset 0 nr 2097152 ram 2097152
[756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53
[756289.573246] extent data disk bytenr 14680064 nr 3145728
[756289.574121] extent data offset 0 nr 2097152 ram 3145728
[756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
[756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
[756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
[756289.578644] ------------[ cut here ]------------
[756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
[756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1
[756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.595122] Code: 5d c3 e8 76 60 (...)
[756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
[756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
[756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
[756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
[756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
[756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
[756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
[756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[756289.606400] Call Trace:
[756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[756289.608040] submit_one_bio+0x61/0x70 [btrfs]
[756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs]
[756289.609258] ? free_debug_processing+0x1d5/0x240
[756289.609812] ? __module_address+0x28/0xf0
[756289.610298] ? lock_acquire+0x1a0/0x3e0
[756289.610754] ? lock_acquired+0x19f/0x430
[756289.611220] ? lock_acquire+0x1a0/0x3e0
[756289.611675] do_writepages+0x43/0xf0
[756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100
[756289.612800] __filemap_fdatawrite_range+0xc5/0x100
[756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs]
[756289.614661] ? finish_wait+0x90/0x90
[756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0
[756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
[756289.616338] ? lock_acquire+0x1a0/0x3e0
[756289.616801] ? lock_acquired+0x19f/0x430
[756289.617284] ? lock_acquire+0x1a0/0x3e0
[756289.617750] ? lock_release+0x214/0x470
[756289.618221] ? lock_acquired+0x19f/0x430
[756289.618704] ? dput+0x20/0x4a0
[756289.619079] ? dput+0x20/0x4a0
[756289.619452] ? lockref_put_or_lock+0x9/0x30
[756289.619969] ? lock_release+0x214/0x470
[756289.620445] ? lock_release+0x214/0x470
[756289.620924] ? lock_release+0x214/0x470
[756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs]
[756289.621982] do_fsync+0x38/0x70
[756289.622395] __x64_sys_fsync+0x10/0x20
[756289.622907] do_syscall_64+0x33/0x80
[756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae
[756289.624063] RIP: 0033:0x7f08b27fbb7b
[756289.624588] Code: 0f 05 48 3d 00 (...)
[756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
[756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
[756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
[756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
[756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
[756289.631819] irq event stamp: 0
[756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
[756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
[756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
[756289.637082] BTRFS info (device sdc): forced readonly
Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:
27b9a8122ff71a ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
b84b8390d6009c ("Btrfs: fix file read corruption after extent cloning and fsync")
40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree")
e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
Fix the issue by making btrfs_csum_file_blocks() taking into account the
start offset of the next checksum item when it decides to extend an
existing checksum item, so that it never extends the checksum to end at a
range that goes beyond the start range of the next checksum item.
When we can not access the next checksum item without releasing the path,
simply drop the optimization of extending the previous checksum item and
fallback to inserting a new checksum item - this happens rarely and the
optimization is not significant enough for a log tree in order to justify
the extra complexity, as it would only save a few bytes (the size of a
struct btrfs_item) of leaf space.
This behaviour is only needed when inserting into a log tree because
for the regular checksums tree we never have a case where we try to
insert a range of checksums that overlap with a range that was previously
inserted.
A test case for fstests will follow soon.
Reported-by: Philipp Fent <fent@in.tum.de>
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
CC: stable@vger.kernel.org # 5.4+
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-24 18:35:53 +08:00
|
|
|
static int find_next_csum_offset(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 *next_offset)
|
|
|
|
{
|
|
|
|
const u32 nritems = btrfs_header_nritems(path->nodes[0]);
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
int slot = path->slots[0] + 1;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (nritems == 0 || slot >= nritems) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret < 0) {
|
|
|
|
return ret;
|
|
|
|
} else if (ret > 0) {
|
|
|
|
*next_offset = (u64)-1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
slot = path->slots[0];
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
|
|
|
|
|
|
|
|
if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
|
|
|
|
found_key.type != BTRFS_EXTENT_CSUM_KEY)
|
|
|
|
*next_offset = (u64)-1;
|
|
|
|
else
|
|
|
|
*next_offset = found_key.offset;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-02-21 01:07:25 +08:00
|
|
|
int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 05:58:54 +08:00
|
|
|
struct btrfs_root *root,
|
2008-07-18 00:53:50 +08:00
|
|
|
struct btrfs_ordered_sum *sums)
|
2007-03-30 03:15:27 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2007-03-30 03:15:27 +08:00
|
|
|
struct btrfs_key file_key;
|
2007-04-16 21:22:45 +08:00
|
|
|
struct btrfs_key found_key;
|
2007-04-02 23:20:42 +08:00
|
|
|
struct btrfs_path *path;
|
2007-03-30 03:15:27 +08:00
|
|
|
struct btrfs_csum_item *item;
|
2008-02-21 01:07:25 +08:00
|
|
|
struct btrfs_csum_item *item_end;
|
2007-10-16 04:22:25 +08:00
|
|
|
struct extent_buffer *leaf = NULL;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
u64 next_offset;
|
|
|
|
u64 total_bytes = 0;
|
2007-04-16 21:22:45 +08:00
|
|
|
u64 csum_offset;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
u64 bytenr;
|
2007-10-26 03:42:56 +08:00
|
|
|
u32 ins_size;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
int index = 0;
|
|
|
|
int found_next;
|
|
|
|
int ret;
|
2020-07-02 17:27:30 +08:00
|
|
|
const u32 csum_size = fs_info->csum_size;
|
2008-02-21 01:07:25 +08:00
|
|
|
|
2007-04-02 23:20:42 +08:00
|
|
|
path = btrfs_alloc_path();
|
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 01:38:47 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-02-21 01:07:25 +08:00
|
|
|
again:
|
|
|
|
next_offset = (u64)-1;
|
|
|
|
found_next = 0;
|
2023-05-24 23:03:07 +08:00
|
|
|
bytenr = sums->logical + total_bytes;
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 05:58:54 +08:00
|
|
|
file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
file_key.offset = bytenr;
|
2014-06-05 00:41:45 +08:00
|
|
|
file_key.type = BTRFS_EXTENT_CSUM_KEY;
|
2007-04-19 04:15:28 +08:00
|
|
|
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
item = btrfs_lookup_csum(trans, root, path, bytenr, 1);
|
2007-10-16 04:22:25 +08:00
|
|
|
if (!IS_ERR(item)) {
|
2008-08-28 18:15:25 +08:00
|
|
|
ret = 0;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_end = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_csum_item);
|
|
|
|
item_end = (struct btrfs_csum_item *)((char *)item_end +
|
2021-10-22 02:58:35 +08:00
|
|
|
btrfs_item_size(leaf, path->slots[0]));
|
2007-04-19 04:15:28 +08:00
|
|
|
goto found;
|
2007-10-16 04:22:25 +08:00
|
|
|
}
|
2007-04-19 04:15:28 +08:00
|
|
|
ret = PTR_ERR(item);
|
2010-05-16 22:49:59 +08:00
|
|
|
if (ret != -EFBIG && ret != -ENOENT)
|
2020-05-18 19:15:18 +08:00
|
|
|
goto out;
|
2010-05-16 22:49:59 +08:00
|
|
|
|
2007-04-19 04:15:28 +08:00
|
|
|
if (ret == -EFBIG) {
|
|
|
|
u32 item_size;
|
|
|
|
/* we found one, but it isn't big enough yet */
|
2007-10-16 04:14:19 +08:00
|
|
|
leaf = path->nodes[0];
|
2021-10-22 02:58:35 +08:00
|
|
|
item_size = btrfs_item_size(leaf, path->slots[0]);
|
2008-12-02 20:17:45 +08:00
|
|
|
if ((item_size / csum_size) >=
|
2016-06-23 06:54:23 +08:00
|
|
|
MAX_CSUM_ITEMS(fs_info, csum_size)) {
|
2007-04-19 04:15:28 +08:00
|
|
|
/* already at max size, make a new one */
|
|
|
|
goto insert;
|
|
|
|
}
|
|
|
|
} else {
|
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
a checksum items in a log tree. This causes an fsync to fail with -EIO and
abort the transaction, turning the filesystem to RO mode, when syncing the
log.
For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.
The following example reproduces the problem and explains how it happens:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with 2 preallocated extents. Leave a 1M hole
# between them to ensure that we get two file extent items that will
# never be merged into a single one. The extents are contiguous on disk,
# which will later result in the checksums for their data to be merged
# into a single checksum item in the csums btree.
#
$ xfs_io -f \
-c "falloc 0 1M" \
-c "falloc 3M 3M" \
/mnt/foobar
# Now write to the second extent and leave only 1M of it as unwritten,
# which corresponds to the file range [4M, 5M[.
#
# Then fsync the file to flush delalloc and to clear full sync flag from
# the inode, so that a future fsync will use the fast code path.
#
# After the writeback triggered by the fsync we have 3 file extent items
# that point to the second extent we previously allocated:
#
# 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [3M, 4M[
#
# 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
# the file range [4M, 5M[
#
# 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [5M, 6M[
#
# All these file extent items have a generation of 6, which is the ID of
# the transaction where they were created. The split of the original file
# extent item is done at btrfs_mark_extent_written() when ordered extents
# complete for the file ranges [3M, 4M[ and [5M, 6M[.
#
$ xfs_io -c "pwrite -S 0xab 3M 1M" \
-c "pwrite -S 0xef 5M 1M" \
-c "fsync" \
/mnt/foobar
# Commit the current transaction. This wipes out the log tree created by
# the previous fsync.
sync
# Now write to the unwritten range of the second extent we allocated,
# corresponding to the file range [4M, 5M[, and fsync the file, which
# triggers the fast fsync code path.
#
# The fast fsync code path sees that there is a new extent map covering
# the file range [4M, 5M[ and therefore it will log a checksum item
# covering the range [1M, 2M[ of the second extent we allocated.
#
# Also, after the fsync finishes we no longer have the 3 file extent
# items that pointed to 3 sections of the second extent we allocated.
# Instead we end up with a single file extent item pointing to the whole
# extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
# current transaction ID). This is due to the file extent item merging we
# do when completing ordered extents into ranges that point to unwritten
# (preallocated) extents. This merging is done at
# btrfs_mark_extent_written().
#
$ xfs_io -c "pwrite -S 0xcd 4M 1M" \
-c "fsync" \
/mnt/foobar
# Now do some write to our file outside the range of the second extent
# that we allocated with fallocate() and truncate the file size from 6M
# down to 5M.
#
# The truncate operation sets the full sync runtime flag on the inode,
# forcing the next fsync to use the slow code path. It also changes the
# length of the second file extent item so that it represents the file
# range [3M, 5M[ and not the range [3M, 6M[ anymore.
#
# Finally fsync the file. Since this is a fsync that triggers the slow
# code path, it will remove all items associated to the inode from the
# log tree and then it will scan for file extent items in the
# fs/subvolume tree that have a generation matching the current
# transaction ID, which is 7. This means it will log 2 file extent
# items:
#
# 1) One for the first extent we allocated, covering the file range
# [0, 1M[
#
# 2) Another for the first 2M of the second extent we allocated,
# covering the file range [3M, 5M[
#
# When logging the first file extent item we log a single checksum item
# that has all the checksums for the entire extent.
#
# When logging the second file extent item, we also lookup for the
# checksums that are associated with the range [0, 2M[ of the second
# extent we allocated (file range [3M, 5M[), and then we log them with
# btrfs_csum_file_blocks(). However that results in ending up with a log
# that has two checksum items with ranges that overlap:
#
# 1) One for the range [1M, 2M[ of the second extent we allocated,
# corresponding to the file range [4M, 5M[, which we logged in the
# previous fsync that used the fast code path;
#
# 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
# extents, respectively, corresponding to the files ranges [0, 1M[
# and [3M, 5M[. This one was added during this last fsync that uses
# the slow code path and overlaps with the previous one logged by
# the previous fast fsync.
#
# This happens because when logging the checksums for the second
# extent, we notice they start at an offset that matches the end of the
# checksums item that we logged for the first extent, and because both
# extents are contiguous on disk, btrfs_csum_file_blocks() decides to
# extend that existing checksums item and append the checksums for the
# second extent to this item. The end result is we end up with two
# checksum items in the log tree that have overlapping ranges, as
# listed before, resulting in the fsync to fail with -EIO and aborting
# the transaction, turning the filesystem into RO mode.
#
$ xfs_io -c "pwrite -S 0xff 0 1M" \
-c "truncate 5M" \
-c "fsync" \
/mnt/foobar
fsync: Input/output error
After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:
$ dmesg
(...)
[756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
[756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
[756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
[756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160
[756289.564649] inode generation 6 size 5242880 mode 100600
[756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16
[756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53
[756289.567725] extent data disk bytenr 13631488 nr 1048576
[756289.568697] extent data offset 0 nr 1048576 ram 1048576
[756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53
[756289.570682] extent data disk bytenr 0 nr 0
[756289.571363] extent data offset 0 nr 2097152 ram 2097152
[756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53
[756289.573246] extent data disk bytenr 14680064 nr 3145728
[756289.574121] extent data offset 0 nr 2097152 ram 3145728
[756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
[756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
[756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
[756289.578644] ------------[ cut here ]------------
[756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
[756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1
[756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.595122] Code: 5d c3 e8 76 60 (...)
[756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
[756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
[756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
[756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
[756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
[756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
[756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
[756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[756289.606400] Call Trace:
[756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[756289.608040] submit_one_bio+0x61/0x70 [btrfs]
[756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs]
[756289.609258] ? free_debug_processing+0x1d5/0x240
[756289.609812] ? __module_address+0x28/0xf0
[756289.610298] ? lock_acquire+0x1a0/0x3e0
[756289.610754] ? lock_acquired+0x19f/0x430
[756289.611220] ? lock_acquire+0x1a0/0x3e0
[756289.611675] do_writepages+0x43/0xf0
[756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100
[756289.612800] __filemap_fdatawrite_range+0xc5/0x100
[756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs]
[756289.614661] ? finish_wait+0x90/0x90
[756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0
[756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
[756289.616338] ? lock_acquire+0x1a0/0x3e0
[756289.616801] ? lock_acquired+0x19f/0x430
[756289.617284] ? lock_acquire+0x1a0/0x3e0
[756289.617750] ? lock_release+0x214/0x470
[756289.618221] ? lock_acquired+0x19f/0x430
[756289.618704] ? dput+0x20/0x4a0
[756289.619079] ? dput+0x20/0x4a0
[756289.619452] ? lockref_put_or_lock+0x9/0x30
[756289.619969] ? lock_release+0x214/0x470
[756289.620445] ? lock_release+0x214/0x470
[756289.620924] ? lock_release+0x214/0x470
[756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs]
[756289.621982] do_fsync+0x38/0x70
[756289.622395] __x64_sys_fsync+0x10/0x20
[756289.622907] do_syscall_64+0x33/0x80
[756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae
[756289.624063] RIP: 0033:0x7f08b27fbb7b
[756289.624588] Code: 0f 05 48 3d 00 (...)
[756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
[756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
[756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
[756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
[756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
[756289.631819] irq event stamp: 0
[756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
[756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
[756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
[756289.637082] BTRFS info (device sdc): forced readonly
Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:
27b9a8122ff71a ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
b84b8390d6009c ("Btrfs: fix file read corruption after extent cloning and fsync")
40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree")
e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
Fix the issue by making btrfs_csum_file_blocks() taking into account the
start offset of the next checksum item when it decides to extend an
existing checksum item, so that it never extends the checksum to end at a
range that goes beyond the start range of the next checksum item.
When we can not access the next checksum item without releasing the path,
simply drop the optimization of extending the previous checksum item and
fallback to inserting a new checksum item - this happens rarely and the
optimization is not significant enough for a log tree in order to justify
the extra complexity, as it would only save a few bytes (the size of a
struct btrfs_item) of leaf space.
This behaviour is only needed when inserting into a log tree because
for the regular checksums tree we never have a case where we try to
insert a range of checksums that overlap with a range that was previously
inserted.
A test case for fstests will follow soon.
Reported-by: Philipp Fent <fent@in.tum.de>
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
CC: stable@vger.kernel.org # 5.4+
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-24 18:35:53 +08:00
|
|
|
/* We didn't find a csum item, insert one. */
|
|
|
|
ret = find_next_csum_offset(root, path, &next_offset);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2007-10-26 03:42:56 +08:00
|
|
|
found_next = 1;
|
2007-04-19 04:15:28 +08:00
|
|
|
goto insert;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
btrfs: make checksum item extension more efficient
When we want to add checksums into the checksums tree, or a log tree, we
try whenever possible to extend existing checksum items, as this helps
reduce amount of metadata space used, since adding a new item uses extra
metadata space for a btrfs_item structure (25 bytes).
However we have two inefficiencies in the current approach:
1) After finding a checksum item that covers a range with an end offset
that matches the start offset of the checksum range we want to insert,
we release the search path populated by btrfs_lookup_csum() and then
do another COW search on tree with the goal of getting additional
space for at least one checksum. Doing this path release and then
searching again is a waste of time because very often the leaf already
has enough free space for at least one more checksum;
2) After the COW search that guarantees we get free space in the leaf for
at least one more checksum, we end up not doing the extension of the
previous checksum item, and fallback to insertion of a new checksum
item, if the leaf doesn't have an amount of free space larger then the
space required for 2 checksums plus one btrfs_item structure - this is
pointless for two reasons:
a) We want to extend an existing item, so we don't need to account for
a btrfs_item structure (25 bytes);
b) We made the COW search with an insertion size for 1 single checksum,
so if the leaf ends up with a free space amount smaller then 2
checksums plus the size of a btrfs_item structure, we give up on the
extension of the existing item and jump to the 'insert' label, where
we end up releasing the path and then doing yet another search to
insert a new checksum item for a single checksum.
Fix these inefficiencies by doing the following:
- For case 1), before releasing the path just check if the leaf already
has enough space for at least 1 more checksum, and if it does, jump
directly to the item extension code, with releasing our current path,
which was already COWed by btrfs_lookup_csum();
- For case 2), fix the logic so that for item extension we require only
that the leaf has enough free space for 1 checksum, and not a minimum
of 2 checksums plus space for a btrfs_item structure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 19:15:00 +08:00
|
|
|
* At this point, we know the tree has a checksum item that ends at an
|
|
|
|
* offset matching the start of the checksum range we want to insert.
|
|
|
|
* We try to extend that item as much as possible and then add as many
|
|
|
|
* checksums to it as they fit.
|
|
|
|
*
|
|
|
|
* First check if the leaf has enough free space for at least one
|
|
|
|
* checksum. If it has go directly to the item extension code, otherwise
|
|
|
|
* release the path and do a search for insertion before the extension.
|
2007-04-19 04:15:28 +08:00
|
|
|
*/
|
btrfs: make checksum item extension more efficient
When we want to add checksums into the checksums tree, or a log tree, we
try whenever possible to extend existing checksum items, as this helps
reduce amount of metadata space used, since adding a new item uses extra
metadata space for a btrfs_item structure (25 bytes).
However we have two inefficiencies in the current approach:
1) After finding a checksum item that covers a range with an end offset
that matches the start offset of the checksum range we want to insert,
we release the search path populated by btrfs_lookup_csum() and then
do another COW search on tree with the goal of getting additional
space for at least one checksum. Doing this path release and then
searching again is a waste of time because very often the leaf already
has enough free space for at least one more checksum;
2) After the COW search that guarantees we get free space in the leaf for
at least one more checksum, we end up not doing the extension of the
previous checksum item, and fallback to insertion of a new checksum
item, if the leaf doesn't have an amount of free space larger then the
space required for 2 checksums plus one btrfs_item structure - this is
pointless for two reasons:
a) We want to extend an existing item, so we don't need to account for
a btrfs_item structure (25 bytes);
b) We made the COW search with an insertion size for 1 single checksum,
so if the leaf ends up with a free space amount smaller then 2
checksums plus the size of a btrfs_item structure, we give up on the
extension of the existing item and jump to the 'insert' label, where
we end up releasing the path and then doing yet another search to
insert a new checksum item for a single checksum.
Fix these inefficiencies by doing the following:
- For case 1), before releasing the path just check if the leaf already
has enough space for at least 1 more checksum, and if it does, jump
directly to the item extension code, with releasing our current path,
which was already COWed by btrfs_lookup_csum();
- For case 2), fix the logic so that for item extension we require only
that the leaf has enough free space for 1 checksum, and not a minimum
of 2 checksums plus space for a btrfs_item structure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 19:15:00 +08:00
|
|
|
if (btrfs_leaf_free_space(leaf) >= csum_size) {
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
|
|
|
csum_offset = (bytenr - found_key.offset) >>
|
2020-07-02 03:19:09 +08:00
|
|
|
fs_info->sectorsize_bits;
|
btrfs: make checksum item extension more efficient
When we want to add checksums into the checksums tree, or a log tree, we
try whenever possible to extend existing checksum items, as this helps
reduce amount of metadata space used, since adding a new item uses extra
metadata space for a btrfs_item structure (25 bytes).
However we have two inefficiencies in the current approach:
1) After finding a checksum item that covers a range with an end offset
that matches the start offset of the checksum range we want to insert,
we release the search path populated by btrfs_lookup_csum() and then
do another COW search on tree with the goal of getting additional
space for at least one checksum. Doing this path release and then
searching again is a waste of time because very often the leaf already
has enough free space for at least one more checksum;
2) After the COW search that guarantees we get free space in the leaf for
at least one more checksum, we end up not doing the extension of the
previous checksum item, and fallback to insertion of a new checksum
item, if the leaf doesn't have an amount of free space larger then the
space required for 2 checksums plus one btrfs_item structure - this is
pointless for two reasons:
a) We want to extend an existing item, so we don't need to account for
a btrfs_item structure (25 bytes);
b) We made the COW search with an insertion size for 1 single checksum,
so if the leaf ends up with a free space amount smaller then 2
checksums plus the size of a btrfs_item structure, we give up on the
extension of the existing item and jump to the 'insert' label, where
we end up releasing the path and then doing yet another search to
insert a new checksum item for a single checksum.
Fix these inefficiencies by doing the following:
- For case 1), before releasing the path just check if the leaf already
has enough space for at least 1 more checksum, and if it does, jump
directly to the item extension code, with releasing our current path,
which was already COWed by btrfs_lookup_csum();
- For case 2), fix the logic so that for item extension we require only
that the leaf has enough free space for 1 checksum, and not a minimum
of 2 checksums plus space for a btrfs_item structure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 19:15:00 +08:00
|
|
|
goto extend_csum;
|
|
|
|
}
|
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.
item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.
When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.
And split_leaf return EOVERFLOW from following code:
if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;
In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.
However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:
data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.
The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass check here but fail
later at btrfs_search_slot. Rename fails and volume is forced readonly
[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with following
python script.
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
it under a btrfs volume will trigger the abort transaction. It simply
creates files and rename them to forged names that leads to
hash collision.
There are two ways to fix this. One is to simply revert the patch
878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the
condition consistent although that patch is correct about the size.
The other way is to handle the leaf space check correctly when
collision happens. I prefer the second one since it correct leaf
space check in collision case. This fix will not account
sizeof(struct btrfs_item) when the item already exists.
There are two places where ins_len doesn't contain
sizeof(struct btrfs_item), however.
1. extent-tree.c: lookup_inline_extent_backref
2. file-item.c: btrfs_csum_file_blocks
to make the logic of btrfs_search_slot more clear, we add a flag
search_for_extension in btrfs_path.
This flag indicates that ins_len passed to btrfs_search_slot doesn't
contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
will use the actual size needed to calculate the required leaf space.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-01 17:25:12 +08:00
|
|
|
path->search_for_extension = 1;
|
2007-04-16 21:22:45 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &file_key, path,
|
2008-12-02 20:17:45 +08:00
|
|
|
csum_size, 1);
|
btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.
item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.
When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.
And split_leaf return EOVERFLOW from following code:
if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;
In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.
However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:
data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.
The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass check here but fail
later at btrfs_search_slot. Rename fails and volume is forced readonly
[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with following
python script.
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
it under a btrfs volume will trigger the abort transaction. It simply
creates files and rename them to forged names that leads to
hash collision.
There are two ways to fix this. One is to simply revert the patch
878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the
condition consistent although that patch is correct about the size.
The other way is to handle the leaf space check correctly when
collision happens. I prefer the second one since it correct leaf
space check in collision case. This fix will not account
sizeof(struct btrfs_item) when the item already exists.
There are two places where ins_len doesn't contain
sizeof(struct btrfs_item), however.
1. extent-tree.c: lookup_inline_extent_backref
2. file-item.c: btrfs_csum_file_blocks
to make the logic of btrfs_search_slot more clear, we add a flag
search_for_extension in btrfs_path.
This flag indicates that ins_len passed to btrfs_search_slot doesn't
contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
will use the actual size needed to calculate the required leaf space.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-01 17:25:12 +08:00
|
|
|
path->search_for_extension = 0;
|
2007-04-16 21:22:45 +08:00
|
|
|
if (ret < 0)
|
2020-05-18 19:15:18 +08:00
|
|
|
goto out;
|
2008-12-10 22:10:46 +08:00
|
|
|
|
|
|
|
if (ret > 0) {
|
|
|
|
if (path->slots[0] == 0)
|
|
|
|
goto insert;
|
|
|
|
path->slots[0]--;
|
2007-04-16 21:22:45 +08:00
|
|
|
}
|
2008-12-10 22:10:46 +08:00
|
|
|
|
2007-10-16 04:14:19 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
2020-07-02 03:19:09 +08:00
|
|
|
csum_offset = (bytenr - found_key.offset) >> fs_info->sectorsize_bits;
|
2008-12-10 22:10:46 +08:00
|
|
|
|
2014-06-05 00:41:45 +08:00
|
|
|
if (found_key.type != BTRFS_EXTENT_CSUM_KEY ||
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 05:58:54 +08:00
|
|
|
found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
|
2016-06-23 06:54:23 +08:00
|
|
|
csum_offset >= MAX_CSUM_ITEMS(fs_info, csum_size)) {
|
2007-04-16 21:22:45 +08:00
|
|
|
goto insert;
|
|
|
|
}
|
2008-12-10 22:10:46 +08:00
|
|
|
|
btrfs: make checksum item extension more efficient
When we want to add checksums into the checksums tree, or a log tree, we
try whenever possible to extend existing checksum items, as this helps
reduce amount of metadata space used, since adding a new item uses extra
metadata space for a btrfs_item structure (25 bytes).
However we have two inefficiencies in the current approach:
1) After finding a checksum item that covers a range with an end offset
that matches the start offset of the checksum range we want to insert,
we release the search path populated by btrfs_lookup_csum() and then
do another COW search on tree with the goal of getting additional
space for at least one checksum. Doing this path release and then
searching again is a waste of time because very often the leaf already
has enough free space for at least one more checksum;
2) After the COW search that guarantees we get free space in the leaf for
at least one more checksum, we end up not doing the extension of the
previous checksum item, and fallback to insertion of a new checksum
item, if the leaf doesn't have an amount of free space larger then the
space required for 2 checksums plus one btrfs_item structure - this is
pointless for two reasons:
a) We want to extend an existing item, so we don't need to account for
a btrfs_item structure (25 bytes);
b) We made the COW search with an insertion size for 1 single checksum,
so if the leaf ends up with a free space amount smaller then 2
checksums plus the size of a btrfs_item structure, we give up on the
extension of the existing item and jump to the 'insert' label, where
we end up releasing the path and then doing yet another search to
insert a new checksum item for a single checksum.
Fix these inefficiencies by doing the following:
- For case 1), before releasing the path just check if the leaf already
has enough space for at least 1 more checksum, and if it does, jump
directly to the item extension code, with releasing our current path,
which was already COWed by btrfs_lookup_csum();
- For case 2), fix the logic so that for item extension we require only
that the leaf has enough free space for 1 checksum, and not a minimum
of 2 checksums plus space for a btrfs_item structure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 19:15:00 +08:00
|
|
|
extend_csum:
|
2021-10-22 02:58:35 +08:00
|
|
|
if (csum_offset == btrfs_item_size(leaf, path->slots[0]) /
|
2008-12-02 20:17:45 +08:00
|
|
|
csum_size) {
|
Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-04 21:12:18 +08:00
|
|
|
int extend_nr;
|
|
|
|
u64 tmp;
|
|
|
|
u32 diff;
|
|
|
|
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
tmp = sums->len - total_bytes;
|
2020-07-02 03:19:09 +08:00
|
|
|
tmp >>= fs_info->sectorsize_bits;
|
Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-04 21:12:18 +08:00
|
|
|
WARN_ON(tmp < 1);
|
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
a checksum items in a log tree. This causes an fsync to fail with -EIO and
abort the transaction, turning the filesystem to RO mode, when syncing the
log.
For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.
The following example reproduces the problem and explains how it happens:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with 2 preallocated extents. Leave a 1M hole
# between them to ensure that we get two file extent items that will
# never be merged into a single one. The extents are contiguous on disk,
# which will later result in the checksums for their data to be merged
# into a single checksum item in the csums btree.
#
$ xfs_io -f \
-c "falloc 0 1M" \
-c "falloc 3M 3M" \
/mnt/foobar
# Now write to the second extent and leave only 1M of it as unwritten,
# which corresponds to the file range [4M, 5M[.
#
# Then fsync the file to flush delalloc and to clear full sync flag from
# the inode, so that a future fsync will use the fast code path.
#
# After the writeback triggered by the fsync we have 3 file extent items
# that point to the second extent we previously allocated:
#
# 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [3M, 4M[
#
# 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
# the file range [4M, 5M[
#
# 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [5M, 6M[
#
# All these file extent items have a generation of 6, which is the ID of
# the transaction where they were created. The split of the original file
# extent item is done at btrfs_mark_extent_written() when ordered extents
# complete for the file ranges [3M, 4M[ and [5M, 6M[.
#
$ xfs_io -c "pwrite -S 0xab 3M 1M" \
-c "pwrite -S 0xef 5M 1M" \
-c "fsync" \
/mnt/foobar
# Commit the current transaction. This wipes out the log tree created by
# the previous fsync.
sync
# Now write to the unwritten range of the second extent we allocated,
# corresponding to the file range [4M, 5M[, and fsync the file, which
# triggers the fast fsync code path.
#
# The fast fsync code path sees that there is a new extent map covering
# the file range [4M, 5M[ and therefore it will log a checksum item
# covering the range [1M, 2M[ of the second extent we allocated.
#
# Also, after the fsync finishes we no longer have the 3 file extent
# items that pointed to 3 sections of the second extent we allocated.
# Instead we end up with a single file extent item pointing to the whole
# extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
# current transaction ID). This is due to the file extent item merging we
# do when completing ordered extents into ranges that point to unwritten
# (preallocated) extents. This merging is done at
# btrfs_mark_extent_written().
#
$ xfs_io -c "pwrite -S 0xcd 4M 1M" \
-c "fsync" \
/mnt/foobar
# Now do some write to our file outside the range of the second extent
# that we allocated with fallocate() and truncate the file size from 6M
# down to 5M.
#
# The truncate operation sets the full sync runtime flag on the inode,
# forcing the next fsync to use the slow code path. It also changes the
# length of the second file extent item so that it represents the file
# range [3M, 5M[ and not the range [3M, 6M[ anymore.
#
# Finally fsync the file. Since this is a fsync that triggers the slow
# code path, it will remove all items associated to the inode from the
# log tree and then it will scan for file extent items in the
# fs/subvolume tree that have a generation matching the current
# transaction ID, which is 7. This means it will log 2 file extent
# items:
#
# 1) One for the first extent we allocated, covering the file range
# [0, 1M[
#
# 2) Another for the first 2M of the second extent we allocated,
# covering the file range [3M, 5M[
#
# When logging the first file extent item we log a single checksum item
# that has all the checksums for the entire extent.
#
# When logging the second file extent item, we also lookup for the
# checksums that are associated with the range [0, 2M[ of the second
# extent we allocated (file range [3M, 5M[), and then we log them with
# btrfs_csum_file_blocks(). However that results in ending up with a log
# that has two checksum items with ranges that overlap:
#
# 1) One for the range [1M, 2M[ of the second extent we allocated,
# corresponding to the file range [4M, 5M[, which we logged in the
# previous fsync that used the fast code path;
#
# 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
# extents, respectively, corresponding to the files ranges [0, 1M[
# and [3M, 5M[. This one was added during this last fsync that uses
# the slow code path and overlaps with the previous one logged by
# the previous fast fsync.
#
# This happens because when logging the checksums for the second
# extent, we notice they start at an offset that matches the end of the
# checksums item that we logged for the first extent, and because both
# extents are contiguous on disk, btrfs_csum_file_blocks() decides to
# extend that existing checksums item and append the checksums for the
# second extent to this item. The end result is we end up with two
# checksum items in the log tree that have overlapping ranges, as
# listed before, resulting in the fsync to fail with -EIO and aborting
# the transaction, turning the filesystem into RO mode.
#
$ xfs_io -c "pwrite -S 0xff 0 1M" \
-c "truncate 5M" \
-c "fsync" \
/mnt/foobar
fsync: Input/output error
After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:
$ dmesg
(...)
[756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
[756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
[756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
[756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160
[756289.564649] inode generation 6 size 5242880 mode 100600
[756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16
[756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53
[756289.567725] extent data disk bytenr 13631488 nr 1048576
[756289.568697] extent data offset 0 nr 1048576 ram 1048576
[756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53
[756289.570682] extent data disk bytenr 0 nr 0
[756289.571363] extent data offset 0 nr 2097152 ram 2097152
[756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53
[756289.573246] extent data disk bytenr 14680064 nr 3145728
[756289.574121] extent data offset 0 nr 2097152 ram 3145728
[756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
[756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
[756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
[756289.578644] ------------[ cut here ]------------
[756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
[756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1
[756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.595122] Code: 5d c3 e8 76 60 (...)
[756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
[756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
[756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
[756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
[756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
[756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
[756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
[756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[756289.606400] Call Trace:
[756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[756289.608040] submit_one_bio+0x61/0x70 [btrfs]
[756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs]
[756289.609258] ? free_debug_processing+0x1d5/0x240
[756289.609812] ? __module_address+0x28/0xf0
[756289.610298] ? lock_acquire+0x1a0/0x3e0
[756289.610754] ? lock_acquired+0x19f/0x430
[756289.611220] ? lock_acquire+0x1a0/0x3e0
[756289.611675] do_writepages+0x43/0xf0
[756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100
[756289.612800] __filemap_fdatawrite_range+0xc5/0x100
[756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs]
[756289.614661] ? finish_wait+0x90/0x90
[756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0
[756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
[756289.616338] ? lock_acquire+0x1a0/0x3e0
[756289.616801] ? lock_acquired+0x19f/0x430
[756289.617284] ? lock_acquire+0x1a0/0x3e0
[756289.617750] ? lock_release+0x214/0x470
[756289.618221] ? lock_acquired+0x19f/0x430
[756289.618704] ? dput+0x20/0x4a0
[756289.619079] ? dput+0x20/0x4a0
[756289.619452] ? lockref_put_or_lock+0x9/0x30
[756289.619969] ? lock_release+0x214/0x470
[756289.620445] ? lock_release+0x214/0x470
[756289.620924] ? lock_release+0x214/0x470
[756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs]
[756289.621982] do_fsync+0x38/0x70
[756289.622395] __x64_sys_fsync+0x10/0x20
[756289.622907] do_syscall_64+0x33/0x80
[756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae
[756289.624063] RIP: 0033:0x7f08b27fbb7b
[756289.624588] Code: 0f 05 48 3d 00 (...)
[756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
[756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
[756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
[756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
[756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
[756289.631819] irq event stamp: 0
[756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
[756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
[756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
[756289.637082] BTRFS info (device sdc): forced readonly
Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:
27b9a8122ff71a ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
b84b8390d6009c ("Btrfs: fix file read corruption after extent cloning and fsync")
40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree")
e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
Fix the issue by making btrfs_csum_file_blocks() taking into account the
start offset of the next checksum item when it decides to extend an
existing checksum item, so that it never extends the checksum to end at a
range that goes beyond the start range of the next checksum item.
When we can not access the next checksum item without releasing the path,
simply drop the optimization of extending the previous checksum item and
fallback to inserting a new checksum item - this happens rarely and the
optimization is not significant enough for a log tree in order to justify
the extra complexity, as it would only save a few bytes (the size of a
struct btrfs_item) of leaf space.
This behaviour is only needed when inserting into a log tree because
for the regular checksums tree we never have a case where we try to
insert a range of checksums that overlap with a range that was previously
inserted.
A test case for fstests will follow soon.
Reported-by: Philipp Fent <fent@in.tum.de>
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
CC: stable@vger.kernel.org # 5.4+
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-24 18:35:53 +08:00
|
|
|
extend_nr = max_t(int, 1, tmp);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A log tree can already have checksum items with a subset of
|
|
|
|
* the checksums we are trying to log. This can happen after
|
|
|
|
* doing a sequence of partial writes into prealloc extents and
|
|
|
|
* fsyncs in between, with a full fsync logging a larger subrange
|
|
|
|
* of an extent for which a previous fast fsync logged a smaller
|
|
|
|
* subrange. And this happens in particular due to merging file
|
|
|
|
* extent items when we complete an ordered extent for a range
|
|
|
|
* covered by a prealloc extent - this is done at
|
|
|
|
* btrfs_mark_extent_written().
|
|
|
|
*
|
|
|
|
* So if we try to extend the previous checksum item, which has
|
|
|
|
* a range that ends at the start of the range we want to insert,
|
|
|
|
* make sure we don't extend beyond the start offset of the next
|
|
|
|
* checksum item. If we are at the last item in the leaf, then
|
|
|
|
* forget the optimization of extending and add a new checksum
|
|
|
|
* item - it is not worth the complexity of releasing the path,
|
|
|
|
* getting the first key for the next leaf, repeat the btree
|
|
|
|
* search, etc, because log trees are temporary anyway and it
|
|
|
|
* would only save a few bytes of leaf space.
|
|
|
|
*/
|
2024-04-16 04:16:23 +08:00
|
|
|
if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID) {
|
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
a checksum items in a log tree. This causes an fsync to fail with -EIO and
abort the transaction, turning the filesystem to RO mode, when syncing the
log.
For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.
The following example reproduces the problem and explains how it happens:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with 2 preallocated extents. Leave a 1M hole
# between them to ensure that we get two file extent items that will
# never be merged into a single one. The extents are contiguous on disk,
# which will later result in the checksums for their data to be merged
# into a single checksum item in the csums btree.
#
$ xfs_io -f \
-c "falloc 0 1M" \
-c "falloc 3M 3M" \
/mnt/foobar
# Now write to the second extent and leave only 1M of it as unwritten,
# which corresponds to the file range [4M, 5M[.
#
# Then fsync the file to flush delalloc and to clear full sync flag from
# the inode, so that a future fsync will use the fast code path.
#
# After the writeback triggered by the fsync we have 3 file extent items
# that point to the second extent we previously allocated:
#
# 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [3M, 4M[
#
# 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
# the file range [4M, 5M[
#
# 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [5M, 6M[
#
# All these file extent items have a generation of 6, which is the ID of
# the transaction where they were created. The split of the original file
# extent item is done at btrfs_mark_extent_written() when ordered extents
# complete for the file ranges [3M, 4M[ and [5M, 6M[.
#
$ xfs_io -c "pwrite -S 0xab 3M 1M" \
-c "pwrite -S 0xef 5M 1M" \
-c "fsync" \
/mnt/foobar
# Commit the current transaction. This wipes out the log tree created by
# the previous fsync.
sync
# Now write to the unwritten range of the second extent we allocated,
# corresponding to the file range [4M, 5M[, and fsync the file, which
# triggers the fast fsync code path.
#
# The fast fsync code path sees that there is a new extent map covering
# the file range [4M, 5M[ and therefore it will log a checksum item
# covering the range [1M, 2M[ of the second extent we allocated.
#
# Also, after the fsync finishes we no longer have the 3 file extent
# items that pointed to 3 sections of the second extent we allocated.
# Instead we end up with a single file extent item pointing to the whole
# extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
# current transaction ID). This is due to the file extent item merging we
# do when completing ordered extents into ranges that point to unwritten
# (preallocated) extents. This merging is done at
# btrfs_mark_extent_written().
#
$ xfs_io -c "pwrite -S 0xcd 4M 1M" \
-c "fsync" \
/mnt/foobar
# Now do some write to our file outside the range of the second extent
# that we allocated with fallocate() and truncate the file size from 6M
# down to 5M.
#
# The truncate operation sets the full sync runtime flag on the inode,
# forcing the next fsync to use the slow code path. It also changes the
# length of the second file extent item so that it represents the file
# range [3M, 5M[ and not the range [3M, 6M[ anymore.
#
# Finally fsync the file. Since this is a fsync that triggers the slow
# code path, it will remove all items associated to the inode from the
# log tree and then it will scan for file extent items in the
# fs/subvolume tree that have a generation matching the current
# transaction ID, which is 7. This means it will log 2 file extent
# items:
#
# 1) One for the first extent we allocated, covering the file range
# [0, 1M[
#
# 2) Another for the first 2M of the second extent we allocated,
# covering the file range [3M, 5M[
#
# When logging the first file extent item we log a single checksum item
# that has all the checksums for the entire extent.
#
# When logging the second file extent item, we also lookup for the
# checksums that are associated with the range [0, 2M[ of the second
# extent we allocated (file range [3M, 5M[), and then we log them with
# btrfs_csum_file_blocks(). However that results in ending up with a log
# that has two checksum items with ranges that overlap:
#
# 1) One for the range [1M, 2M[ of the second extent we allocated,
# corresponding to the file range [4M, 5M[, which we logged in the
# previous fsync that used the fast code path;
#
# 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
# extents, respectively, corresponding to the files ranges [0, 1M[
# and [3M, 5M[. This one was added during this last fsync that uses
# the slow code path and overlaps with the previous one logged by
# the previous fast fsync.
#
# This happens because when logging the checksums for the second
# extent, we notice they start at an offset that matches the end of the
# checksums item that we logged for the first extent, and because both
# extents are contiguous on disk, btrfs_csum_file_blocks() decides to
# extend that existing checksums item and append the checksums for the
# second extent to this item. The end result is we end up with two
# checksum items in the log tree that have overlapping ranges, as
# listed before, resulting in the fsync to fail with -EIO and aborting
# the transaction, turning the filesystem into RO mode.
#
$ xfs_io -c "pwrite -S 0xff 0 1M" \
-c "truncate 5M" \
-c "fsync" \
/mnt/foobar
fsync: Input/output error
After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:
$ dmesg
(...)
[756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
[756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
[756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
[756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160
[756289.564649] inode generation 6 size 5242880 mode 100600
[756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16
[756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53
[756289.567725] extent data disk bytenr 13631488 nr 1048576
[756289.568697] extent data offset 0 nr 1048576 ram 1048576
[756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53
[756289.570682] extent data disk bytenr 0 nr 0
[756289.571363] extent data offset 0 nr 2097152 ram 2097152
[756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53
[756289.573246] extent data disk bytenr 14680064 nr 3145728
[756289.574121] extent data offset 0 nr 2097152 ram 3145728
[756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
[756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
[756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
[756289.578644] ------------[ cut here ]------------
[756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
[756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1
[756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.595122] Code: 5d c3 e8 76 60 (...)
[756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
[756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
[756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
[756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
[756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
[756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
[756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
[756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[756289.606400] Call Trace:
[756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[756289.608040] submit_one_bio+0x61/0x70 [btrfs]
[756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs]
[756289.609258] ? free_debug_processing+0x1d5/0x240
[756289.609812] ? __module_address+0x28/0xf0
[756289.610298] ? lock_acquire+0x1a0/0x3e0
[756289.610754] ? lock_acquired+0x19f/0x430
[756289.611220] ? lock_acquire+0x1a0/0x3e0
[756289.611675] do_writepages+0x43/0xf0
[756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100
[756289.612800] __filemap_fdatawrite_range+0xc5/0x100
[756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs]
[756289.614661] ? finish_wait+0x90/0x90
[756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0
[756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
[756289.616338] ? lock_acquire+0x1a0/0x3e0
[756289.616801] ? lock_acquired+0x19f/0x430
[756289.617284] ? lock_acquire+0x1a0/0x3e0
[756289.617750] ? lock_release+0x214/0x470
[756289.618221] ? lock_acquired+0x19f/0x430
[756289.618704] ? dput+0x20/0x4a0
[756289.619079] ? dput+0x20/0x4a0
[756289.619452] ? lockref_put_or_lock+0x9/0x30
[756289.619969] ? lock_release+0x214/0x470
[756289.620445] ? lock_release+0x214/0x470
[756289.620924] ? lock_release+0x214/0x470
[756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs]
[756289.621982] do_fsync+0x38/0x70
[756289.622395] __x64_sys_fsync+0x10/0x20
[756289.622907] do_syscall_64+0x33/0x80
[756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae
[756289.624063] RIP: 0033:0x7f08b27fbb7b
[756289.624588] Code: 0f 05 48 3d 00 (...)
[756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
[756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
[756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
[756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
[756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
[756289.631819] irq event stamp: 0
[756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
[756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
[756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
[756289.637082] BTRFS info (device sdc): forced readonly
Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:
27b9a8122ff71a ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
b84b8390d6009c ("Btrfs: fix file read corruption after extent cloning and fsync")
40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree")
e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
Fix the issue by making btrfs_csum_file_blocks() taking into account the
start offset of the next checksum item when it decides to extend an
existing checksum item, so that it never extends the checksum to end at a
range that goes beyond the start range of the next checksum item.
When we can not access the next checksum item without releasing the path,
simply drop the optimization of extending the previous checksum item and
fallback to inserting a new checksum item - this happens rarely and the
optimization is not significant enough for a log tree in order to justify
the extra complexity, as it would only save a few bytes (the size of a
struct btrfs_item) of leaf space.
This behaviour is only needed when inserting into a log tree because
for the regular checksums tree we never have a case where we try to
insert a range of checksums that overlap with a range that was previously
inserted.
A test case for fstests will follow soon.
Reported-by: Philipp Fent <fent@in.tum.de>
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
CC: stable@vger.kernel.org # 5.4+
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-24 18:35:53 +08:00
|
|
|
if (path->slots[0] + 1 >=
|
|
|
|
btrfs_header_nritems(path->nodes[0])) {
|
|
|
|
ret = find_next_csum_offset(root, path, &next_offset);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
found_next = 1;
|
|
|
|
goto insert;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = find_next_csum_offset(root, path, &next_offset);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
tmp = (next_offset - bytenr) >> fs_info->sectorsize_bits;
|
|
|
|
if (tmp <= INT_MAX)
|
|
|
|
extend_nr = min_t(int, extend_nr, tmp);
|
|
|
|
}
|
Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-04 21:12:18 +08:00
|
|
|
|
|
|
|
diff = (csum_offset + extend_nr) * csum_size;
|
2016-06-23 06:54:23 +08:00
|
|
|
diff = min(diff,
|
|
|
|
MAX_CSUM_ITEMS(fs_info, csum_size) * csum_size);
|
2008-12-10 22:10:46 +08:00
|
|
|
|
2021-10-22 02:58:35 +08:00
|
|
|
diff = diff - btrfs_item_size(leaf, path->slots[0]);
|
btrfs: make checksum item extension more efficient
When we want to add checksums into the checksums tree, or a log tree, we
try whenever possible to extend existing checksum items, as this helps
reduce amount of metadata space used, since adding a new item uses extra
metadata space for a btrfs_item structure (25 bytes).
However we have two inefficiencies in the current approach:
1) After finding a checksum item that covers a range with an end offset
that matches the start offset of the checksum range we want to insert,
we release the search path populated by btrfs_lookup_csum() and then
do another COW search on tree with the goal of getting additional
space for at least one checksum. Doing this path release and then
searching again is a waste of time because very often the leaf already
has enough free space for at least one more checksum;
2) After the COW search that guarantees we get free space in the leaf for
at least one more checksum, we end up not doing the extension of the
previous checksum item, and fallback to insertion of a new checksum
item, if the leaf doesn't have an amount of free space larger then the
space required for 2 checksums plus one btrfs_item structure - this is
pointless for two reasons:
a) We want to extend an existing item, so we don't need to account for
a btrfs_item structure (25 bytes);
b) We made the COW search with an insertion size for 1 single checksum,
so if the leaf ends up with a free space amount smaller then 2
checksums plus the size of a btrfs_item structure, we give up on the
extension of the existing item and jump to the 'insert' label, where
we end up releasing the path and then doing yet another search to
insert a new checksum item for a single checksum.
Fix these inefficiencies by doing the following:
- For case 1), before releasing the path just check if the leaf already
has enough space for at least 1 more checksum, and if it does, jump
directly to the item extension code, with releasing our current path,
which was already COWed by btrfs_lookup_csum();
- For case 2), fix the logic so that for item extension we require only
that the leaf has enough free space for 1 checksum, and not a minimum
of 2 checksums plus space for a btrfs_item structure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 19:15:00 +08:00
|
|
|
diff = min_t(u32, btrfs_leaf_free_space(leaf), diff);
|
Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-04 21:12:18 +08:00
|
|
|
diff /= csum_size;
|
|
|
|
diff *= csum_size;
|
2008-12-10 22:10:46 +08:00
|
|
|
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_extend_item(trans, path, diff);
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
ret = 0;
|
2007-04-16 21:22:45 +08:00
|
|
|
goto csum;
|
|
|
|
}
|
|
|
|
|
|
|
|
insert:
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2007-04-16 21:22:45 +08:00
|
|
|
csum_offset = 0;
|
2007-10-26 03:42:56 +08:00
|
|
|
if (found_next) {
|
Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-04 21:12:18 +08:00
|
|
|
u64 tmp;
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 05:58:54 +08:00
|
|
|
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
tmp = sums->len - total_bytes;
|
2020-07-02 03:19:09 +08:00
|
|
|
tmp >>= fs_info->sectorsize_bits;
|
Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-04 21:12:18 +08:00
|
|
|
tmp = min(tmp, (next_offset - file_key.offset) >>
|
2020-07-02 03:19:09 +08:00
|
|
|
fs_info->sectorsize_bits);
|
Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-04 21:12:18 +08:00
|
|
|
|
2016-12-15 21:38:28 +08:00
|
|
|
tmp = max_t(u64, 1, tmp);
|
|
|
|
tmp = min_t(u64, tmp, MAX_CSUM_ITEMS(fs_info, csum_size));
|
2008-12-02 20:17:45 +08:00
|
|
|
ins_size = csum_size * tmp;
|
2007-10-26 03:42:56 +08:00
|
|
|
} else {
|
2008-12-02 20:17:45 +08:00
|
|
|
ins_size = csum_size;
|
2007-10-26 03:42:56 +08:00
|
|
|
}
|
2007-04-02 23:20:42 +08:00
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &file_key,
|
2007-10-26 03:42:56 +08:00
|
|
|
ins_size);
|
2007-06-23 02:16:25 +08:00
|
|
|
if (ret < 0)
|
2020-05-18 19:15:18 +08:00
|
|
|
goto out;
|
2007-10-16 04:14:19 +08:00
|
|
|
leaf = path->nodes[0];
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
csum:
|
2007-10-16 04:14:19 +08:00
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_csum_item);
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
item_end = (struct btrfs_csum_item *)((unsigned char *)item +
|
2021-10-22 02:58:35 +08:00
|
|
|
btrfs_item_size(leaf, path->slots[0]));
|
2007-05-11 00:36:17 +08:00
|
|
|
item = (struct btrfs_csum_item *)((unsigned char *)item +
|
2008-12-02 20:17:45 +08:00
|
|
|
csum_offset * csum_size);
|
2007-04-18 01:26:50 +08:00
|
|
|
found:
|
2020-07-02 03:19:09 +08:00
|
|
|
ins_size = (u32)(sums->len - total_bytes) >> fs_info->sectorsize_bits;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
ins_size *= csum_size;
|
|
|
|
ins_size = min_t(u32, (unsigned long)item_end - (unsigned long)item,
|
|
|
|
ins_size);
|
|
|
|
write_extent_buffer(leaf, sums->sums + index, (unsigned long)item,
|
|
|
|
ins_size);
|
|
|
|
|
2019-05-22 16:19:01 +08:00
|
|
|
index += ins_size;
|
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
|
|
|
ins_size /= csum_size;
|
2016-06-23 06:54:23 +08:00
|
|
|
total_bytes += ins_size * fs_info->sectorsize;
|
2011-07-20 00:04:14 +08:00
|
|
|
|
2023-09-12 20:04:29 +08:00
|
|
|
btrfs_mark_buffer_dirty(trans, path->nodes[0]);
|
2008-07-18 00:53:50 +08:00
|
|
|
if (total_bytes < sums->len) {
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-03-13 23:00:37 +08:00
|
|
|
cond_resched();
|
2008-02-21 01:07:25 +08:00
|
|
|
goto again;
|
|
|
|
}
|
2008-08-16 03:34:18 +08:00
|
|
|
out:
|
2007-04-02 23:20:42 +08:00
|
|
|
btrfs_free_path(path);
|
2007-03-30 03:15:27 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2014-06-09 10:48:05 +08:00
|
|
|
|
2017-02-20 19:51:02 +08:00
|
|
|
void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
|
2014-06-09 10:48:05 +08:00
|
|
|
const struct btrfs_path *path,
|
|
|
|
struct btrfs_file_extent_item *fi,
|
|
|
|
struct extent_map *em)
|
|
|
|
{
|
2018-06-29 16:56:42 +08:00
|
|
|
struct btrfs_fs_info *fs_info = inode->root->fs_info;
|
2017-02-20 19:51:02 +08:00
|
|
|
struct btrfs_root *root = inode->root;
|
2014-06-09 10:48:05 +08:00
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
const int slot = path->slots[0];
|
|
|
|
struct btrfs_key key;
|
2024-04-02 14:00:15 +08:00
|
|
|
u64 extent_start;
|
2014-06-09 10:48:05 +08:00
|
|
|
u64 bytenr;
|
|
|
|
u8 type = btrfs_file_extent_type(leaf, fi);
|
|
|
|
int compress_type = btrfs_file_extent_compression(leaf, fi);
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
extent_start = key.offset;
|
|
|
|
em->ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
|
2022-02-08 13:31:19 +08:00
|
|
|
em->generation = btrfs_file_extent_generation(leaf, fi);
|
2014-06-09 10:48:05 +08:00
|
|
|
if (type == BTRFS_FILE_EXTENT_REG ||
|
|
|
|
type == BTRFS_FILE_EXTENT_PREALLOC) {
|
|
|
|
em->start = extent_start;
|
2024-04-02 14:00:15 +08:00
|
|
|
em->len = btrfs_file_extent_end(path) - extent_start;
|
2014-06-09 10:48:05 +08:00
|
|
|
em->orig_start = extent_start -
|
|
|
|
btrfs_file_extent_offset(leaf, fi);
|
2024-04-30 06:23:00 +08:00
|
|
|
em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
|
2014-06-09 10:48:05 +08:00
|
|
|
bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
|
|
|
|
if (bytenr == 0) {
|
|
|
|
em->block_start = EXTENT_MAP_HOLE;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (compress_type != BTRFS_COMPRESS_NONE) {
|
btrfs: use the flags of an extent map to identify the compression type
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-05 00:20:33 +08:00
|
|
|
extent_map_set_compression(em, compress_type);
|
2014-06-09 10:48:05 +08:00
|
|
|
em->block_start = bytenr;
|
2024-04-30 06:23:00 +08:00
|
|
|
em->block_len = em->disk_num_bytes;
|
2014-06-09 10:48:05 +08:00
|
|
|
} else {
|
|
|
|
bytenr += btrfs_file_extent_offset(leaf, fi);
|
|
|
|
em->block_start = bytenr;
|
|
|
|
em->block_len = em->len;
|
|
|
|
if (type == BTRFS_FILE_EXTENT_PREALLOC)
|
btrfs: use the flags of an extent map to identify the compression type
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-05 00:20:33 +08:00
|
|
|
em->flags |= EXTENT_FLAG_PREALLOC;
|
2014-06-09 10:48:05 +08:00
|
|
|
}
|
|
|
|
} else if (type == BTRFS_FILE_EXTENT_INLINE) {
|
2024-04-02 14:00:15 +08:00
|
|
|
/* Tree-checker has ensured this. */
|
|
|
|
ASSERT(extent_start == 0);
|
|
|
|
|
2014-06-09 10:48:05 +08:00
|
|
|
em->block_start = EXTENT_MAP_INLINE;
|
2024-04-02 14:00:15 +08:00
|
|
|
em->start = 0;
|
|
|
|
em->len = fs_info->sectorsize;
|
2014-06-09 10:48:05 +08:00
|
|
|
/*
|
|
|
|
* Initialize orig_start and block_len with the same values
|
|
|
|
* as in inode.c:btrfs_get_extent().
|
|
|
|
*/
|
|
|
|
em->orig_start = EXTENT_MAP_HOLE;
|
|
|
|
em->block_len = (u64)-1;
|
btrfs: use the flags of an extent map to identify the compression type
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-05 00:20:33 +08:00
|
|
|
extent_map_set_compression(em, compress_type);
|
2014-06-09 10:48:05 +08:00
|
|
|
} else {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_err(fs_info,
|
2017-02-20 19:51:02 +08:00
|
|
|
"unknown file extent item type %d, inode %llu, offset %llu, "
|
|
|
|
"root %llu", type, btrfs_ino(inode), extent_start,
|
2024-04-16 04:16:23 +08:00
|
|
|
btrfs_root_id(root));
|
2014-06-09 10:48:05 +08:00
|
|
|
}
|
|
|
|
}
|
2020-03-09 20:41:06 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns the end offset (non inclusive) of the file extent item the given path
|
|
|
|
* points to. If it points to an inline extent, the returned offset is rounded
|
|
|
|
* up to the sector size.
|
|
|
|
*/
|
|
|
|
u64 btrfs_file_extent_end(const struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
const struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
const int slot = path->slots[0];
|
|
|
|
struct btrfs_file_extent_item *fi;
|
|
|
|
struct btrfs_key key;
|
|
|
|
u64 end;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
ASSERT(key.type == BTRFS_EXTENT_DATA_KEY);
|
|
|
|
fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
|
|
|
|
|
2024-04-02 14:00:15 +08:00
|
|
|
if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_INLINE)
|
|
|
|
end = leaf->fs_info->sectorsize;
|
|
|
|
else
|
2020-03-09 20:41:06 +08:00
|
|
|
end = key.offset + btrfs_file_extent_num_bytes(leaf, fi);
|
|
|
|
|
|
|
|
return end;
|
|
|
|
}
|