2018-04-04 01:23:33 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2012 Alexander Block. All rights reserved.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/bsearch.h>
|
|
|
|
#include <linux/fs.h>
|
|
|
|
#include <linux/file.h>
|
|
|
|
#include <linux/sort.h>
|
|
|
|
#include <linux/mount.h>
|
|
|
|
#include <linux/xattr.h>
|
|
|
|
#include <linux/posix_acl_xattr.h>
|
2022-07-15 19:59:38 +08:00
|
|
|
#include <linux/radix-tree.h>
|
2012-07-27 08:11:13 +08:00
|
|
|
#include <linux/vmalloc.h>
|
2013-08-21 15:32:13 +08:00
|
|
|
#include <linux/string.h>
|
2017-09-27 22:43:13 +08:00
|
|
|
#include <linux/compat.h>
|
btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-08 17:45:05 +08:00
|
|
|
#include <linux/crc32c.h>
|
2022-08-16 04:54:28 +08:00
|
|
|
#include <linux/fsverity.h>
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
#include "send.h"
|
2022-06-02 21:28:41 +08:00
|
|
|
#include "ctree.h"
|
2012-07-26 05:19:24 +08:00
|
|
|
#include "backref.h"
|
|
|
|
#include "locking.h"
|
|
|
|
#include "disk-io.h"
|
|
|
|
#include "btrfs_inode.h"
|
|
|
|
#include "transaction.h"
|
2016-03-10 17:26:59 +08:00
|
|
|
#include "compression.h"
|
btrfs: send: emit file capabilities after chown
Whenever a chown is executed, all capabilities of the file being touched
are lost. When doing incremental send with a file with capabilities,
there is a situation where the capability can be lost on the receiving
side. The sequence of actions bellow shows the problem:
$ mount /dev/sda fs1
$ mount /dev/sdb fs2
$ touch fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_init
$ btrfs send fs1/snap_init | btrfs receive fs2
$ chgrp adm fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_complete
$ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
$ btrfs send fs1/snap_complete | btrfs receive fs2
$ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
At this point, only a chown was emitted by "btrfs send" since only the
group was changed. This makes the cap_sys_nice capability to be dropped
from fs2/snap_incremental/foo.bar
To fix that, only emit capabilities after chown is emitted. The current
code first checks for xattrs that are new/changed, emits them, and later
emit the chown. Now, __process_new_xattr skips capabilities, letting
only finish_inode_if_needed to emit them, if they exist, for the inode
being processed.
This behavior was being worked around in "btrfs receive" side by caching
the capability and only applying it after chown. Now, xattrs are only
emmited _after_ chown, making that workaround not needed anymore.
Link: https://github.com/kdave/btrfs-progs/issues/202
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-11 10:15:07 +08:00
|
|
|
#include "xattr.h"
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
#include "print-tree.h"
|
2022-10-19 22:51:00 +08:00
|
|
|
#include "accessors.h"
|
2022-10-27 03:08:26 +08:00
|
|
|
#include "dir-item.h"
|
2022-10-27 03:08:27 +08:00
|
|
|
#include "file-item.h"
|
2022-10-27 03:08:29 +08:00
|
|
|
#include "ioctl.h"
|
2022-10-27 03:08:37 +08:00
|
|
|
#include "verity.h"
|
2023-01-11 19:36:13 +08:00
|
|
|
#include "lru_cache.h"
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2019-10-30 20:23:01 +08:00
|
|
|
/*
|
|
|
|
* Maximum number of references an extent can have in order for us to attempt to
|
|
|
|
* issue clone operations instead of write operations. This currently exists to
|
|
|
|
* avoid hitting limitations of the backreference walking code (taking a lot of
|
|
|
|
* time and using too much memory for extents with large number of references).
|
|
|
|
*/
|
2022-11-02 00:15:54 +08:00
|
|
|
#define SEND_MAX_EXTENT_REFS 1024
|
2019-10-30 20:23:01 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* A fs_path is a helper to dynamically build path names with unknown size.
|
|
|
|
* It reallocates the internal buffer on demand.
|
|
|
|
* It allows fast adding of path elements on the right side (normal path) and
|
|
|
|
* fast adding to the left side (reversed path). A reversed path can also be
|
|
|
|
* unreversed if needed.
|
|
|
|
*/
|
|
|
|
struct fs_path {
|
|
|
|
union {
|
|
|
|
struct {
|
|
|
|
char *start;
|
|
|
|
char *end;
|
|
|
|
|
|
|
|
char *buf;
|
2014-02-04 02:23:47 +08:00
|
|
|
unsigned short buf_len:15;
|
|
|
|
unsigned short reversed:1;
|
2012-07-26 05:19:24 +08:00
|
|
|
char inline_buf[];
|
|
|
|
};
|
2014-02-05 23:17:34 +08:00
|
|
|
/*
|
|
|
|
* Average path length does not exceed 200 bytes, we'll have
|
|
|
|
* better packing in the slab and higher chance to satisfy
|
|
|
|
* a allocation later during send.
|
|
|
|
*/
|
|
|
|
char pad[256];
|
2012-07-26 05:19:24 +08:00
|
|
|
};
|
|
|
|
};
|
|
|
|
#define FS_PATH_INLINE_SIZE \
|
|
|
|
(sizeof(struct fs_path) - offsetof(struct fs_path, inline_buf))
|
|
|
|
|
|
|
|
|
|
|
|
/* reused for each extent */
|
|
|
|
struct clone_root {
|
|
|
|
struct btrfs_root *root;
|
|
|
|
u64 ino;
|
|
|
|
u64 offset;
|
2022-11-02 00:15:45 +08:00
|
|
|
u64 num_bytes;
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
bool found_ref;
|
2012-07-26 05:19:24 +08:00
|
|
|
};
|
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
#define SEND_MAX_NAME_CACHE_SIZE 256
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
/*
|
2023-01-11 19:36:19 +08:00
|
|
|
* Limit the root_ids array of struct backref_cache_entry to 17 elements.
|
|
|
|
* This makes the size of a cache entry to be exactly 192 bytes on x86_64, which
|
|
|
|
* can be satisfied from the kmalloc-192 slab, without wasting any space.
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
* The most common case is to have a single root for cloning, which corresponds
|
2023-01-11 19:36:19 +08:00
|
|
|
* to the send root. Having the user specify more than 16 clone roots is not
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
* common, and in such rare cases we simply don't use caching if the number of
|
2023-01-11 19:36:19 +08:00
|
|
|
* cloning roots that lead down to a leaf is more than 17.
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
*/
|
2023-01-11 19:36:19 +08:00
|
|
|
#define SEND_MAX_BACKREF_CACHE_ROOTS 17
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Max number of entries in the cache.
|
2023-01-11 19:36:19 +08:00
|
|
|
* With SEND_MAX_BACKREF_CACHE_ROOTS as 17, the size in bytes, excluding
|
|
|
|
* maple tree's internal nodes, is 24K.
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
*/
|
|
|
|
#define SEND_MAX_BACKREF_CACHE_SIZE 128
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A backref cache entry maps a leaf to a list of IDs of roots from which the
|
|
|
|
* leaf is accessible and we can use for clone operations.
|
|
|
|
* With SEND_MAX_BACKREF_CACHE_ROOTS as 12, each cache entry is 128 bytes (on
|
|
|
|
* x86_64).
|
|
|
|
*/
|
|
|
|
struct backref_cache_entry {
|
2023-01-11 19:36:13 +08:00
|
|
|
struct btrfs_lru_cache_entry entry;
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
u64 root_ids[SEND_MAX_BACKREF_CACHE_ROOTS];
|
|
|
|
/* Number of valid elements in the root_ids array. */
|
|
|
|
int num_roots;
|
|
|
|
};
|
|
|
|
|
2023-01-11 19:36:13 +08:00
|
|
|
/* See the comment at lru_cache.h about struct btrfs_lru_cache_entry. */
|
|
|
|
static_assert(offsetof(struct backref_cache_entry, entry) == 0);
|
|
|
|
|
2023-01-11 19:36:15 +08:00
|
|
|
/*
|
|
|
|
* Max number of entries in the cache that stores directories that were already
|
|
|
|
* created. The cache uses raw struct btrfs_lru_cache_entry entries, so it uses
|
2023-01-11 19:36:16 +08:00
|
|
|
* at most 4096 bytes - sizeof(struct btrfs_lru_cache_entry) is 48 bytes, but
|
2023-01-11 19:36:15 +08:00
|
|
|
* the kmalloc-64 slab is used, so we get 4096 bytes (64 bytes * 64).
|
|
|
|
*/
|
|
|
|
#define SEND_MAX_DIR_CREATED_CACHE_SIZE 64
|
|
|
|
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
/*
|
|
|
|
* Max number of entries in the cache that stores directories that were already
|
|
|
|
* created. The cache uses raw struct btrfs_lru_cache_entry entries, so it uses
|
|
|
|
* at most 4096 bytes - sizeof(struct btrfs_lru_cache_entry) is 48 bytes, but
|
|
|
|
* the kmalloc-64 slab is used, so we get 4096 bytes (64 bytes * 64).
|
|
|
|
*/
|
|
|
|
#define SEND_MAX_DIR_UTIMES_CACHE_SIZE 64
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
struct send_ctx {
|
|
|
|
struct file *send_filp;
|
|
|
|
loff_t send_off;
|
|
|
|
char *send_buf;
|
|
|
|
u32 send_size;
|
|
|
|
u32 send_max_size;
|
2022-03-18 01:25:40 +08:00
|
|
|
/*
|
|
|
|
* Whether BTRFS_SEND_A_DATA attribute was already added to current
|
|
|
|
* command (since protocol v2, data must be the last attribute).
|
|
|
|
*/
|
|
|
|
bool put_data;
|
btrfs: send: get send buffer pages for protocol v2
For encoded writes in send v2, we will get the encoded data with
btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
pages. To avoid extra buffers and copies, we should read directly into
the send buffer. Therefore, we need the raw pages for the send buffer.
We currently allocate the send buffer with kvmalloc(), which may return
a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
However, the buffer size we use (144K) is not a power of two, which in
theory is not guaranteed to return a page-aligned buffer, and in
practice would waste a lot of memory due to rounding up to the next
power of two. 144K is large enough that it usually gets allocated with
vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
and save the pages in an array.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-05 01:29:07 +08:00
|
|
|
struct page **send_buf_pages;
|
2013-02-05 04:54:57 +08:00
|
|
|
u64 flags; /* 'flags' member of btrfs_ioctl_send_args is u64 */
|
2021-10-22 22:53:36 +08:00
|
|
|
/* Protocol version compatibility requested */
|
|
|
|
u32 proto;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
struct btrfs_root *send_root;
|
|
|
|
struct btrfs_root *parent_root;
|
|
|
|
struct clone_root *clone_roots;
|
|
|
|
int clone_roots_cnt;
|
|
|
|
|
|
|
|
/* current state of the compare_tree call */
|
|
|
|
struct btrfs_path *left_path;
|
|
|
|
struct btrfs_path *right_path;
|
|
|
|
struct btrfs_key *cmp_key;
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
/*
|
|
|
|
* Keep track of the generation of the last transaction that was used
|
|
|
|
* for relocating a block group. This is periodically checked in order
|
|
|
|
* to detect if a relocation happened since the last check, so that we
|
|
|
|
* don't operate on stale extent buffers for nodes (level >= 1) or on
|
|
|
|
* stale disk_bytenr values of file extent items.
|
|
|
|
*/
|
|
|
|
u64 last_reloc_trans;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* infos of the currently processed inode. In case of deleted inodes,
|
|
|
|
* these are the values from the deleted inode.
|
|
|
|
*/
|
|
|
|
u64 cur_ino;
|
|
|
|
u64 cur_inode_gen;
|
|
|
|
u64 cur_inode_size;
|
|
|
|
u64 cur_inode_mode;
|
2014-02-27 17:29:01 +08:00
|
|
|
u64 cur_inode_rdev;
|
2013-10-23 00:18:51 +08:00
|
|
|
u64 cur_inode_last_extent;
|
2018-02-07 04:40:40 +08:00
|
|
|
u64 cur_inode_next_write_offset;
|
2022-06-03 00:03:08 +08:00
|
|
|
bool cur_inode_new;
|
|
|
|
bool cur_inode_new_gen;
|
|
|
|
bool cur_inode_deleted;
|
2018-07-24 18:54:04 +08:00
|
|
|
bool ignore_cur_inode;
|
2022-08-16 04:54:28 +08:00
|
|
|
bool cur_inode_needs_verity;
|
|
|
|
void *verity_descriptor;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
u64 send_progress;
|
|
|
|
|
|
|
|
struct list_head new_refs;
|
|
|
|
struct list_head deleted_refs;
|
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
struct btrfs_lru_cache name_cache;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2022-05-06 01:16:14 +08:00
|
|
|
/*
|
|
|
|
* The inode we are currently processing. It's not NULL only when we
|
|
|
|
* need to issue write commands for data extents from this inode.
|
|
|
|
*/
|
|
|
|
struct inode *cur_inode;
|
2014-03-05 10:07:35 +08:00
|
|
|
struct file_ra_state ra;
|
btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 18:47:30 +08:00
|
|
|
u64 page_cache_clear_start;
|
|
|
|
bool clean_page_cache;
|
2014-03-05 10:07:35 +08:00
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
/*
|
|
|
|
* We process inodes by their increasing order, so if before an
|
|
|
|
* incremental send we reverse the parent/child relationship of
|
|
|
|
* directories such that a directory with a lower inode number was
|
|
|
|
* the parent of a directory with a higher inode number, and the one
|
|
|
|
* becoming the new parent got renamed too, we can't rename/move the
|
|
|
|
* directory with lower inode number when we finish processing it - we
|
|
|
|
* must process the directory with higher inode number first, then
|
|
|
|
* rename/move it and then rename/move the directory with lower inode
|
|
|
|
* number. Example follows.
|
|
|
|
*
|
|
|
|
* Tree state when the first send was performed:
|
|
|
|
*
|
|
|
|
* .
|
|
|
|
* |-- a (ino 257)
|
|
|
|
* |-- b (ino 258)
|
|
|
|
* |
|
|
|
|
* |
|
|
|
|
* |-- c (ino 259)
|
|
|
|
* | |-- d (ino 260)
|
|
|
|
* |
|
|
|
|
* |-- c2 (ino 261)
|
|
|
|
*
|
|
|
|
* Tree state when the second (incremental) send is performed:
|
|
|
|
*
|
|
|
|
* .
|
|
|
|
* |-- a (ino 257)
|
|
|
|
* |-- b (ino 258)
|
|
|
|
* |-- c2 (ino 261)
|
|
|
|
* |-- d2 (ino 260)
|
|
|
|
* |-- cc (ino 259)
|
|
|
|
*
|
|
|
|
* The sequence of steps that lead to the second state was:
|
|
|
|
*
|
|
|
|
* mv /a/b/c/d /a/b/c2/d2
|
|
|
|
* mv /a/b/c /a/b/c2/d2/cc
|
|
|
|
*
|
|
|
|
* "c" has lower inode number, but we can't move it (2nd mv operation)
|
|
|
|
* before we move "d", which has higher inode number.
|
|
|
|
*
|
|
|
|
* So we just memorize which move/rename operations must be performed
|
|
|
|
* later when their respective parent is processed and moved/renamed.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* Indexed by parent directory inode number. */
|
|
|
|
struct rb_root pending_dir_moves;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reverse index, indexed by the inode number of a directory that
|
|
|
|
* is waiting for the move/rename of its immediate parent before its
|
|
|
|
* own move/rename can be performed.
|
|
|
|
*/
|
|
|
|
struct rb_root waiting_dir_moves;
|
2014-02-19 22:31:44 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* A directory that is going to be rm'ed might have a child directory
|
|
|
|
* which is in the pending directory moves index above. In this case,
|
|
|
|
* the directory can only be removed after the move/rename of its child
|
|
|
|
* is performed. Example:
|
|
|
|
*
|
|
|
|
* Parent snapshot:
|
|
|
|
*
|
|
|
|
* . (ino 256)
|
|
|
|
* |-- a/ (ino 257)
|
|
|
|
* |-- b/ (ino 258)
|
|
|
|
* |-- c/ (ino 259)
|
|
|
|
* | |-- x/ (ino 260)
|
|
|
|
* |
|
|
|
|
* |-- y/ (ino 261)
|
|
|
|
*
|
|
|
|
* Send snapshot:
|
|
|
|
*
|
|
|
|
* . (ino 256)
|
|
|
|
* |-- a/ (ino 257)
|
|
|
|
* |-- b/ (ino 258)
|
|
|
|
* |-- YY/ (ino 261)
|
|
|
|
* |-- x/ (ino 260)
|
|
|
|
*
|
|
|
|
* Sequence of steps that lead to the send snapshot:
|
|
|
|
* rm -f /a/b/c/foo.txt
|
|
|
|
* mv /a/b/y /a/b/YY
|
|
|
|
* mv /a/b/c/x /a/b/YY
|
|
|
|
* rmdir /a/b/c
|
|
|
|
*
|
|
|
|
* When the child is processed, its move/rename is delayed until its
|
|
|
|
* parent is processed (as explained above), but all other operations
|
|
|
|
* like update utimes, chown, chgrp, etc, are performed and the paths
|
|
|
|
* that it uses for those operations must use the orphanized name of
|
|
|
|
* its parent (the directory we're going to rm later), so we need to
|
|
|
|
* memorize that name.
|
|
|
|
*
|
|
|
|
* Indexed by the inode number of the directory to be deleted.
|
|
|
|
*/
|
|
|
|
struct rb_root orphan_dirs;
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
|
|
|
|
struct rb_root rbtree_new_refs;
|
|
|
|
struct rb_root rbtree_deleted_refs;
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
|
2023-01-11 19:36:13 +08:00
|
|
|
struct btrfs_lru_cache backref_cache;
|
|
|
|
u64 backref_cache_last_reloc_trans;
|
2023-01-11 19:36:15 +08:00
|
|
|
|
|
|
|
struct btrfs_lru_cache dir_created_cache;
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
struct btrfs_lru_cache dir_utimes_cache;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
struct pending_dir_move {
|
|
|
|
struct rb_node node;
|
|
|
|
struct list_head list;
|
|
|
|
u64 parent_ino;
|
|
|
|
u64 ino;
|
|
|
|
u64 gen;
|
|
|
|
struct list_head update_refs;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct waiting_dir_move {
|
|
|
|
struct rb_node node;
|
|
|
|
u64 ino;
|
2014-02-19 22:31:44 +08:00
|
|
|
/*
|
|
|
|
* There might be some directory that could not be removed because it
|
|
|
|
* was waiting for this directory inode to be moved first. Therefore
|
|
|
|
* after this directory is moved, we can try to rmdir the ino rmdir_ino.
|
|
|
|
*/
|
|
|
|
u64 rmdir_ino;
|
2020-12-10 20:09:02 +08:00
|
|
|
u64 rmdir_gen;
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
bool orphanized;
|
2014-02-19 22:31:44 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
struct orphan_dir_info {
|
|
|
|
struct rb_node node;
|
|
|
|
u64 ino;
|
|
|
|
u64 gen;
|
2018-05-08 18:11:38 +08:00
|
|
|
u64 last_dir_index_offset;
|
2023-01-11 19:36:09 +08:00
|
|
|
u64 dir_high_seq_ino;
|
2012-07-26 05:19:24 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
struct name_cache_entry {
|
2012-07-28 20:20:58 +08:00
|
|
|
/*
|
2023-01-11 19:36:18 +08:00
|
|
|
* The key in the entry is an inode number, and the generation matches
|
|
|
|
* the inode's generation.
|
2012-07-28 20:20:58 +08:00
|
|
|
*/
|
2023-01-11 19:36:18 +08:00
|
|
|
struct btrfs_lru_cache_entry entry;
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 parent_ino;
|
|
|
|
u64 parent_gen;
|
|
|
|
int ret;
|
|
|
|
int need_later_update;
|
|
|
|
int name_len;
|
|
|
|
char name[];
|
|
|
|
};
|
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
/* See the comment at lru_cache.h about struct btrfs_lru_cache_entry. */
|
|
|
|
static_assert(offsetof(struct name_cache_entry, entry) == 0);
|
|
|
|
|
2019-08-22 01:12:59 +08:00
|
|
|
#define ADVANCE 1
|
|
|
|
#define ADVANCE_ONLY_NEXT -1
|
|
|
|
|
|
|
|
enum btrfs_compare_tree_result {
|
|
|
|
BTRFS_COMPARE_TREE_NEW,
|
|
|
|
BTRFS_COMPARE_TREE_DELETED,
|
|
|
|
BTRFS_COMPARE_TREE_CHANGED,
|
|
|
|
BTRFS_COMPARE_TREE_SAME,
|
|
|
|
};
|
|
|
|
|
2018-02-20 00:24:18 +08:00
|
|
|
__cold
|
Btrfs: send, don't bug on inconsistent snapshots
When doing an incremental send, if we find a new/modified/deleted extent,
reference or xattr without having previously processed the corresponding
inode item we end up exexuting a BUG_ON(). This is because whenever an
extent, xattr or reference is added, modified or deleted, we always expect
to have the corresponding inode item updated. However there are situations
where this will not happen due to transient -ENOMEM or -ENOSPC errors when
doing delayed inode updates.
For example, when punching holes we can succeed in deleting and modifying
(shrinking) extents but later fail to do the delayed inode update. So after
such failure we close our transaction handle and right after a snapshot of
the fs/subvol tree can be made and used later for a send operation. The
same thing can happen during truncate, link, unlink, and xattr related
operations.
So instead of executing a BUG_ON, make send return an -EIO error and print
an informative error message do dmesg/syslog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 08:50:37 +08:00
|
|
|
static void inconsistent_snapshot_error(struct send_ctx *sctx,
|
|
|
|
enum btrfs_compare_tree_result result,
|
|
|
|
const char *what)
|
|
|
|
{
|
|
|
|
const char *result_string;
|
|
|
|
|
|
|
|
switch (result) {
|
|
|
|
case BTRFS_COMPARE_TREE_NEW:
|
|
|
|
result_string = "new";
|
|
|
|
break;
|
|
|
|
case BTRFS_COMPARE_TREE_DELETED:
|
|
|
|
result_string = "deleted";
|
|
|
|
break;
|
|
|
|
case BTRFS_COMPARE_TREE_CHANGED:
|
|
|
|
result_string = "updated";
|
|
|
|
break;
|
|
|
|
case BTRFS_COMPARE_TREE_SAME:
|
|
|
|
ASSERT(0);
|
|
|
|
result_string = "unchanged";
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
ASSERT(0);
|
|
|
|
result_string = "unexpected";
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_err(sctx->send_root->fs_info,
|
|
|
|
"Send: inconsistent snapshot, found %s %s for inode %llu without updated inode item, send root is %llu, parent root is %llu",
|
|
|
|
result_string, what, sctx->cmp_key->objectid,
|
|
|
|
sctx->send_root->root_key.objectid,
|
|
|
|
(sctx->parent_root ?
|
|
|
|
sctx->parent_root->root_key.objectid : 0));
|
|
|
|
}
|
|
|
|
|
2021-10-22 22:53:36 +08:00
|
|
|
__maybe_unused
|
|
|
|
static bool proto_cmd_ok(const struct send_ctx *sctx, int cmd)
|
|
|
|
{
|
|
|
|
switch (sctx->proto) {
|
2022-03-18 01:25:38 +08:00
|
|
|
case 1: return cmd <= BTRFS_SEND_C_MAX_V1;
|
|
|
|
case 2: return cmd <= BTRFS_SEND_C_MAX_V2;
|
2022-10-07 23:10:02 +08:00
|
|
|
case 3: return cmd <= BTRFS_SEND_C_MAX_V3;
|
2021-10-22 22:53:36 +08:00
|
|
|
default: return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
static int is_waiting_for_move(struct send_ctx *sctx, u64 ino);
|
|
|
|
|
2014-02-19 22:31:44 +08:00
|
|
|
static struct waiting_dir_move *
|
|
|
|
get_waiting_dir_move(struct send_ctx *sctx, u64 ino);
|
|
|
|
|
2020-12-10 20:09:02 +08:00
|
|
|
static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 gen);
|
2014-02-19 22:31:44 +08:00
|
|
|
|
2013-10-23 00:18:51 +08:00
|
|
|
static int need_send_hole(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
return (sctx->parent_root && !sctx->cur_inode_new &&
|
|
|
|
!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted &&
|
|
|
|
S_ISREG(sctx->cur_inode_mode));
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
static void fs_path_reset(struct fs_path *p)
|
|
|
|
{
|
|
|
|
if (p->reversed) {
|
|
|
|
p->start = p->buf + p->buf_len - 1;
|
|
|
|
p->end = p->start;
|
|
|
|
*p->start = 0;
|
|
|
|
} else {
|
|
|
|
p->start = p->buf;
|
|
|
|
p->end = p->start;
|
|
|
|
*p->start = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
static struct fs_path *fs_path_alloc(void)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
struct fs_path *p;
|
|
|
|
|
2016-01-19 01:42:13 +08:00
|
|
|
p = kmalloc(sizeof(*p), GFP_KERNEL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return NULL;
|
|
|
|
p->reversed = 0;
|
|
|
|
p->buf = p->inline_buf;
|
|
|
|
p->buf_len = FS_PATH_INLINE_SIZE;
|
|
|
|
fs_path_reset(p);
|
|
|
|
return p;
|
|
|
|
}
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
static struct fs_path *fs_path_alloc_reversed(void)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
struct fs_path *p;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return NULL;
|
|
|
|
p->reversed = 1;
|
|
|
|
fs_path_reset(p);
|
|
|
|
return p;
|
|
|
|
}
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
static void fs_path_free(struct fs_path *p)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
if (!p)
|
|
|
|
return;
|
2014-02-05 23:17:34 +08:00
|
|
|
if (p->buf != p->inline_buf)
|
|
|
|
kfree(p->buf);
|
2012-07-26 05:19:24 +08:00
|
|
|
kfree(p);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int fs_path_len(struct fs_path *p)
|
|
|
|
{
|
|
|
|
return p->end - p->start;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int fs_path_ensure_buf(struct fs_path *p, int len)
|
|
|
|
{
|
|
|
|
char *tmp_buf;
|
|
|
|
int path_len;
|
|
|
|
int old_buf_len;
|
|
|
|
|
|
|
|
len++;
|
|
|
|
|
|
|
|
if (p->buf_len >= len)
|
|
|
|
return 0;
|
|
|
|
|
2014-04-26 20:02:03 +08:00
|
|
|
if (len > PATH_MAX) {
|
|
|
|
WARN_ON(1);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2014-02-26 02:32:59 +08:00
|
|
|
path_len = p->end - p->start;
|
|
|
|
old_buf_len = p->buf_len;
|
|
|
|
|
2022-09-24 04:28:13 +08:00
|
|
|
/*
|
|
|
|
* Allocate to the next largest kmalloc bucket size, to let
|
|
|
|
* the fast path happen most of the time.
|
|
|
|
*/
|
|
|
|
len = kmalloc_size_roundup(len);
|
2014-02-05 23:17:34 +08:00
|
|
|
/*
|
|
|
|
* First time the inline_buf does not suffice
|
|
|
|
*/
|
2014-05-22 00:38:13 +08:00
|
|
|
if (p->buf == p->inline_buf) {
|
2016-01-19 01:42:13 +08:00
|
|
|
tmp_buf = kmalloc(len, GFP_KERNEL);
|
2014-05-22 00:38:13 +08:00
|
|
|
if (tmp_buf)
|
|
|
|
memcpy(tmp_buf, p->buf, old_buf_len);
|
|
|
|
} else {
|
2016-01-19 01:42:13 +08:00
|
|
|
tmp_buf = krealloc(p->buf, len, GFP_KERNEL);
|
2014-05-22 00:38:13 +08:00
|
|
|
}
|
2014-02-26 02:33:08 +08:00
|
|
|
if (!tmp_buf)
|
|
|
|
return -ENOMEM;
|
|
|
|
p->buf = tmp_buf;
|
2022-09-24 04:28:13 +08:00
|
|
|
p->buf_len = len;
|
2014-02-05 23:17:34 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
if (p->reversed) {
|
|
|
|
tmp_buf = p->buf + old_buf_len - path_len - 1;
|
|
|
|
p->end = p->buf + p->buf_len - 1;
|
|
|
|
p->start = p->end - path_len;
|
|
|
|
memmove(p->start, tmp_buf, path_len + 1);
|
|
|
|
} else {
|
|
|
|
p->start = p->buf;
|
|
|
|
p->end = p->start + path_len;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-02-04 02:23:19 +08:00
|
|
|
static int fs_path_prepare_for_add(struct fs_path *p, int name_len,
|
|
|
|
char **prepared)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
int new_len;
|
|
|
|
|
|
|
|
new_len = p->end - p->start + name_len;
|
|
|
|
if (p->start != p->end)
|
|
|
|
new_len++;
|
|
|
|
ret = fs_path_ensure_buf(p, new_len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (p->reversed) {
|
|
|
|
if (p->start != p->end)
|
|
|
|
*--p->start = '/';
|
|
|
|
p->start -= name_len;
|
2014-02-04 02:23:19 +08:00
|
|
|
*prepared = p->start;
|
2012-07-26 05:19:24 +08:00
|
|
|
} else {
|
|
|
|
if (p->start != p->end)
|
|
|
|
*p->end++ = '/';
|
2014-02-04 02:23:19 +08:00
|
|
|
*prepared = p->end;
|
2012-07-26 05:19:24 +08:00
|
|
|
p->end += name_len;
|
|
|
|
*p->end = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int fs_path_add(struct fs_path *p, const char *name, int name_len)
|
|
|
|
{
|
|
|
|
int ret;
|
2014-02-04 02:23:19 +08:00
|
|
|
char *prepared;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2014-02-04 02:23:19 +08:00
|
|
|
ret = fs_path_prepare_for_add(p, name_len, &prepared);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2014-02-04 02:23:19 +08:00
|
|
|
memcpy(prepared, name, name_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int fs_path_add_path(struct fs_path *p, struct fs_path *p2)
|
|
|
|
{
|
|
|
|
int ret;
|
2014-02-04 02:23:19 +08:00
|
|
|
char *prepared;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2014-02-04 02:23:19 +08:00
|
|
|
ret = fs_path_prepare_for_add(p, p2->end - p2->start, &prepared);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2014-02-04 02:23:19 +08:00
|
|
|
memcpy(prepared, p2->start, p2->end - p2->start);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int fs_path_add_from_extent_buffer(struct fs_path *p,
|
|
|
|
struct extent_buffer *eb,
|
|
|
|
unsigned long off, int len)
|
|
|
|
{
|
|
|
|
int ret;
|
2014-02-04 02:23:19 +08:00
|
|
|
char *prepared;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2014-02-04 02:23:19 +08:00
|
|
|
ret = fs_path_prepare_for_add(p, len, &prepared);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2014-02-04 02:23:19 +08:00
|
|
|
read_extent_buffer(eb, prepared, off, len);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int fs_path_copy(struct fs_path *p, struct fs_path *from)
|
|
|
|
{
|
|
|
|
p->reversed = from->reversed;
|
|
|
|
fs_path_reset(p);
|
|
|
|
|
2022-01-11 09:57:16 +08:00
|
|
|
return fs_path_add_path(p, from);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void fs_path_unreverse(struct fs_path *p)
|
|
|
|
{
|
|
|
|
char *tmp;
|
|
|
|
int len;
|
|
|
|
|
|
|
|
if (!p->reversed)
|
|
|
|
return;
|
|
|
|
|
|
|
|
tmp = p->start;
|
|
|
|
len = p->end - p->start;
|
|
|
|
p->start = p->buf;
|
|
|
|
p->end = p->start + len;
|
|
|
|
memmove(p->start, tmp, len + 1);
|
|
|
|
p->reversed = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct btrfs_path *alloc_path_for_send(void)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return NULL;
|
|
|
|
path->search_commit_root = 1;
|
|
|
|
path->skip_locking = 1;
|
2014-03-29 05:16:01 +08:00
|
|
|
path->need_commit_sem = 1;
|
2012-07-26 05:19:24 +08:00
|
|
|
return path;
|
|
|
|
}
|
|
|
|
|
2013-04-26 04:41:01 +08:00
|
|
|
static int write_buf(struct file *filp, const void *buf, u32 len, loff_t *off)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
u32 pos = 0;
|
|
|
|
|
|
|
|
while (pos < len) {
|
2017-09-01 23:39:19 +08:00
|
|
|
ret = kernel_write(filp, buf + pos, len - pos, off);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
2017-09-01 23:39:19 +08:00
|
|
|
return ret;
|
2022-06-02 21:40:46 +08:00
|
|
|
if (ret == 0)
|
2017-09-01 23:39:19 +08:00
|
|
|
return -EIO;
|
2012-07-26 05:19:24 +08:00
|
|
|
pos += ret;
|
|
|
|
}
|
|
|
|
|
2017-09-01 23:39:19 +08:00
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int tlv_put(struct send_ctx *sctx, u16 attr, const void *data, int len)
|
|
|
|
{
|
|
|
|
struct btrfs_tlv_header *hdr;
|
|
|
|
int total_len = sizeof(*hdr) + len;
|
|
|
|
int left = sctx->send_max_size - sctx->send_size;
|
|
|
|
|
2022-03-18 01:25:40 +08:00
|
|
|
if (WARN_ON_ONCE(sctx->put_data))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
if (unlikely(left < total_len))
|
|
|
|
return -EOVERFLOW;
|
|
|
|
|
|
|
|
hdr = (struct btrfs_tlv_header *) (sctx->send_buf + sctx->send_size);
|
2020-09-15 16:54:23 +08:00
|
|
|
put_unaligned_le16(attr, &hdr->tlv_type);
|
|
|
|
put_unaligned_le16(len, &hdr->tlv_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
memcpy(hdr + 1, data, len);
|
|
|
|
sctx->send_size += total_len;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-12-17 00:34:10 +08:00
|
|
|
#define TLV_PUT_DEFINE_INT(bits) \
|
|
|
|
static int tlv_put_u##bits(struct send_ctx *sctx, \
|
|
|
|
u##bits attr, u##bits value) \
|
|
|
|
{ \
|
|
|
|
__le##bits __tmp = cpu_to_le##bits(value); \
|
|
|
|
return tlv_put(sctx, attr, &__tmp, sizeof(__tmp)); \
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2022-08-16 04:54:28 +08:00
|
|
|
TLV_PUT_DEFINE_INT(8)
|
2022-03-18 01:25:42 +08:00
|
|
|
TLV_PUT_DEFINE_INT(32)
|
2013-12-17 00:34:10 +08:00
|
|
|
TLV_PUT_DEFINE_INT(64)
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
static int tlv_put_string(struct send_ctx *sctx, u16 attr,
|
|
|
|
const char *str, int len)
|
|
|
|
{
|
|
|
|
if (len == -1)
|
|
|
|
len = strlen(str);
|
|
|
|
return tlv_put(sctx, attr, str, len);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tlv_put_uuid(struct send_ctx *sctx, u16 attr,
|
|
|
|
const u8 *uuid)
|
|
|
|
{
|
|
|
|
return tlv_put(sctx, attr, uuid, BTRFS_UUID_SIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tlv_put_btrfs_timespec(struct send_ctx *sctx, u16 attr,
|
|
|
|
struct extent_buffer *eb,
|
|
|
|
struct btrfs_timespec *ts)
|
|
|
|
{
|
|
|
|
struct btrfs_timespec bts;
|
|
|
|
read_extent_buffer(eb, &bts, (unsigned long)ts, sizeof(bts));
|
|
|
|
return tlv_put(sctx, attr, &bts, sizeof(bts));
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2018-03-03 09:05:49 +08:00
|
|
|
#define TLV_PUT(sctx, attrtype, data, attrlen) \
|
2012-07-26 05:19:24 +08:00
|
|
|
do { \
|
2018-03-03 09:05:49 +08:00
|
|
|
ret = tlv_put(sctx, attrtype, data, attrlen); \
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0) \
|
|
|
|
goto tlv_put_failure; \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
#define TLV_PUT_INT(sctx, attrtype, bits, value) \
|
|
|
|
do { \
|
|
|
|
ret = tlv_put_u##bits(sctx, attrtype, value); \
|
|
|
|
if (ret < 0) \
|
|
|
|
goto tlv_put_failure; \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
#define TLV_PUT_U8(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 8, data)
|
|
|
|
#define TLV_PUT_U16(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 16, data)
|
|
|
|
#define TLV_PUT_U32(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 32, data)
|
|
|
|
#define TLV_PUT_U64(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 64, data)
|
|
|
|
#define TLV_PUT_STRING(sctx, attrtype, str, len) \
|
|
|
|
do { \
|
|
|
|
ret = tlv_put_string(sctx, attrtype, str, len); \
|
|
|
|
if (ret < 0) \
|
|
|
|
goto tlv_put_failure; \
|
|
|
|
} while (0)
|
|
|
|
#define TLV_PUT_PATH(sctx, attrtype, p) \
|
|
|
|
do { \
|
|
|
|
ret = tlv_put_string(sctx, attrtype, p->start, \
|
|
|
|
p->end - p->start); \
|
|
|
|
if (ret < 0) \
|
|
|
|
goto tlv_put_failure; \
|
|
|
|
} while(0)
|
|
|
|
#define TLV_PUT_UUID(sctx, attrtype, uuid) \
|
|
|
|
do { \
|
|
|
|
ret = tlv_put_uuid(sctx, attrtype, uuid); \
|
|
|
|
if (ret < 0) \
|
|
|
|
goto tlv_put_failure; \
|
|
|
|
} while (0)
|
|
|
|
#define TLV_PUT_BTRFS_TIMESPEC(sctx, attrtype, eb, ts) \
|
|
|
|
do { \
|
|
|
|
ret = tlv_put_btrfs_timespec(sctx, attrtype, eb, ts); \
|
|
|
|
if (ret < 0) \
|
|
|
|
goto tlv_put_failure; \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
static int send_header(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
struct btrfs_stream_header hdr;
|
|
|
|
|
|
|
|
strcpy(hdr.magic, BTRFS_SEND_STREAM_MAGIC);
|
2022-03-18 01:25:43 +08:00
|
|
|
hdr.version = cpu_to_le32(sctx->proto);
|
2012-09-14 14:04:21 +08:00
|
|
|
return write_buf(sctx->send_filp, &hdr, sizeof(hdr),
|
|
|
|
&sctx->send_off);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For each command/item we want to send to userspace, we call this function.
|
|
|
|
*/
|
|
|
|
static int begin_cmd(struct send_ctx *sctx, int cmd)
|
|
|
|
{
|
|
|
|
struct btrfs_cmd_header *hdr;
|
|
|
|
|
2013-10-31 13:00:08 +08:00
|
|
|
if (WARN_ON(!sctx->send_buf))
|
2012-07-26 05:19:24 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
BUG_ON(sctx->send_size);
|
|
|
|
|
|
|
|
sctx->send_size += sizeof(*hdr);
|
|
|
|
hdr = (struct btrfs_cmd_header *)sctx->send_buf;
|
2020-09-15 16:54:23 +08:00
|
|
|
put_unaligned_le16(cmd, &hdr->cmd);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int send_cmd(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_cmd_header *hdr;
|
|
|
|
u32 crc;
|
|
|
|
|
|
|
|
hdr = (struct btrfs_cmd_header *)sctx->send_buf;
|
2020-09-15 16:54:23 +08:00
|
|
|
put_unaligned_le32(sctx->send_size - sizeof(*hdr), &hdr->len);
|
|
|
|
put_unaligned_le32(0, &hdr->crc);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2023-08-26 04:19:20 +08:00
|
|
|
crc = crc32c(0, (unsigned char *)sctx->send_buf, sctx->send_size);
|
2020-09-15 16:54:23 +08:00
|
|
|
put_unaligned_le32(crc, &hdr->crc);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-09-14 14:04:21 +08:00
|
|
|
ret = write_buf(sctx->send_filp, sctx->send_buf, sctx->send_size,
|
|
|
|
&sctx->send_off);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
sctx->send_size = 0;
|
2022-03-18 01:25:40 +08:00
|
|
|
sctx->put_data = false;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sends a move instruction to user space
|
|
|
|
*/
|
|
|
|
static int send_rename(struct send_ctx *sctx,
|
|
|
|
struct fs_path *from, struct fs_path *to)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_rename %s -> %s", from->start, to->start);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_RENAME);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, from);
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_TO, to);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sends a link instruction to user space
|
|
|
|
*/
|
|
|
|
static int send_link(struct send_ctx *sctx,
|
|
|
|
struct fs_path *path, struct fs_path *lnk)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_link %s -> %s", path->start, lnk->start);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_LINK);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, lnk);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sends an unlink instruction to user space
|
|
|
|
*/
|
|
|
|
static int send_unlink(struct send_ctx *sctx, struct fs_path *path)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_unlink %s", path->start);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_UNLINK);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sends a rmdir instruction to user space
|
|
|
|
*/
|
|
|
|
static int send_rmdir(struct send_ctx *sctx, struct fs_path *path)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_rmdir %s", path->start);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_RMDIR);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
struct btrfs_inode_info {
|
|
|
|
u64 size;
|
|
|
|
u64 gen;
|
|
|
|
u64 mode;
|
|
|
|
u64 uid;
|
|
|
|
u64 gid;
|
|
|
|
u64 rdev;
|
|
|
|
u64 fileattr;
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
u64 nlink;
|
2022-08-12 22:42:32 +08:00
|
|
|
};
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* Helper function to retrieve some fields from an inode item.
|
|
|
|
*/
|
2022-08-12 22:42:32 +08:00
|
|
|
static int get_inode_info(struct btrfs_root *root, u64 ino,
|
|
|
|
struct btrfs_inode_info *info)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
2022-08-12 22:42:32 +08:00
|
|
|
struct btrfs_path *path;
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_inode_item *ii;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
key.objectid = ino;
|
|
|
|
key.type = BTRFS_INODE_ITEM_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret) {
|
2014-03-29 05:16:01 +08:00
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
2022-08-12 22:42:32 +08:00
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
if (!info)
|
|
|
|
goto out;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
ii = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_inode_item);
|
2022-08-12 22:42:32 +08:00
|
|
|
info->size = btrfs_inode_size(path->nodes[0], ii);
|
|
|
|
info->gen = btrfs_inode_generation(path->nodes[0], ii);
|
|
|
|
info->mode = btrfs_inode_mode(path->nodes[0], ii);
|
|
|
|
info->uid = btrfs_inode_uid(path->nodes[0], ii);
|
|
|
|
info->gid = btrfs_inode_gid(path->nodes[0], ii);
|
|
|
|
info->rdev = btrfs_inode_rdev(path->nodes[0], ii);
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
info->nlink = btrfs_inode_nlink(path->nodes[0], ii);
|
2022-05-19 00:02:55 +08:00
|
|
|
/*
|
|
|
|
* Transfer the unchanged u64 value of btrfs_inode_item::flags, that's
|
|
|
|
* otherwise logically split to 32/32 parts.
|
|
|
|
*/
|
2022-08-12 22:42:32 +08:00
|
|
|
info->fileattr = btrfs_inode_flags(path->nodes[0], ii);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
2014-03-29 05:16:01 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
static int get_inode_gen(struct btrfs_root *root, u64 ino, u64 *gen)
|
2014-03-29 05:16:01 +08:00
|
|
|
{
|
|
|
|
int ret;
|
2022-12-17 04:15:53 +08:00
|
|
|
struct btrfs_inode_info info = { 0 };
|
2014-03-29 05:16:01 +08:00
|
|
|
|
2022-12-17 04:15:53 +08:00
|
|
|
ASSERT(gen);
|
2022-08-12 22:42:32 +08:00
|
|
|
|
|
|
|
ret = get_inode_info(root, ino, &info);
|
2022-12-17 04:15:53 +08:00
|
|
|
*gen = info.gen;
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
typedef int (*iterate_inode_ref_t)(int num, u64 dir, int index,
|
|
|
|
struct fs_path *p,
|
|
|
|
void *ctx);
|
|
|
|
|
|
|
|
/*
|
2012-10-15 16:30:45 +08:00
|
|
|
* Helper function to iterate the entries in ONE btrfs_inode_ref or
|
|
|
|
* btrfs_inode_extref.
|
2012-07-26 05:19:24 +08:00
|
|
|
* The iterate callback may return a non zero value to stop iteration. This can
|
|
|
|
* be a negative value for error codes or 1 to simply stop it.
|
|
|
|
*
|
2012-10-15 16:30:45 +08:00
|
|
|
* path must point to the INODE_REF or INODE_EXTREF when called.
|
2012-07-26 05:19:24 +08:00
|
|
|
*/
|
2013-05-08 15:51:52 +08:00
|
|
|
static int iterate_inode_ref(struct btrfs_root *root, struct btrfs_path *path,
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_key *found_key, int resolve,
|
|
|
|
iterate_inode_ref_t iterate, void *ctx)
|
|
|
|
{
|
2012-10-15 16:30:45 +08:00
|
|
|
struct extent_buffer *eb = path->nodes[0];
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_inode_ref *iref;
|
2012-10-15 16:30:45 +08:00
|
|
|
struct btrfs_inode_extref *extref;
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_path *tmp_path;
|
|
|
|
struct fs_path *p;
|
2012-10-15 16:30:45 +08:00
|
|
|
u32 cur = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
u32 total;
|
2012-10-15 16:30:45 +08:00
|
|
|
int slot = path->slots[0];
|
2012-07-26 05:19:24 +08:00
|
|
|
u32 name_len;
|
|
|
|
char *start;
|
|
|
|
int ret = 0;
|
2012-10-15 16:30:45 +08:00
|
|
|
int num = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
int index;
|
2012-10-15 16:30:45 +08:00
|
|
|
u64 dir;
|
|
|
|
unsigned long name_off;
|
|
|
|
unsigned long elem_size;
|
|
|
|
unsigned long ptr;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc_reversed();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
tmp_path = alloc_path_for_send();
|
|
|
|
if (!tmp_path) {
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2012-10-15 16:30:45 +08:00
|
|
|
if (found_key->type == BTRFS_INODE_REF_KEY) {
|
|
|
|
ptr = (unsigned long)btrfs_item_ptr(eb, slot,
|
|
|
|
struct btrfs_inode_ref);
|
2021-10-22 02:58:35 +08:00
|
|
|
total = btrfs_item_size(eb, slot);
|
2012-10-15 16:30:45 +08:00
|
|
|
elem_size = sizeof(*iref);
|
|
|
|
} else {
|
|
|
|
ptr = btrfs_item_ptr_offset(eb, slot);
|
2021-10-22 02:58:35 +08:00
|
|
|
total = btrfs_item_size(eb, slot);
|
2012-10-15 16:30:45 +08:00
|
|
|
elem_size = sizeof(*extref);
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
while (cur < total) {
|
|
|
|
fs_path_reset(p);
|
|
|
|
|
2012-10-15 16:30:45 +08:00
|
|
|
if (found_key->type == BTRFS_INODE_REF_KEY) {
|
|
|
|
iref = (struct btrfs_inode_ref *)(ptr + cur);
|
|
|
|
name_len = btrfs_inode_ref_name_len(eb, iref);
|
|
|
|
name_off = (unsigned long)(iref + 1);
|
|
|
|
index = btrfs_inode_ref_index(eb, iref);
|
|
|
|
dir = found_key->offset;
|
|
|
|
} else {
|
|
|
|
extref = (struct btrfs_inode_extref *)(ptr + cur);
|
|
|
|
name_len = btrfs_inode_extref_name_len(eb, extref);
|
|
|
|
name_off = (unsigned long)&extref->name;
|
|
|
|
index = btrfs_inode_extref_index(eb, extref);
|
|
|
|
dir = btrfs_inode_extref_parent(eb, extref);
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
if (resolve) {
|
2012-10-15 16:30:45 +08:00
|
|
|
start = btrfs_ref_to_path(root, tmp_path, name_len,
|
|
|
|
name_off, eb, dir,
|
|
|
|
p->buf, p->buf_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (IS_ERR(start)) {
|
|
|
|
ret = PTR_ERR(start);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (start < p->buf) {
|
|
|
|
/* overflow , try again with larger buffer */
|
|
|
|
ret = fs_path_ensure_buf(p,
|
|
|
|
p->buf_len + p->buf - start);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2012-10-15 16:30:45 +08:00
|
|
|
start = btrfs_ref_to_path(root, tmp_path,
|
|
|
|
name_len, name_off,
|
|
|
|
eb, dir,
|
|
|
|
p->buf, p->buf_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (IS_ERR(start)) {
|
|
|
|
ret = PTR_ERR(start);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
BUG_ON(start < p->buf);
|
|
|
|
}
|
|
|
|
p->start = start;
|
|
|
|
} else {
|
2012-10-15 16:30:45 +08:00
|
|
|
ret = fs_path_add_from_extent_buffer(p, eb, name_off,
|
|
|
|
name_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2012-10-15 16:30:45 +08:00
|
|
|
cur += elem_size + name_len;
|
|
|
|
ret = iterate(num, dir, index, p, ctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
num++;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(tmp_path);
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
typedef int (*iterate_dir_item_t)(int num, struct btrfs_key *di_key,
|
|
|
|
const char *name, int name_len,
|
|
|
|
const char *data, int data_len,
|
2021-11-05 08:00:13 +08:00
|
|
|
void *ctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper function to iterate the entries in ONE btrfs_dir_item.
|
|
|
|
* The iterate callback may return a non zero value to stop iteration. This can
|
|
|
|
* be a negative value for error codes or 1 to simply stop it.
|
|
|
|
*
|
|
|
|
* path must point to the dir item when called.
|
|
|
|
*/
|
2013-05-08 15:51:52 +08:00
|
|
|
static int iterate_dir_item(struct btrfs_root *root, struct btrfs_path *path,
|
2012-07-26 05:19:24 +08:00
|
|
|
iterate_dir_item_t iterate, void *ctx)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
struct btrfs_dir_item *di;
|
|
|
|
struct btrfs_key di_key;
|
|
|
|
char *buf = NULL;
|
2014-05-24 03:15:16 +08:00
|
|
|
int buf_len;
|
2012-07-26 05:19:24 +08:00
|
|
|
u32 name_len;
|
|
|
|
u32 data_len;
|
|
|
|
u32 cur;
|
|
|
|
u32 len;
|
|
|
|
u32 total;
|
|
|
|
int slot;
|
|
|
|
int num;
|
|
|
|
|
2014-08-20 17:45:45 +08:00
|
|
|
/*
|
|
|
|
* Start with a small buffer (1 page). If later we end up needing more
|
|
|
|
* space, which can happen for xattrs on a fs with a leaf size greater
|
|
|
|
* then the page size, attempt to increase the buffer. Typically xattr
|
|
|
|
* values are small.
|
|
|
|
*/
|
|
|
|
buf_len = PATH_MAX;
|
2016-01-19 01:42:13 +08:00
|
|
|
buf = kmalloc(buf_len, GFP_KERNEL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!buf) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
eb = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
di = btrfs_item_ptr(eb, slot, struct btrfs_dir_item);
|
|
|
|
cur = 0;
|
|
|
|
len = 0;
|
2021-10-22 02:58:35 +08:00
|
|
|
total = btrfs_item_size(eb, slot);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
num = 0;
|
|
|
|
while (cur < total) {
|
|
|
|
name_len = btrfs_dir_name_len(eb, di);
|
|
|
|
data_len = btrfs_dir_data_len(eb, di);
|
|
|
|
btrfs_dir_item_key_to_cpu(eb, di, &di_key);
|
|
|
|
|
2022-10-21 00:58:28 +08:00
|
|
|
if (btrfs_dir_ftype(eb, di) == BTRFS_FT_XATTR) {
|
2014-05-24 03:15:16 +08:00
|
|
|
if (name_len > XATTR_NAME_MAX) {
|
|
|
|
ret = -ENAMETOOLONG;
|
|
|
|
goto out;
|
|
|
|
}
|
2016-06-15 21:22:56 +08:00
|
|
|
if (name_len + data_len >
|
|
|
|
BTRFS_MAX_XATTR_SIZE(root->fs_info)) {
|
2014-05-24 03:15:16 +08:00
|
|
|
ret = -E2BIG;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Path too long
|
|
|
|
*/
|
2014-08-20 17:45:45 +08:00
|
|
|
if (name_len + data_len > PATH_MAX) {
|
2014-05-24 03:15:16 +08:00
|
|
|
ret = -ENAMETOOLONG;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2014-08-20 17:45:45 +08:00
|
|
|
if (name_len + data_len > buf_len) {
|
|
|
|
buf_len = name_len + data_len;
|
|
|
|
if (is_vmalloc_addr(buf)) {
|
|
|
|
vfree(buf);
|
|
|
|
buf = NULL;
|
|
|
|
} else {
|
|
|
|
char *tmp = krealloc(buf, buf_len,
|
2016-01-19 01:42:13 +08:00
|
|
|
GFP_KERNEL | __GFP_NOWARN);
|
2014-08-20 17:45:45 +08:00
|
|
|
|
|
|
|
if (!tmp)
|
|
|
|
kfree(buf);
|
|
|
|
buf = tmp;
|
|
|
|
}
|
|
|
|
if (!buf) {
|
2017-06-01 00:40:02 +08:00
|
|
|
buf = kvmalloc(buf_len, GFP_KERNEL);
|
2014-08-20 17:45:45 +08:00
|
|
|
if (!buf) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
read_extent_buffer(eb, buf, (unsigned long)(di + 1),
|
|
|
|
name_len + data_len);
|
|
|
|
|
|
|
|
len = sizeof(*di) + name_len + data_len;
|
|
|
|
di = (struct btrfs_dir_item *)((char *)di + len);
|
|
|
|
cur += len;
|
|
|
|
|
|
|
|
ret = iterate(num, &di_key, buf, name_len, buf + name_len,
|
2021-11-05 08:00:13 +08:00
|
|
|
data_len, ctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
num++;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
2014-08-20 17:45:45 +08:00
|
|
|
kvfree(buf);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __copy_first_ref(int num, u64 dir, int index,
|
|
|
|
struct fs_path *p, void *ctx)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct fs_path *pt = ctx;
|
|
|
|
|
|
|
|
ret = fs_path_copy(pt, p);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* we want the first only */
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Retrieve the first path of an inode. If an inode has more then one
|
|
|
|
* ref/hardlink, this is ignored.
|
|
|
|
*/
|
2013-05-08 15:51:52 +08:00
|
|
|
static int get_inode_path(struct btrfs_root *root,
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 ino, struct fs_path *path)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_key key, found_key;
|
|
|
|
struct btrfs_path *p;
|
|
|
|
|
|
|
|
p = alloc_path_for_send();
|
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
fs_path_reset(path);
|
|
|
|
|
|
|
|
key.objectid = ino;
|
|
|
|
key.type = BTRFS_INODE_REF_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot_for_read(root, &key, p, 1, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
|
|
|
ret = 1;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
btrfs_item_key_to_cpu(p->nodes[0], &found_key, p->slots[0]);
|
|
|
|
if (found_key.objectid != ino ||
|
2012-10-15 16:30:45 +08:00
|
|
|
(found_key.type != BTRFS_INODE_REF_KEY &&
|
|
|
|
found_key.type != BTRFS_INODE_EXTREF_KEY)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_inode_ref(root, p, &found_key, 1,
|
|
|
|
__copy_first_ref, path);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(p);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct backref_ctx {
|
|
|
|
struct send_ctx *sctx;
|
|
|
|
|
|
|
|
/* number of total found references */
|
|
|
|
u64 found;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* used for clones found in send_root. clones found behind cur_objectid
|
|
|
|
* and cur_offset are not considered as allowed clones.
|
|
|
|
*/
|
|
|
|
u64 cur_objectid;
|
|
|
|
u64 cur_offset;
|
|
|
|
|
|
|
|
/* may be truncated in case it's the last extent in a file */
|
|
|
|
u64 extent_len;
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
|
|
|
|
/* The bytenr the file extent item we are processing refers to. */
|
|
|
|
u64 bytenr;
|
2022-11-02 00:15:53 +08:00
|
|
|
/* The owner (root id) of the data backref for the current extent. */
|
|
|
|
u64 backref_owner;
|
|
|
|
/* The offset of the data backref for the current extent. */
|
|
|
|
u64 backref_offset;
|
2012-07-26 05:19:24 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
static int __clone_root_cmp_bsearch(const void *key, const void *elt)
|
|
|
|
{
|
2012-08-13 16:52:38 +08:00
|
|
|
u64 root = (u64)(uintptr_t)key;
|
2021-07-26 20:15:26 +08:00
|
|
|
const struct clone_root *cr = elt;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2018-08-06 13:25:24 +08:00
|
|
|
if (root < cr->root->root_key.objectid)
|
2012-07-26 05:19:24 +08:00
|
|
|
return -1;
|
2018-08-06 13:25:24 +08:00
|
|
|
if (root > cr->root->root_key.objectid)
|
2012-07-26 05:19:24 +08:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __clone_root_cmp_sort(const void *e1, const void *e2)
|
|
|
|
{
|
2021-07-26 20:15:26 +08:00
|
|
|
const struct clone_root *cr1 = e1;
|
|
|
|
const struct clone_root *cr2 = e2;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2018-08-06 13:25:24 +08:00
|
|
|
if (cr1->root->root_key.objectid < cr2->root->root_key.objectid)
|
2012-07-26 05:19:24 +08:00
|
|
|
return -1;
|
2018-08-06 13:25:24 +08:00
|
|
|
if (cr1->root->root_key.objectid > cr2->root->root_key.objectid)
|
2012-07-26 05:19:24 +08:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Called for every backref that is found for the current extent.
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
* Results are collected in sctx->clone_roots->ino/offset.
|
2012-07-26 05:19:24 +08:00
|
|
|
*/
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
static int iterate_backrefs(u64 ino, u64 offset, u64 num_bytes, u64 root_id,
|
|
|
|
void *ctx_)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
struct backref_ctx *bctx = ctx_;
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
struct clone_root *clone_root;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
/* First check if the root is in the list of accepted clone sources */
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
clone_root = bsearch((void *)(uintptr_t)root_id, bctx->sctx->clone_roots,
|
|
|
|
bctx->sctx->clone_roots_cnt,
|
|
|
|
sizeof(struct clone_root),
|
|
|
|
__clone_root_cmp_bsearch);
|
|
|
|
if (!clone_root)
|
2012-07-26 05:19:24 +08:00
|
|
|
return 0;
|
|
|
|
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
/* This is our own reference, bail out as we can't clone from it. */
|
|
|
|
if (clone_root->root == bctx->sctx->send_root &&
|
2012-07-26 05:19:24 +08:00
|
|
|
ino == bctx->cur_objectid &&
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
offset == bctx->cur_offset)
|
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure we don't consider clones from send_root that are
|
|
|
|
* behind the current inode/offset.
|
|
|
|
*/
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
if (clone_root->root == bctx->sctx->send_root) {
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
2019-10-30 20:23:11 +08:00
|
|
|
* If the source inode was not yet processed we can't issue a
|
|
|
|
* clone operation, as the source extent does not exist yet at
|
|
|
|
* the destination of the stream.
|
2012-07-26 05:19:24 +08:00
|
|
|
*/
|
2019-10-30 20:23:11 +08:00
|
|
|
if (ino > bctx->cur_objectid)
|
|
|
|
return 0;
|
|
|
|
/*
|
|
|
|
* We clone from the inode currently being sent as long as the
|
|
|
|
* source extent is already processed, otherwise we could try
|
|
|
|
* to clone from an extent that does not exist yet at the
|
|
|
|
* destination of the stream.
|
|
|
|
*/
|
|
|
|
if (ino == bctx->cur_objectid &&
|
Btrfs: send, fix emission of invalid clone operations within the same file
When doing an incremental send and a file has extents shared with itself
at different file offsets, it's possible for send to emit clone operations
that will fail at the destination because the source range goes beyond the
file's current size. This happens when the file size has increased in the
send snapshot, there is a hole between the shared extents and both shared
extents are at file offsets which are greater the file's size in the
parent snapshot.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/1.snap /mnt/sdb/base
# Create a 320K extent at file offset 512K.
$ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
# Clone part of that 320K extent into a lower file offset (192K).
# This file offset is greater than the file's size in the parent
# snapshot (64K). Also the clone range is a bit behind the offset of
# the 320K extent so that we leave a hole between the shared extents.
$ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/1.snap /mnt/sdc
$ btrfs receive -f /tmp/2.snap /mnt/sdc
ERROR: failed to clone extents to foobar: Invalid argument
The problem is that after processing the extent at file offset 256K, which
refers to the first 128K of the 320K extent created by the buffered write
operations, we have 'cur_inode_next_write_offset' set to 384K, which
corresponds to the end offset of the partially shared extent (256K + 128K)
and to the current file size in the receiver. Then when we process the
extent at offset 512K, we do extent backreference iteration to figure out
if we can clone the extent from some other inode or from the same inode,
and we consider the extent at offset 256K of the same inode as a valid
source for a clone operation, which is not correct because at that point
the current file size in the receiver is 384K, which corresponds to the
end of last processed extent (at file offset 256K), so using a clone
source range from 256K to 256K + 320K is invalid because that goes past
the current size of the file (384K) - this makes the receiver get an
-EINVAL error when attempting the clone operation.
So fix this by excluding clone sources that have a range that goes beyond
the current file size in the receiver when iterating extent backreferences.
A test case for fstests follows soon.
Fixes: 11f2069c113e02 ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-30 01:09:53 +08:00
|
|
|
offset + bctx->extent_len >
|
|
|
|
bctx->sctx->cur_inode_next_write_offset)
|
2012-07-26 05:19:24 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
bctx->found++;
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
clone_root->found_ref = true;
|
2022-11-02 00:15:45 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the given backref refers to a file extent item with a larger
|
|
|
|
* number of bytes than what we found before, use the new one so that
|
|
|
|
* we clone more optimally and end up doing less writes and getting
|
|
|
|
* less exclusive, non-shared extents at the destination.
|
|
|
|
*/
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
if (num_bytes > clone_root->num_bytes) {
|
|
|
|
clone_root->ino = ino;
|
|
|
|
clone_root->offset = offset;
|
|
|
|
clone_root->num_bytes = num_bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Found a perfect candidate, so there's no need to continue
|
|
|
|
* backref walking.
|
|
|
|
*/
|
|
|
|
if (num_bytes >= bctx->extent_len)
|
|
|
|
return BTRFS_ITERATE_EXTENT_INODES_STOP;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
static bool lookup_backref_cache(u64 leaf_bytenr, void *ctx,
|
|
|
|
const u64 **root_ids_ret, int *root_count_ret)
|
|
|
|
{
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
struct backref_ctx *bctx = ctx;
|
|
|
|
struct send_ctx *sctx = bctx->sctx;
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
|
|
|
const u64 key = leaf_bytenr >> fs_info->sectorsize_bits;
|
2023-01-11 19:36:13 +08:00
|
|
|
struct btrfs_lru_cache_entry *raw_entry;
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
struct backref_cache_entry *entry;
|
|
|
|
|
2023-01-11 19:36:13 +08:00
|
|
|
if (btrfs_lru_cache_size(&sctx->backref_cache) == 0)
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If relocation happened since we first filled the cache, then we must
|
|
|
|
* empty the cache and can not use it, because even though we operate on
|
|
|
|
* read-only roots, their leaves and nodes may have been reallocated and
|
|
|
|
* now be used for different nodes/leaves of the same tree or some other
|
|
|
|
* tree.
|
|
|
|
*
|
|
|
|
* We are called from iterate_extent_inodes() while either holding a
|
|
|
|
* transaction handle or holding fs_info->commit_root_sem, so no need
|
|
|
|
* to take any lock here.
|
|
|
|
*/
|
2023-01-11 19:36:13 +08:00
|
|
|
if (fs_info->last_reloc_trans > sctx->backref_cache_last_reloc_trans) {
|
|
|
|
btrfs_lru_cache_clear(&sctx->backref_cache);
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:16 +08:00
|
|
|
raw_entry = btrfs_lru_cache_lookup(&sctx->backref_cache, key, 0);
|
2023-01-11 19:36:13 +08:00
|
|
|
if (!raw_entry)
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
return false;
|
|
|
|
|
2023-01-11 19:36:13 +08:00
|
|
|
entry = container_of(raw_entry, struct backref_cache_entry, entry);
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
*root_ids_ret = entry->root_ids;
|
|
|
|
*root_count_ret = entry->num_roots;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void store_backref_cache(u64 leaf_bytenr, const struct ulist *root_ids,
|
|
|
|
void *ctx)
|
|
|
|
{
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
struct backref_ctx *bctx = ctx;
|
|
|
|
struct send_ctx *sctx = bctx->sctx;
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
|
|
|
struct backref_cache_entry *new_entry;
|
|
|
|
struct ulist_iterator uiter;
|
|
|
|
struct ulist_node *node;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We're called while holding a transaction handle or while holding
|
|
|
|
* fs_info->commit_root_sem (at iterate_extent_inodes()), so must do a
|
|
|
|
* NOFS allocation.
|
|
|
|
*/
|
|
|
|
new_entry = kmalloc(sizeof(struct backref_cache_entry), GFP_NOFS);
|
|
|
|
/* No worries, cache is optional. */
|
|
|
|
if (!new_entry)
|
|
|
|
return;
|
|
|
|
|
2023-01-11 19:36:13 +08:00
|
|
|
new_entry->entry.key = leaf_bytenr >> fs_info->sectorsize_bits;
|
2023-01-11 19:36:16 +08:00
|
|
|
new_entry->entry.gen = 0;
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
new_entry->num_roots = 0;
|
|
|
|
ULIST_ITER_INIT(&uiter);
|
|
|
|
while ((node = ulist_next(root_ids, &uiter)) != NULL) {
|
|
|
|
const u64 root_id = node->val;
|
|
|
|
struct clone_root *root;
|
|
|
|
|
|
|
|
root = bsearch((void *)(uintptr_t)root_id, sctx->clone_roots,
|
|
|
|
sctx->clone_roots_cnt, sizeof(struct clone_root),
|
|
|
|
__clone_root_cmp_bsearch);
|
|
|
|
if (!root)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Too many roots, just exit, no worries as caching is optional. */
|
|
|
|
if (new_entry->num_roots >= SEND_MAX_BACKREF_CACHE_ROOTS) {
|
|
|
|
kfree(new_entry);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
new_entry->root_ids[new_entry->num_roots] = root_id;
|
|
|
|
new_entry->num_roots++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We may have not added any roots to the new cache entry, which means
|
|
|
|
* none of the roots is part of the list of roots from which we are
|
|
|
|
* allowed to clone. Cache the new entry as it's still useful to avoid
|
|
|
|
* backref walking to determine which roots have a path to the leaf.
|
2023-01-11 19:36:13 +08:00
|
|
|
*
|
|
|
|
* Also use GFP_NOFS because we're called while holding a transaction
|
|
|
|
* handle or while holding fs_info->commit_root_sem.
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
*/
|
2023-01-11 19:36:13 +08:00
|
|
|
ret = btrfs_lru_cache_store(&sctx->backref_cache, &new_entry->entry,
|
|
|
|
GFP_NOFS);
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
ASSERT(ret == 0 || ret == -ENOMEM);
|
|
|
|
if (ret) {
|
|
|
|
/* Caching is optional, no worries. */
|
|
|
|
kfree(new_entry);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We are called from iterate_extent_inodes() while either holding a
|
|
|
|
* transaction handle or holding fs_info->commit_root_sem, so no need
|
|
|
|
* to take any lock here.
|
|
|
|
*/
|
2023-01-11 19:36:13 +08:00
|
|
|
if (btrfs_lru_cache_size(&sctx->backref_cache) == 1)
|
|
|
|
sctx->backref_cache_last_reloc_trans = fs_info->last_reloc_trans;
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
}
|
|
|
|
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
static int check_extent_item(u64 bytenr, const struct btrfs_extent_item *ei,
|
|
|
|
const struct extent_buffer *leaf, void *ctx)
|
|
|
|
{
|
|
|
|
const u64 refs = btrfs_extent_refs(leaf, ei);
|
|
|
|
const struct backref_ctx *bctx = ctx;
|
|
|
|
const struct send_ctx *sctx = bctx->sctx;
|
|
|
|
|
|
|
|
if (bytenr == bctx->bytenr) {
|
|
|
|
const u64 flags = btrfs_extent_flags(leaf, ei);
|
|
|
|
|
|
|
|
if (WARN_ON(flags & BTRFS_EXTENT_FLAG_TREE_BLOCK))
|
|
|
|
return -EUCLEAN;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have only one reference and only the send root as a
|
|
|
|
* clone source - meaning no clone roots were given in the
|
|
|
|
* struct btrfs_ioctl_send_args passed to the send ioctl - then
|
|
|
|
* it's our reference and there's no point in doing backref
|
|
|
|
* walking which is expensive, so exit early.
|
|
|
|
*/
|
|
|
|
if (refs == 1 && sctx->clone_roots_cnt == 1)
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Backreference walking (iterate_extent_inodes() below) is currently
|
|
|
|
* too expensive when an extent has a large number of references, both
|
|
|
|
* in time spent and used memory. So for now just fallback to write
|
|
|
|
* operations instead of clone operations when an extent has more than
|
|
|
|
* a certain amount of references.
|
|
|
|
*/
|
|
|
|
if (refs > SEND_MAX_EXTENT_REFS)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-11-02 00:15:53 +08:00
|
|
|
static bool skip_self_data_ref(u64 root, u64 ino, u64 offset, void *ctx)
|
|
|
|
{
|
|
|
|
const struct backref_ctx *bctx = ctx;
|
|
|
|
|
|
|
|
if (ino == bctx->cur_objectid &&
|
|
|
|
root == bctx->backref_owner &&
|
|
|
|
offset == bctx->backref_offset)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
2012-07-28 20:11:31 +08:00
|
|
|
* Given an inode, offset and extent item, it finds a good clone for a clone
|
|
|
|
* instruction. Returns -ENOENT when none could be found. The function makes
|
|
|
|
* sure that the returned clone is usable at the point where sending is at the
|
|
|
|
* moment. This means, that no clones are accepted which lie behind the current
|
|
|
|
* inode+offset.
|
|
|
|
*
|
2012-07-26 05:19:24 +08:00
|
|
|
* path must point to the extent item when called.
|
|
|
|
*/
|
|
|
|
static int find_extent_clone(struct send_ctx *sctx,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 ino, u64 data_offset,
|
|
|
|
u64 ino_size,
|
|
|
|
struct clone_root **found)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret;
|
|
|
|
int extent_type;
|
|
|
|
u64 logical;
|
2012-08-08 04:25:13 +08:00
|
|
|
u64 disk_byte;
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 num_bytes;
|
|
|
|
struct btrfs_file_extent_item *fi;
|
|
|
|
struct extent_buffer *eb = path->nodes[0];
|
2022-11-02 00:15:47 +08:00
|
|
|
struct backref_ctx backref_ctx = { 0 };
|
|
|
|
struct btrfs_backref_walk_ctx backref_walk_ctx = { 0 };
|
2012-07-26 05:19:24 +08:00
|
|
|
struct clone_root *cur_clone_root;
|
2012-08-08 04:25:13 +08:00
|
|
|
int compressed;
|
2012-07-26 05:19:24 +08:00
|
|
|
u32 i;
|
|
|
|
|
2022-11-02 00:15:42 +08:00
|
|
|
/*
|
|
|
|
* With fallocate we can get prealloc extents beyond the inode's i_size,
|
|
|
|
* so we don't do anything here because clone operations can not clone
|
|
|
|
* to a range beyond i_size without increasing the i_size of the
|
|
|
|
* destination inode.
|
|
|
|
*/
|
|
|
|
if (data_offset >= ino_size)
|
2022-11-02 00:15:41 +08:00
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2022-11-02 00:15:41 +08:00
|
|
|
fi = btrfs_item_ptr(eb, path->slots[0], struct btrfs_file_extent_item);
|
2012-07-26 05:19:24 +08:00
|
|
|
extent_type = btrfs_file_extent_type(eb, fi);
|
2022-11-02 00:15:41 +08:00
|
|
|
if (extent_type == BTRFS_FILE_EXTENT_INLINE)
|
|
|
|
return -ENOENT;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-08-08 04:25:13 +08:00
|
|
|
disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
|
2022-11-02 00:15:41 +08:00
|
|
|
if (disk_byte == 0)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
compressed = btrfs_file_extent_compression(eb, fi);
|
|
|
|
num_bytes = btrfs_file_extent_num_bytes(eb, fi);
|
2012-08-08 04:25:13 +08:00
|
|
|
logical = disk_byte + btrfs_file_extent_offset(eb, fi);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Setup the clone roots.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < sctx->clone_roots_cnt; i++) {
|
|
|
|
cur_clone_root = sctx->clone_roots + i;
|
|
|
|
cur_clone_root->ino = (u64)-1;
|
|
|
|
cur_clone_root->offset = 0;
|
2022-11-02 00:15:45 +08:00
|
|
|
cur_clone_root->num_bytes = 0;
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
cur_clone_root->found_ref = false;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2021-07-28 05:17:31 +08:00
|
|
|
backref_ctx.sctx = sctx;
|
|
|
|
backref_ctx.cur_objectid = ino;
|
|
|
|
backref_ctx.cur_offset = data_offset;
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
backref_ctx.bytenr = disk_byte;
|
2022-11-02 00:15:53 +08:00
|
|
|
/*
|
|
|
|
* Use the header owner and not the send root's id, because in case of a
|
|
|
|
* snapshot we can have shared subtrees.
|
|
|
|
*/
|
|
|
|
backref_ctx.backref_owner = btrfs_header_owner(eb);
|
|
|
|
backref_ctx.backref_offset = data_offset - btrfs_file_extent_offset(eb, fi);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The last extent of a file may be too large due to page alignment.
|
|
|
|
* We need to adjust extent_len in this case so that the checks in
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
* iterate_backrefs() work.
|
2012-07-26 05:19:24 +08:00
|
|
|
*/
|
|
|
|
if (data_offset + num_bytes >= ino_size)
|
2021-07-28 05:17:31 +08:00
|
|
|
backref_ctx.extent_len = ino_size - data_offset;
|
2022-11-02 00:15:43 +08:00
|
|
|
else
|
|
|
|
backref_ctx.extent_len = num_bytes;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Now collect all backrefs.
|
|
|
|
*/
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
backref_walk_ctx.bytenr = disk_byte;
|
2012-08-08 04:25:13 +08:00
|
|
|
if (compressed == BTRFS_COMPRESS_NONE)
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
backref_walk_ctx.extent_item_pos = btrfs_file_extent_offset(eb, fi);
|
2022-11-02 00:15:47 +08:00
|
|
|
backref_walk_ctx.fs_info = fs_info;
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
backref_walk_ctx.cache_lookup = lookup_backref_cache;
|
|
|
|
backref_walk_ctx.cache_store = store_backref_cache;
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
backref_walk_ctx.indirect_ref_iterator = iterate_backrefs;
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
backref_walk_ctx.check_extent_item = check_extent_item;
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
backref_walk_ctx.user_ctx = &backref_ctx;
|
2012-08-08 04:25:13 +08:00
|
|
|
|
2022-11-02 00:15:53 +08:00
|
|
|
/*
|
|
|
|
* If have a single clone root, then it's the send root and we can tell
|
|
|
|
* the backref walking code to skip our own backref and not resolve it,
|
|
|
|
* since we can not use it for cloning - the source and destination
|
|
|
|
* ranges can't overlap and in case the leaf is shared through a subtree
|
|
|
|
* due to snapshots, we can't use those other roots since they are not
|
|
|
|
* in the list of clone roots.
|
|
|
|
*/
|
|
|
|
if (sctx->clone_roots_cnt == 1)
|
|
|
|
backref_walk_ctx.skip_data_ref = skip_self_data_ref;
|
|
|
|
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
ret = iterate_extent_inodes(&backref_walk_ctx, true, iterate_backrefs,
|
2022-11-02 00:15:47 +08:00
|
|
|
&backref_ctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
return ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
down_read(&fs_info->commit_root_sem);
|
|
|
|
if (fs_info->last_reloc_trans > sctx->last_reloc_trans) {
|
|
|
|
/*
|
|
|
|
* A transaction commit for a transaction in which block group
|
|
|
|
* relocation was done just happened.
|
|
|
|
* The disk_bytenr of the file extent item we processed is
|
|
|
|
* possibly stale, referring to the extent's location before
|
|
|
|
* relocation. So act as if we haven't found any clone sources
|
|
|
|
* and fallback to write commands, which will read the correct
|
|
|
|
* data from the new extent location. Otherwise we will fail
|
|
|
|
* below because we haven't found our own back reference or we
|
|
|
|
* could be getting incorrect sources in case the old extent
|
|
|
|
* was already reallocated after the relocation.
|
|
|
|
*/
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
return -ENOENT;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
}
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info,
|
|
|
|
"find_extent_clone: data_offset=%llu, ino=%llu, num_bytes=%llu, logical=%llu",
|
|
|
|
data_offset, ino, num_bytes, logical);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
if (!backref_ctx.found) {
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "no clones found");
|
btrfs: send: avoid double extent tree search when finding clone source
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:52 +08:00
|
|
|
return -ENOENT;
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
cur_clone_root = NULL;
|
|
|
|
for (i = 0; i < sctx->clone_roots_cnt; i++) {
|
2022-11-02 00:15:45 +08:00
|
|
|
struct clone_root *clone_root = &sctx->clone_roots[i];
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: send: skip unnecessary backref iterations
When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.
Basically what happens currently is this:
1) Call iterate_extent_inodes() to iterate over all the backreferences;
2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();
3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);
4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.
Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:51 +08:00
|
|
|
if (!clone_root->found_ref)
|
2022-11-02 00:15:45 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Choose the root from which we can clone more bytes, to
|
|
|
|
* minimize write operations and therefore have more extent
|
|
|
|
* sharing at the destination (the same as in the source).
|
|
|
|
*/
|
|
|
|
if (!cur_clone_root ||
|
|
|
|
clone_root->num_bytes > cur_clone_root->num_bytes) {
|
|
|
|
cur_clone_root = clone_root;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We found an optimal clone candidate (any inode from
|
|
|
|
* any root is fine), so we're done.
|
|
|
|
*/
|
|
|
|
if (clone_root->num_bytes >= backref_ctx.extent_len)
|
|
|
|
break;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (cur_clone_root) {
|
|
|
|
*found = cur_clone_root;
|
|
|
|
ret = 0;
|
|
|
|
} else {
|
|
|
|
ret = -ENOENT;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
static int read_symlink(struct btrfs_root *root,
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 ino,
|
|
|
|
struct fs_path *dest)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_file_extent_item *ei;
|
|
|
|
u8 type;
|
|
|
|
u8 compression;
|
|
|
|
unsigned long off;
|
|
|
|
int len;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = ino;
|
|
|
|
key.type = BTRFS_EXTENT_DATA_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2016-01-01 02:07:59 +08:00
|
|
|
if (ret) {
|
|
|
|
/*
|
|
|
|
* An empty symlink inode. Can happen in rare error paths when
|
|
|
|
* creating a symlink (transaction committed before the inode
|
|
|
|
* eviction handler removed the symlink inode items and a crash
|
|
|
|
* happened in between or the subvol was snapshoted in between).
|
|
|
|
* Print an informative message to dmesg/syslog so that the user
|
|
|
|
* can delete the symlink.
|
|
|
|
*/
|
|
|
|
btrfs_err(root->fs_info,
|
|
|
|
"Found empty symlink inode %llu at root %llu",
|
|
|
|
ino, root->root_key.objectid);
|
|
|
|
ret = -EIO;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_file_extent_item);
|
|
|
|
type = btrfs_file_extent_type(path->nodes[0], ei);
|
2023-06-12 18:40:59 +08:00
|
|
|
if (unlikely(type != BTRFS_FILE_EXTENT_INLINE)) {
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
btrfs_crit(root->fs_info,
|
|
|
|
"send: found symlink extent that is not inline, ino %llu root %llu extent type %d",
|
|
|
|
ino, btrfs_root_id(root), type);
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
compression = btrfs_file_extent_compression(path->nodes[0], ei);
|
2023-06-12 18:40:59 +08:00
|
|
|
if (unlikely(compression != BTRFS_COMPRESS_NONE)) {
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
btrfs_crit(root->fs_info,
|
|
|
|
"send: found symlink extent with compression, ino %llu root %llu compression type %d",
|
|
|
|
ino, btrfs_root_id(root), compression);
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
off = btrfs_file_extent_inline_start(ei);
|
2018-06-06 15:41:49 +08:00
|
|
|
len = btrfs_file_extent_ram_bytes(path->nodes[0], ei);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = fs_path_add_from_extent_buffer(dest, path->nodes[0], off, len);
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper function to generate a file name that is unique in the root of
|
|
|
|
* send_root and parent_root. This is used to generate names for orphan inodes.
|
|
|
|
*/
|
|
|
|
static int gen_unique_name(struct send_ctx *sctx,
|
|
|
|
u64 ino, u64 gen,
|
|
|
|
struct fs_path *dest)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_dir_item *di;
|
|
|
|
char tmp[64];
|
|
|
|
int len;
|
|
|
|
u64 idx = 0;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
while (1) {
|
2022-10-21 00:58:27 +08:00
|
|
|
struct fscrypt_str tmp_name;
|
2022-10-21 00:58:25 +08:00
|
|
|
|
2014-01-22 07:36:38 +08:00
|
|
|
len = snprintf(tmp, sizeof(tmp), "o%llu-%llu-%llu",
|
2012-07-26 05:19:24 +08:00
|
|
|
ino, gen, idx);
|
2014-02-04 01:24:09 +08:00
|
|
|
ASSERT(len < sizeof(tmp));
|
2022-10-21 00:58:25 +08:00
|
|
|
tmp_name.name = tmp;
|
|
|
|
tmp_name.len = strlen(tmp);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
di = btrfs_lookup_dir_item(NULL, sctx->send_root,
|
|
|
|
path, BTRFS_FIRST_FREE_OBJECTID,
|
2022-10-21 00:58:25 +08:00
|
|
|
&tmp_name, 0);
|
2012-07-26 05:19:24 +08:00
|
|
|
btrfs_release_path(path);
|
|
|
|
if (IS_ERR(di)) {
|
|
|
|
ret = PTR_ERR(di);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (di) {
|
|
|
|
/* not unique, try again */
|
|
|
|
idx++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!sctx->parent_root) {
|
|
|
|
/* unique */
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
di = btrfs_lookup_dir_item(NULL, sctx->parent_root,
|
|
|
|
path, BTRFS_FIRST_FREE_OBJECTID,
|
2022-10-21 00:58:25 +08:00
|
|
|
&tmp_name, 0);
|
2012-07-26 05:19:24 +08:00
|
|
|
btrfs_release_path(path);
|
|
|
|
if (IS_ERR(di)) {
|
|
|
|
ret = PTR_ERR(di);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (di) {
|
|
|
|
/* not unique, try again */
|
|
|
|
idx++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/* unique */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = fs_path_add(dest, tmp, strlen(tmp));
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
enum inode_state {
|
|
|
|
inode_state_no_change,
|
|
|
|
inode_state_will_create,
|
|
|
|
inode_state_did_create,
|
|
|
|
inode_state_will_delete,
|
|
|
|
inode_state_did_delete,
|
|
|
|
};
|
|
|
|
|
2023-01-11 19:36:05 +08:00
|
|
|
static int get_cur_inode_state(struct send_ctx *sctx, u64 ino, u64 gen,
|
|
|
|
u64 *send_gen, u64 *parent_gen)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
int left_ret;
|
|
|
|
int right_ret;
|
|
|
|
u64 left_gen;
|
2023-03-24 10:08:38 +08:00
|
|
|
u64 right_gen = 0;
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
struct btrfs_inode_info info;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
ret = get_inode_info(sctx->send_root, ino, &info);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0 && ret != -ENOENT)
|
|
|
|
goto out;
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
left_ret = (info.nlink == 0) ? -ENOENT : ret;
|
|
|
|
left_gen = info.gen;
|
2023-01-11 19:36:05 +08:00
|
|
|
if (send_gen)
|
|
|
|
*send_gen = ((left_ret == -ENOENT) ? 0 : info.gen);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (!sctx->parent_root) {
|
|
|
|
right_ret = -ENOENT;
|
|
|
|
} else {
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
ret = get_inode_info(sctx->parent_root, ino, &info);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0 && ret != -ENOENT)
|
|
|
|
goto out;
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
right_ret = (info.nlink == 0) ? -ENOENT : ret;
|
|
|
|
right_gen = info.gen;
|
2023-01-11 19:36:05 +08:00
|
|
|
if (parent_gen)
|
|
|
|
*parent_gen = ((right_ret == -ENOENT) ? 0 : info.gen);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!left_ret && !right_ret) {
|
2012-07-28 22:33:49 +08:00
|
|
|
if (left_gen == gen && right_gen == gen) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = inode_state_no_change;
|
2012-07-28 22:33:49 +08:00
|
|
|
} else if (left_gen == gen) {
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ino < sctx->send_progress)
|
|
|
|
ret = inode_state_did_create;
|
|
|
|
else
|
|
|
|
ret = inode_state_will_create;
|
|
|
|
} else if (right_gen == gen) {
|
|
|
|
if (ino < sctx->send_progress)
|
|
|
|
ret = inode_state_did_delete;
|
|
|
|
else
|
|
|
|
ret = inode_state_will_delete;
|
|
|
|
} else {
|
|
|
|
ret = -ENOENT;
|
|
|
|
}
|
|
|
|
} else if (!left_ret) {
|
|
|
|
if (left_gen == gen) {
|
|
|
|
if (ino < sctx->send_progress)
|
|
|
|
ret = inode_state_did_create;
|
|
|
|
else
|
|
|
|
ret = inode_state_will_create;
|
|
|
|
} else {
|
|
|
|
ret = -ENOENT;
|
|
|
|
}
|
|
|
|
} else if (!right_ret) {
|
|
|
|
if (right_gen == gen) {
|
|
|
|
if (ino < sctx->send_progress)
|
|
|
|
ret = inode_state_did_delete;
|
|
|
|
else
|
|
|
|
ret = inode_state_will_delete;
|
|
|
|
} else {
|
|
|
|
ret = -ENOENT;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
ret = -ENOENT;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:05 +08:00
|
|
|
static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen,
|
|
|
|
u64 *send_gen, u64 *parent_gen)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
Btrfs: send, fix failure to rename top level inode due to name collision
Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.
Consider the following example scenario.
Parent snapshot:
. (ino 256, gen 8)
|---- a1/ (ino 257, gen 9)
|---- a2/ (ino 258, gen 9)
Send snapshot:
. (ino 256, gen 3)
|---- a2/ (ino 257, gen 7)
In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (ran in verbose mode, -vv argument):
rmdir a1
mkfile o257-7-0
rename o257-7-0 -> a2
ERROR: rename o257-7-0 -> a2 failed: Is a directory
What happens when computing the incremental send stream is:
1) An operation to remove the directory with inode number 257 and
generation 9 is issued.
2) An operation to create the inode with number 257 and generation 7 is
issued. This creates the inode with an orphanized name of "o257-7-0".
3) An operation rename the new inode 257 to its final name, "a2", is
issued. This is incorrect because inode 258, which has the same name
and it's a child of the same parent (root inode 256), was not yet
processed and therefore no rmdir operation for it was yet issued.
The rename operation is issued because we fail to detect that the
name of the new inode 257 collides with inode 258, because their
parent, a subvolume/snapshot root (inode 256) has a different
generation in both snapshots.
So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.
We can achieve this scenario of different inodes with the same number but
different generation values either by mounting a filesystem with the inode
cache option (-o inode_cache) or by creating and sending snapshots across
different filesystems, like in the following example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/a1
$ mkdir /mnt/a2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/a2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inode from the second snapshot has
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt /tmp/1.snap
# Receive of snapshot snap2 used to fail.
$ btrfs receive /mnt /tmp/2.snap
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-01-05 16:24:55 +08:00
|
|
|
if (ino == BTRFS_FIRST_FREE_OBJECTID)
|
|
|
|
return 1;
|
|
|
|
|
2023-01-11 19:36:05 +08:00
|
|
|
ret = get_cur_inode_state(sctx, ino, gen, send_gen, parent_gen);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (ret == inode_state_no_change ||
|
|
|
|
ret == inode_state_did_create ||
|
|
|
|
ret == inode_state_will_delete)
|
|
|
|
ret = 1;
|
|
|
|
else
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper function to lookup a dir item in a dir.
|
|
|
|
*/
|
|
|
|
static int lookup_dir_item_inode(struct btrfs_root *root,
|
|
|
|
u64 dir, const char *name, int name_len,
|
2021-11-05 08:00:12 +08:00
|
|
|
u64 *found_inode)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_dir_item *di;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_path *path;
|
2022-10-21 00:58:27 +08:00
|
|
|
struct fscrypt_str name_str = FSTR_INIT((char *)name, name_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2022-10-21 00:58:25 +08:00
|
|
|
di = btrfs_lookup_dir_item(NULL, root, path, dir, &name_str, 0);
|
2018-09-12 06:06:26 +08:00
|
|
|
if (IS_ERR_OR_NULL(di)) {
|
|
|
|
ret = di ? PTR_ERR(di) : -ENOENT;
|
2012-07-26 05:19:24 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
btrfs_dir_item_key_to_cpu(path->nodes[0], di, &key);
|
2014-05-25 11:49:24 +08:00
|
|
|
if (key.type == BTRFS_ROOT_ITEM_KEY) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
*found_inode = key.objectid;
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Looks up the first btrfs_inode_ref of a given ino. It returns the parent dir,
|
|
|
|
* generation of the parent dir and the name of the dir entry.
|
|
|
|
*/
|
2013-05-08 15:51:52 +08:00
|
|
|
static int get_first_ref(struct btrfs_root *root, u64 ino,
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 *dir, u64 *dir_gen, struct fs_path *name)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
int len;
|
2012-10-15 16:30:45 +08:00
|
|
|
u64 parent_dir;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = ino;
|
|
|
|
key.type = BTRFS_INODE_REF_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (!ret)
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &found_key,
|
|
|
|
path->slots[0]);
|
2012-10-15 16:30:45 +08:00
|
|
|
if (ret || found_key.objectid != ino ||
|
|
|
|
(found_key.type != BTRFS_INODE_REF_KEY &&
|
|
|
|
found_key.type != BTRFS_INODE_EXTREF_KEY)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2014-05-14 05:01:02 +08:00
|
|
|
if (found_key.type == BTRFS_INODE_REF_KEY) {
|
2012-10-15 16:30:45 +08:00
|
|
|
struct btrfs_inode_ref *iref;
|
|
|
|
iref = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_inode_ref);
|
|
|
|
len = btrfs_inode_ref_name_len(path->nodes[0], iref);
|
|
|
|
ret = fs_path_add_from_extent_buffer(name, path->nodes[0],
|
|
|
|
(unsigned long)(iref + 1),
|
|
|
|
len);
|
|
|
|
parent_dir = found_key.offset;
|
|
|
|
} else {
|
|
|
|
struct btrfs_inode_extref *extref;
|
|
|
|
extref = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_inode_extref);
|
|
|
|
len = btrfs_inode_extref_name_len(path->nodes[0], extref);
|
|
|
|
ret = fs_path_add_from_extent_buffer(name, path->nodes[0],
|
|
|
|
(unsigned long)&extref->name, len);
|
|
|
|
parent_dir = btrfs_inode_extref_parent(path->nodes[0], extref);
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
btrfs_release_path(path);
|
|
|
|
|
2014-03-21 20:46:54 +08:00
|
|
|
if (dir_gen) {
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(root, parent_dir, dir_gen);
|
2014-03-21 20:46:54 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-10-15 16:30:45 +08:00
|
|
|
*dir = parent_dir;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
static int is_first_ref(struct btrfs_root *root,
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 ino, u64 dir,
|
|
|
|
const char *name, int name_len)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct fs_path *tmp_name;
|
|
|
|
u64 tmp_dir;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
tmp_name = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!tmp_name)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2014-03-21 20:46:54 +08:00
|
|
|
ret = get_first_ref(root, ino, &tmp_dir, NULL, tmp_name);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2012-07-28 17:07:18 +08:00
|
|
|
if (dir != tmp_dir || name_len != fs_path_len(tmp_name)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2012-07-28 22:33:49 +08:00
|
|
|
ret = !memcmp(tmp_name->start, name, name_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(tmp_name);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Used by process_recorded_refs to determine if a new ref would overwrite an
|
|
|
|
* already existing ref. In case it detects an overwrite, it returns the
|
|
|
|
* inode/gen in who_ino/who_gen.
|
|
|
|
* When an overwrite is detected, process_recorded_refs does proper orphanizing
|
|
|
|
* to make sure later references to the overwritten inode are possible.
|
|
|
|
* Orphanizing is however only required for the first ref of an inode.
|
|
|
|
* process_recorded_refs does an additional is_first_ref check to see if
|
|
|
|
* orphanizing is really required.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
|
|
|
|
const char *name, int name_len,
|
Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-06-23 03:03:51 +08:00
|
|
|
u64 *who_ino, u64 *who_gen, u64 *who_mode)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
2023-01-11 19:36:04 +08:00
|
|
|
int ret;
|
2023-01-11 19:36:05 +08:00
|
|
|
u64 parent_root_dir_gen;
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 other_inode = 0;
|
2022-08-12 22:42:32 +08:00
|
|
|
struct btrfs_inode_info info;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (!sctx->parent_root)
|
2023-01-11 19:36:04 +08:00
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2023-01-11 19:36:05 +08:00
|
|
|
ret = is_inode_existent(sctx, dir, dir_gen, NULL, &parent_root_dir_gen);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret <= 0)
|
2023-01-11 19:36:04 +08:00
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2013-08-07 04:47:48 +08:00
|
|
|
/*
|
|
|
|
* If we have a parent root we need to verify that the parent dir was
|
2016-05-20 09:18:45 +08:00
|
|
|
* not deleted and then re-created, if it was then we have no overwrite
|
2013-08-07 04:47:48 +08:00
|
|
|
* and we can just unlink this entry.
|
2023-01-11 19:36:05 +08:00
|
|
|
*
|
|
|
|
* @parent_root_dir_gen was set to 0 if the inode does not exist in the
|
|
|
|
* parent root.
|
2013-08-07 04:47:48 +08:00
|
|
|
*/
|
2023-01-11 19:36:05 +08:00
|
|
|
if (sctx->parent_root && dir != BTRFS_FIRST_FREE_OBJECTID &&
|
|
|
|
parent_root_dir_gen != dir_gen)
|
|
|
|
return 0;
|
2013-08-07 04:47:48 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = lookup_dir_item_inode(sctx->parent_root, dir, name, name_len,
|
2021-11-05 08:00:12 +08:00
|
|
|
&other_inode);
|
2023-01-11 19:36:04 +08:00
|
|
|
if (ret == -ENOENT)
|
|
|
|
return 0;
|
|
|
|
else if (ret < 0)
|
|
|
|
return ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Check if the overwritten ref was already processed. If yes, the ref
|
|
|
|
* was already unlinked/moved, so we can safely assume that we will not
|
|
|
|
* overwrite anything at this point in time.
|
|
|
|
*/
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
if (other_inode > sctx->send_progress ||
|
|
|
|
is_waiting_for_move(sctx, other_inode)) {
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_info(sctx->parent_root, other_inode, &info);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
2023-01-11 19:36:04 +08:00
|
|
|
return ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
*who_ino = other_inode;
|
2022-08-12 22:42:32 +08:00
|
|
|
*who_gen = info.gen;
|
|
|
|
*who_mode = info.mode;
|
2023-01-11 19:36:04 +08:00
|
|
|
return 1;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:04 +08:00
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Checks if the ref was overwritten by an already processed inode. This is
|
|
|
|
* used by __get_cur_name_and_parent to find out if the ref was orphanized and
|
|
|
|
* thus the orphan name needs be used.
|
|
|
|
* process_recorded_refs also uses it to avoid unlinking of refs that were
|
|
|
|
* overwritten.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
static int did_overwrite_ref(struct send_ctx *sctx,
|
|
|
|
u64 dir, u64 dir_gen,
|
|
|
|
u64 ino, u64 ino_gen,
|
|
|
|
const char *name, int name_len)
|
|
|
|
{
|
2023-01-11 19:36:02 +08:00
|
|
|
int ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 ow_inode;
|
2023-01-11 19:36:03 +08:00
|
|
|
u64 ow_gen = 0;
|
2023-01-11 19:36:05 +08:00
|
|
|
u64 send_root_dir_gen;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (!sctx->parent_root)
|
2023-01-11 19:36:02 +08:00
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2023-01-11 19:36:05 +08:00
|
|
|
ret = is_inode_existent(sctx, dir, dir_gen, &send_root_dir_gen, NULL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret <= 0)
|
2023-01-11 19:36:02 +08:00
|
|
|
return ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2023-01-11 19:36:05 +08:00
|
|
|
/*
|
|
|
|
* @send_root_dir_gen was set to 0 if the inode does not exist in the
|
|
|
|
* send root.
|
|
|
|
*/
|
|
|
|
if (dir != BTRFS_FIRST_FREE_OBJECTID && send_root_dir_gen != dir_gen)
|
|
|
|
return 0;
|
Btrfs: incremental send, do not issue invalid rmdir operations
When both the parent and send snapshots have a directory inode with the
same number but different generations (therefore they are different
inodes) and both have an entry with the same name, an incremental send
stream will contain an invalid rmdir operation that refers to the
orphanized name of the inode from the parent snapshot.
The following example scenario shows how this happens.
Parent snapshot:
.
|---- d259_old/ (ino 259, gen 9)
| |---- d1/ (ino 258, gen 9)
|
|---- f (ino 257, gen 9)
Send snapshot:
.
|---- d258/ (ino 258, gen 7)
|---- d259/ (ino 259, gen 7)
|---- d1/ (ino 257, gen 7)
When the kernel is processing inode 258 it notices that in both snapshots
there is an inode numbered 259 that is a parent of an inode 258. However
it ignores the fact that the inodes numbered 259 have different generations
in both snapshots, which means they are effectively different inodes.
Then it checks that both inodes 259 have a dentry named "d1" and because
of that it issues a rmdir operation with orphanized name of the inode 258
from the parent snapshot. This happens at send.c:process_record_refs(),
which calls send.c:did_overwrite_first_ref() that returns true and because
of that later on at process_recorded_refs() such rmdir operation is issued
because the inode being currently processed (258) is a directory and it
was deleted in the send snapshot (and replaced with another inode that has
the same number and is a directory too).
Fix this issue by comparing the generations of parent directory inodes
that have the same number and make send.c:did_overwrite_first_ref() when
the generations are different.
The following steps reproduce the problem.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/f
$ mkdir /mnt/d1
$ mkdir /mnt/d259_old
$ mv /mnt/d1 /mnt/d259_old/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/d1
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/d1 /mnt/dir259/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt/ -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inodes from the second snapshot all have
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10. This means all the inodes
# in the first snapshot (snap1) have a generation value of 9.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt -f /tmp/1.snap
$ btrfs receive -vv /mnt -f /tmp/2.snap
receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9
utimes
unlink f
mkdir o257-7-0
mkdir o259-7-0
rename o257-7-0 -> o259-7-0/d1
chown o259-7-0/d1 - uid=0, gid=0
chmod o259-7-0/d1 - mode=0755
utimes o259-7-0/d1
rmdir o258-9-0
ERROR: rmdir o258-9-0 failed: No such file or directory
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-01-05 16:24:58 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/* check if the ref was overwritten by another ref */
|
|
|
|
ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len,
|
2021-11-05 08:00:12 +08:00
|
|
|
&ow_inode);
|
2023-01-11 19:36:02 +08:00
|
|
|
if (ret == -ENOENT) {
|
2012-07-26 05:19:24 +08:00
|
|
|
/* was never and will never be overwritten */
|
2023-01-11 19:36:02 +08:00
|
|
|
return 0;
|
|
|
|
} else if (ret < 0) {
|
|
|
|
return ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:03 +08:00
|
|
|
if (ow_inode == ino) {
|
|
|
|
ret = get_inode_gen(sctx->send_root, ow_inode, &ow_gen);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2023-01-11 19:36:03 +08:00
|
|
|
/* It's the same inode, so no overwrite happened. */
|
|
|
|
if (ow_gen == ino_gen)
|
|
|
|
return 0;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
/*
|
|
|
|
* We know that it is or will be overwritten. Check this now.
|
|
|
|
* The current inode being processed might have been the one that caused
|
2015-09-26 22:30:23 +08:00
|
|
|
* inode 'ino' to be orphanized, therefore check if ow_inode matches
|
|
|
|
* the current inode being processed.
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
*/
|
2023-01-11 19:36:03 +08:00
|
|
|
if (ow_inode < sctx->send_progress)
|
2023-01-11 19:36:02 +08:00
|
|
|
return 1;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2023-01-11 19:36:03 +08:00
|
|
|
if (ino != sctx->cur_ino && ow_inode == sctx->cur_ino) {
|
|
|
|
if (ow_gen == 0) {
|
|
|
|
ret = get_inode_gen(sctx->send_root, ow_inode, &ow_gen);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
if (ow_gen == sctx->cur_inode_gen)
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:02 +08:00
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Same as did_overwrite_ref, but also checks if it is the first ref of an inode
|
|
|
|
* that got overwritten. This is used by process_recorded_refs to determine
|
|
|
|
* if it has to use the path as returned by get_cur_path or the orphan name.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
static int did_overwrite_first_ref(struct send_ctx *sctx, u64 ino, u64 gen)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *name = NULL;
|
|
|
|
u64 dir;
|
|
|
|
u64 dir_gen;
|
|
|
|
|
|
|
|
if (!sctx->parent_root)
|
|
|
|
goto out;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
name = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!name)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = get_first_ref(sctx->parent_root, ino, &dir, &dir_gen, name);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = did_overwrite_ref(sctx, dir, dir_gen, ino, gen,
|
|
|
|
name->start, fs_path_len(name));
|
|
|
|
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(name);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
static inline struct name_cache_entry *name_cache_search(struct send_ctx *sctx,
|
|
|
|
u64 ino, u64 gen)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
2023-01-11 19:36:18 +08:00
|
|
|
struct btrfs_lru_cache_entry *entry;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
entry = btrfs_lru_cache_lookup(&sctx->name_cache, ino, gen);
|
|
|
|
if (!entry)
|
2012-07-26 05:19:24 +08:00
|
|
|
return NULL;
|
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
return container_of(entry, struct name_cache_entry, entry);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Used by get_cur_path for each ref up to the root.
|
|
|
|
* Returns 0 if it succeeded.
|
|
|
|
* Returns 1 if the inode is not existent or got overwritten. In that case, the
|
|
|
|
* name is an orphan name. This instructs get_cur_path to stop iterating. If 1
|
|
|
|
* is returned, parent_ino/parent_gen are not guaranteed to be valid.
|
|
|
|
* Returns <0 in case of error.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
static int __get_cur_name_and_parent(struct send_ctx *sctx,
|
|
|
|
u64 ino, u64 gen,
|
|
|
|
u64 *parent_ino,
|
|
|
|
u64 *parent_gen,
|
|
|
|
struct fs_path *dest)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
int nce_ret;
|
2023-01-11 19:36:18 +08:00
|
|
|
struct name_cache_entry *nce;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* First check if we already did a call to this function with the same
|
|
|
|
* ino/gen. If yes, check if the cache entry is still up-to-date. If yes
|
|
|
|
* return the cached result.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
nce = name_cache_search(sctx, ino, gen);
|
|
|
|
if (nce) {
|
|
|
|
if (ino < sctx->send_progress && nce->need_later_update) {
|
2023-01-11 19:36:18 +08:00
|
|
|
btrfs_lru_cache_remove(&sctx->name_cache, &nce->entry);
|
2012-07-26 05:19:24 +08:00
|
|
|
nce = NULL;
|
|
|
|
} else {
|
|
|
|
*parent_ino = nce->parent_ino;
|
|
|
|
*parent_gen = nce->parent_gen;
|
|
|
|
ret = fs_path_add(dest, nce->name, nce->name_len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = nce->ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* If the inode is not existent yet, add the orphan name and return 1.
|
|
|
|
* This should only happen for the parent dir that we determine in
|
2022-07-12 23:31:22 +08:00
|
|
|
* record_new_ref_if_needed().
|
2012-07-28 20:11:31 +08:00
|
|
|
*/
|
2023-01-11 19:36:05 +08:00
|
|
|
ret = is_inode_existent(sctx, ino, gen, NULL, NULL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (!ret) {
|
|
|
|
ret = gen_unique_name(sctx, ino, gen, dest);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = 1;
|
|
|
|
goto out_cache;
|
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Depending on whether the inode was already processed or not, use
|
|
|
|
* send_root or parent_root for ref lookup.
|
|
|
|
*/
|
2014-02-21 08:01:32 +08:00
|
|
|
if (ino < sctx->send_progress)
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = get_first_ref(sctx->send_root, ino,
|
|
|
|
parent_ino, parent_gen, dest);
|
2012-07-26 05:19:24 +08:00
|
|
|
else
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = get_first_ref(sctx->parent_root, ino,
|
|
|
|
parent_ino, parent_gen, dest);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Check if the ref was overwritten by an inode's ref that was processed
|
|
|
|
* earlier. If yes, treat as orphan and return 1.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = did_overwrite_ref(sctx, *parent_ino, *parent_gen, ino, gen,
|
|
|
|
dest->start, dest->end - dest->start);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
|
|
|
fs_path_reset(dest);
|
|
|
|
ret = gen_unique_name(sctx, ino, gen, dest);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
out_cache:
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Store the result of the lookup in the name cache.
|
|
|
|
*/
|
2016-01-19 01:42:13 +08:00
|
|
|
nce = kmalloc(sizeof(*nce) + fs_path_len(dest) + 1, GFP_KERNEL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!nce) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
nce->entry.key = ino;
|
|
|
|
nce->entry.gen = gen;
|
2012-07-26 05:19:24 +08:00
|
|
|
nce->parent_ino = *parent_ino;
|
|
|
|
nce->parent_gen = *parent_gen;
|
|
|
|
nce->name_len = fs_path_len(dest);
|
|
|
|
nce->ret = ret;
|
|
|
|
strcpy(nce->name, dest->start);
|
|
|
|
|
|
|
|
if (ino < sctx->send_progress)
|
|
|
|
nce->need_later_update = 0;
|
|
|
|
else
|
|
|
|
nce->need_later_update = 1;
|
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
nce_ret = btrfs_lru_cache_store(&sctx->name_cache, &nce->entry, GFP_KERNEL);
|
|
|
|
if (nce_ret < 0) {
|
|
|
|
kfree(nce);
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = nce_ret;
|
2023-01-11 19:36:18 +08:00
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Magic happens here. This function returns the first ref to an inode as it
|
|
|
|
* would look like while receiving the stream at this point in time.
|
|
|
|
* We walk the path up to the root. For every inode in between, we check if it
|
|
|
|
* was already processed/sent. If yes, we continue with the parent as found
|
|
|
|
* in send_root. If not, we continue with the parent as found in parent_root.
|
|
|
|
* If we encounter an inode that was deleted at this point in time, we use the
|
|
|
|
* inodes "orphan" name instead of the real name and stop. Same with new inodes
|
|
|
|
* that were not created yet and overwritten inodes/refs.
|
|
|
|
*
|
2018-11-28 19:05:13 +08:00
|
|
|
* When do we have orphan inodes:
|
2012-07-26 05:19:24 +08:00
|
|
|
* 1. When an inode is freshly created and thus no valid refs are available yet
|
|
|
|
* 2. When a directory lost all it's refs (deleted) but still has dir items
|
|
|
|
* inside which were not processed yet (pending for move/delete). If anyone
|
|
|
|
* tried to get the path to the dir items, it would get a path inside that
|
|
|
|
* orphan directory.
|
|
|
|
* 3. When an inode is moved around or gets new links, it may overwrite the ref
|
|
|
|
* of an unprocessed inode. If in that case the first ref would be
|
|
|
|
* overwritten, the overwritten inode gets "orphanized". Later when we
|
|
|
|
* process this overwritten inode, it is restored at a new place by moving
|
|
|
|
* the orphan inode.
|
|
|
|
*
|
|
|
|
* sctx->send_progress tells this function at which point in time receiving
|
|
|
|
* would be.
|
|
|
|
*/
|
|
|
|
static int get_cur_path(struct send_ctx *sctx, u64 ino, u64 gen,
|
|
|
|
struct fs_path *dest)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *name = NULL;
|
|
|
|
u64 parent_inode = 0;
|
|
|
|
u64 parent_gen = 0;
|
|
|
|
int stop = 0;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
name = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!name) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
dest->reversed = 1;
|
|
|
|
fs_path_reset(dest);
|
|
|
|
|
|
|
|
while (!stop && ino != BTRFS_FIRST_FREE_OBJECTID) {
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
struct waiting_dir_move *wdm;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
fs_path_reset(name);
|
|
|
|
|
2020-12-10 20:09:02 +08:00
|
|
|
if (is_waiting_for_rm(sctx, ino, gen)) {
|
2014-02-19 22:31:44 +08:00
|
|
|
ret = gen_unique_name(sctx, ino, gen, name);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = fs_path_add_path(dest, name);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
wdm = get_waiting_dir_move(sctx, ino);
|
|
|
|
if (wdm && wdm->orphanized) {
|
|
|
|
ret = gen_unique_name(sctx, ino, gen, name);
|
|
|
|
stop = 1;
|
|
|
|
} else if (wdm) {
|
2014-02-21 08:01:32 +08:00
|
|
|
ret = get_first_ref(sctx->parent_root, ino,
|
|
|
|
&parent_inode, &parent_gen, name);
|
|
|
|
} else {
|
|
|
|
ret = __get_cur_name_and_parent(sctx, ino, gen,
|
|
|
|
&parent_inode,
|
|
|
|
&parent_gen, name);
|
|
|
|
if (ret)
|
|
|
|
stop = 1;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = fs_path_add_path(dest, name);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ino = parent_inode;
|
|
|
|
gen = parent_gen;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(name);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!ret)
|
|
|
|
fs_path_unreverse(dest);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sends a BTRFS_SEND_C_SUBVOL command/item to userspace
|
|
|
|
*/
|
|
|
|
static int send_subvol_begin(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_root *send_root = sctx->send_root;
|
|
|
|
struct btrfs_root *parent_root = sctx->parent_root;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_root_ref *ref;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
char *name = NULL;
|
|
|
|
int namelen;
|
|
|
|
|
2014-01-15 00:26:43 +08:00
|
|
|
path = btrfs_alloc_path();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2016-01-19 01:42:13 +08:00
|
|
|
name = kmalloc(BTRFS_PATH_NAME_MAX, GFP_KERNEL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!name) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2018-08-06 13:25:24 +08:00
|
|
|
key.objectid = send_root->root_key.objectid;
|
2012-07-26 05:19:24 +08:00
|
|
|
key.type = BTRFS_ROOT_BACKREF_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot_for_read(send_root->fs_info->tree_root,
|
|
|
|
&key, path, 1, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (key.type != BTRFS_ROOT_BACKREF_KEY ||
|
2018-08-06 13:25:24 +08:00
|
|
|
key.objectid != send_root->root_key.objectid) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_root_ref);
|
|
|
|
namelen = btrfs_root_ref_name_len(leaf, ref);
|
|
|
|
read_extent_buffer(leaf, name, (unsigned long)(ref + 1), namelen);
|
|
|
|
btrfs_release_path(path);
|
|
|
|
|
|
|
|
if (parent_root) {
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_SNAPSHOT);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
} else {
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_SUBVOL);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
TLV_PUT_STRING(sctx, BTRFS_SEND_A_PATH, name, namelen);
|
2015-10-01 03:23:33 +08:00
|
|
|
|
|
|
|
if (!btrfs_is_empty_uuid(sctx->send_root->root_item.received_uuid))
|
|
|
|
TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID,
|
|
|
|
sctx->send_root->root_item.received_uuid);
|
|
|
|
else
|
|
|
|
TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID,
|
|
|
|
sctx->send_root->root_item.uuid);
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_CTRANSID,
|
2020-09-15 20:30:15 +08:00
|
|
|
btrfs_root_ctransid(&sctx->send_root->root_item));
|
2012-07-26 05:19:24 +08:00
|
|
|
if (parent_root) {
|
2015-06-05 05:17:25 +08:00
|
|
|
if (!btrfs_is_empty_uuid(parent_root->root_item.received_uuid))
|
|
|
|
TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
|
|
|
|
parent_root->root_item.received_uuid);
|
|
|
|
else
|
|
|
|
TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
|
|
|
|
parent_root->root_item.uuid);
|
2012-07-26 05:19:24 +08:00
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID,
|
2020-09-15 20:30:15 +08:00
|
|
|
btrfs_root_ctransid(&sctx->parent_root->root_item));
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
kfree(name);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int send_truncate(struct send_ctx *sctx, u64 ino, u64 gen, u64 size)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_truncate %llu size=%llu", ino, size);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_TRUNCATE);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, ino, gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, size);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int send_chmod(struct send_ctx *sctx, u64 ino, u64 gen, u64 mode)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_chmod %llu mode=%llu", ino, mode);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_CHMOD);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, ino, gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_MODE, mode & 07777);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-05-19 00:02:55 +08:00
|
|
|
static int send_fileattr(struct send_ctx *sctx, u64 ino, u64 gen, u64 fileattr)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p;
|
|
|
|
|
|
|
|
if (sctx->proto < 2)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
btrfs_debug(fs_info, "send_fileattr %llu fileattr=%llu", ino, fileattr);
|
|
|
|
|
|
|
|
p = fs_path_alloc();
|
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_FILEATTR);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, ino, gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_FILEATTR, fileattr);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
fs_path_free(p);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
static int send_chown(struct send_ctx *sctx, u64 ino, u64 gen, u64 uid, u64 gid)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_chown %llu uid=%llu, gid=%llu",
|
|
|
|
ino, uid, gid);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_CHOWN);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, ino, gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_UID, uid);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_GID, gid);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int send_utimes(struct send_ctx *sctx, u64 ino, u64 gen)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p = NULL;
|
|
|
|
struct btrfs_inode_item *ii;
|
|
|
|
struct btrfs_path *path = NULL;
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int slot;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_utimes %llu", ino);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = ino;
|
|
|
|
key.type = BTRFS_INODE_ITEM_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
ret = btrfs_search_slot(NULL, sctx->send_root, &key, path, 0, 0);
|
2016-07-02 12:43:46 +08:00
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
eb = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
ii = btrfs_item_ptr(eb, slot, struct btrfs_inode_item);
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_UTIMES);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, ino, gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
2014-12-13 00:39:12 +08:00
|
|
|
TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_ATIME, eb, &ii->atime);
|
|
|
|
TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_MTIME, eb, &ii->mtime);
|
|
|
|
TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_CTIME, eb, &ii->ctime);
|
2022-05-17 22:50:30 +08:00
|
|
|
if (sctx->proto >= 2)
|
|
|
|
TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_OTIME, eb, &ii->otime);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
/*
|
|
|
|
* If the cache is full, we can't remove entries from it and do a call to
|
|
|
|
* send_utimes() for each respective inode, because we might be finishing
|
|
|
|
* processing an inode that is a directory and it just got renamed, and existing
|
|
|
|
* entries in the cache may refer to inodes that have the directory in their
|
|
|
|
* full path - in which case we would generate outdated paths (pre-rename)
|
|
|
|
* for the inodes that the cache entries point to. Instead of prunning the
|
|
|
|
* cache when inserting, do it after we finish processing each inode at
|
|
|
|
* finish_inode_if_needed().
|
|
|
|
*/
|
|
|
|
static int cache_dir_utimes(struct send_ctx *sctx, u64 dir, u64 gen)
|
|
|
|
{
|
|
|
|
struct btrfs_lru_cache_entry *entry;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
entry = btrfs_lru_cache_lookup(&sctx->dir_utimes_cache, dir, gen);
|
|
|
|
if (entry != NULL)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Caching is optional, don't fail if we can't allocate memory. */
|
|
|
|
entry = kmalloc(sizeof(*entry), GFP_KERNEL);
|
|
|
|
if (!entry)
|
|
|
|
return send_utimes(sctx, dir, gen);
|
|
|
|
|
|
|
|
entry->key = dir;
|
|
|
|
entry->gen = gen;
|
|
|
|
|
|
|
|
ret = btrfs_lru_cache_store(&sctx->dir_utimes_cache, entry, GFP_KERNEL);
|
|
|
|
ASSERT(ret != -EEXIST);
|
|
|
|
if (ret) {
|
|
|
|
kfree(entry);
|
|
|
|
return send_utimes(sctx, dir, gen);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int trim_dir_utimes_cache(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
while (btrfs_lru_cache_size(&sctx->dir_utimes_cache) >
|
|
|
|
SEND_MAX_DIR_UTIMES_CACHE_SIZE) {
|
|
|
|
struct btrfs_lru_cache_entry *lru;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
lru = btrfs_lru_cache_lru_entry(&sctx->dir_utimes_cache);
|
|
|
|
ASSERT(lru != NULL);
|
|
|
|
|
|
|
|
ret = send_utimes(sctx, lru->key, lru->gen);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
btrfs_lru_cache_remove(&sctx->dir_utimes_cache, lru);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* Sends a BTRFS_SEND_C_MKXXX or SYMLINK command to user space. We don't have
|
|
|
|
* a valid path yet because we did not process the refs yet. So, the inode
|
|
|
|
* is created as orphan.
|
|
|
|
*/
|
2012-07-28 16:42:24 +08:00
|
|
|
static int send_create_inode(struct send_ctx *sctx, u64 ino)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p;
|
|
|
|
int cmd;
|
2022-08-12 22:42:32 +08:00
|
|
|
struct btrfs_inode_info info;
|
2012-07-28 16:42:24 +08:00
|
|
|
u64 gen;
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 mode;
|
2012-07-28 16:42:24 +08:00
|
|
|
u64 rdev;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_create_inode %llu", ino);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2014-02-27 17:29:01 +08:00
|
|
|
if (ino != sctx->cur_ino) {
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_info(sctx->send_root, ino, &info);
|
2014-02-27 17:29:01 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2022-08-12 22:42:32 +08:00
|
|
|
gen = info.gen;
|
|
|
|
mode = info.mode;
|
|
|
|
rdev = info.rdev;
|
2014-02-27 17:29:01 +08:00
|
|
|
} else {
|
|
|
|
gen = sctx->cur_inode_gen;
|
|
|
|
mode = sctx->cur_inode_mode;
|
|
|
|
rdev = sctx->cur_inode_rdev;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-07-28 22:33:49 +08:00
|
|
|
if (S_ISREG(mode)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
cmd = BTRFS_SEND_C_MKFILE;
|
2012-07-28 22:33:49 +08:00
|
|
|
} else if (S_ISDIR(mode)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
cmd = BTRFS_SEND_C_MKDIR;
|
2012-07-28 22:33:49 +08:00
|
|
|
} else if (S_ISLNK(mode)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
cmd = BTRFS_SEND_C_SYMLINK;
|
2012-07-28 22:33:49 +08:00
|
|
|
} else if (S_ISCHR(mode) || S_ISBLK(mode)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
cmd = BTRFS_SEND_C_MKNOD;
|
2012-07-28 22:33:49 +08:00
|
|
|
} else if (S_ISFIFO(mode)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
cmd = BTRFS_SEND_C_MKFIFO;
|
2012-07-28 22:33:49 +08:00
|
|
|
} else if (S_ISSOCK(mode)) {
|
2012-07-26 05:19:24 +08:00
|
|
|
cmd = BTRFS_SEND_C_MKSOCK;
|
2012-07-28 22:33:49 +08:00
|
|
|
} else {
|
2015-10-08 17:37:06 +08:00
|
|
|
btrfs_warn(sctx->send_root->fs_info, "unexpected inode type %o",
|
2012-07-26 05:19:24 +08:00
|
|
|
(int)(mode & S_IFMT));
|
2016-01-22 08:13:25 +08:00
|
|
|
ret = -EOPNOTSUPP;
|
2012-07-26 05:19:24 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, cmd);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2012-07-28 16:42:24 +08:00
|
|
|
ret = gen_unique_name(sctx, ino, gen, p);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
2012-07-28 16:42:24 +08:00
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_INO, ino);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (S_ISLNK(mode)) {
|
|
|
|
fs_path_reset(p);
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = read_symlink(sctx->send_root, ino, p);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, p);
|
|
|
|
} else if (S_ISCHR(mode) || S_ISBLK(mode) ||
|
|
|
|
S_ISFIFO(mode) || S_ISSOCK(mode)) {
|
2012-10-16 02:28:46 +08:00
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_RDEV, new_encode_dev(rdev));
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_MODE, mode);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:15 +08:00
|
|
|
static void cache_dir_created(struct send_ctx *sctx, u64 dir)
|
|
|
|
{
|
|
|
|
struct btrfs_lru_cache_entry *entry;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* Caching is optional, ignore any failures. */
|
|
|
|
entry = kmalloc(sizeof(*entry), GFP_KERNEL);
|
|
|
|
if (!entry)
|
|
|
|
return;
|
|
|
|
|
|
|
|
entry->key = dir;
|
2023-01-11 19:36:16 +08:00
|
|
|
entry->gen = 0;
|
2023-01-11 19:36:15 +08:00
|
|
|
ret = btrfs_lru_cache_store(&sctx->dir_created_cache, entry, GFP_KERNEL);
|
|
|
|
if (ret < 0)
|
|
|
|
kfree(entry);
|
|
|
|
}
|
|
|
|
|
2012-07-28 16:42:24 +08:00
|
|
|
/*
|
|
|
|
* We need some special handling for inodes that get processed before the parent
|
|
|
|
* directory got created. See process_recorded_refs for details.
|
|
|
|
* This function does the check if we already created the dir out of order.
|
|
|
|
*/
|
|
|
|
static int did_create_dir(struct send_ctx *sctx, u64 dir)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
2022-03-09 21:50:43 +08:00
|
|
|
int iter_ret = 0;
|
2012-07-28 16:42:24 +08:00
|
|
|
struct btrfs_path *path = NULL;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct btrfs_key di_key;
|
|
|
|
struct btrfs_dir_item *di;
|
|
|
|
|
2023-01-11 19:36:16 +08:00
|
|
|
if (btrfs_lru_cache_lookup(&sctx->dir_created_cache, dir, 0))
|
2023-01-11 19:36:15 +08:00
|
|
|
return 1;
|
|
|
|
|
2012-07-28 16:42:24 +08:00
|
|
|
path = alloc_path_for_send();
|
2022-03-09 21:50:43 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2012-07-28 16:42:24 +08:00
|
|
|
|
|
|
|
key.objectid = dir;
|
|
|
|
key.type = BTRFS_DIR_INDEX_KEY;
|
|
|
|
key.offset = 0;
|
2014-02-06 00:48:56 +08:00
|
|
|
|
2022-03-09 21:50:43 +08:00
|
|
|
btrfs_for_each_slot(sctx->send_root, &key, &found_key, path, iter_ret) {
|
|
|
|
struct extent_buffer *eb = path->nodes[0];
|
2014-02-06 00:48:56 +08:00
|
|
|
|
|
|
|
if (found_key.objectid != key.objectid ||
|
2012-07-28 16:42:24 +08:00
|
|
|
found_key.type != key.type) {
|
|
|
|
ret = 0;
|
2022-03-09 21:50:43 +08:00
|
|
|
break;
|
2012-07-28 16:42:24 +08:00
|
|
|
}
|
|
|
|
|
2022-03-09 21:50:43 +08:00
|
|
|
di = btrfs_item_ptr(eb, path->slots[0], struct btrfs_dir_item);
|
2012-07-28 16:42:24 +08:00
|
|
|
btrfs_dir_item_key_to_cpu(eb, di, &di_key);
|
|
|
|
|
2013-08-12 22:56:14 +08:00
|
|
|
if (di_key.type != BTRFS_ROOT_ITEM_KEY &&
|
|
|
|
di_key.objectid < sctx->send_progress) {
|
2012-07-28 16:42:24 +08:00
|
|
|
ret = 1;
|
2023-01-11 19:36:15 +08:00
|
|
|
cache_dir_created(sctx, dir);
|
2022-03-09 21:50:43 +08:00
|
|
|
break;
|
2012-07-28 16:42:24 +08:00
|
|
|
}
|
|
|
|
}
|
2022-03-09 21:50:43 +08:00
|
|
|
/* Catch error found during iteration */
|
|
|
|
if (iter_ret < 0)
|
|
|
|
ret = iter_ret;
|
2012-07-28 16:42:24 +08:00
|
|
|
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Only creates the inode if it is:
|
|
|
|
* 1. Not a directory
|
|
|
|
* 2. Or a directory which was not created already due to out of order
|
|
|
|
* directories. See did_create_dir and process_recorded_refs for details.
|
|
|
|
*/
|
|
|
|
static int send_create_inode_if_needed(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (S_ISDIR(sctx->cur_inode_mode)) {
|
|
|
|
ret = did_create_dir(sctx, sctx->cur_ino);
|
|
|
|
if (ret < 0)
|
2021-08-02 07:35:49 +08:00
|
|
|
return ret;
|
|
|
|
else if (ret > 0)
|
|
|
|
return 0;
|
2012-07-28 16:42:24 +08:00
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:15 +08:00
|
|
|
ret = send_create_inode(sctx, sctx->cur_ino);
|
|
|
|
|
|
|
|
if (ret == 0 && S_ISDIR(sctx->cur_inode_mode))
|
|
|
|
cache_dir_created(sctx, sctx->cur_ino);
|
|
|
|
|
|
|
|
return ret;
|
2012-07-28 16:42:24 +08:00
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
struct recorded_ref {
|
|
|
|
struct list_head list;
|
|
|
|
char *name;
|
|
|
|
struct fs_path *full_path;
|
|
|
|
u64 dir;
|
|
|
|
u64 dir_gen;
|
|
|
|
int name_len;
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
struct rb_node node;
|
|
|
|
struct rb_root *root;
|
2012-07-26 05:19:24 +08:00
|
|
|
};
|
|
|
|
|
2022-07-12 09:36:31 +08:00
|
|
|
static struct recorded_ref *recorded_ref_alloc(void)
|
|
|
|
{
|
|
|
|
struct recorded_ref *ref;
|
|
|
|
|
|
|
|
ref = kzalloc(sizeof(*ref), GFP_KERNEL);
|
|
|
|
if (!ref)
|
|
|
|
return NULL;
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
RB_CLEAR_NODE(&ref->node);
|
2022-07-12 09:36:31 +08:00
|
|
|
INIT_LIST_HEAD(&ref->list);
|
|
|
|
return ref;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void recorded_ref_free(struct recorded_ref *ref)
|
|
|
|
{
|
|
|
|
if (!ref)
|
|
|
|
return;
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
if (!RB_EMPTY_NODE(&ref->node))
|
|
|
|
rb_erase(&ref->node, ref->root);
|
2022-07-12 09:36:31 +08:00
|
|
|
list_del(&ref->list);
|
|
|
|
fs_path_free(ref->full_path);
|
|
|
|
kfree(ref);
|
|
|
|
}
|
|
|
|
|
Btrfs: incremental send, fix invalid path for unlink commands
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- file1 (ino 259)
| |--- file3 (ino 261)
|
|--- dir3/ (ino 262)
|--- file22 (ino 260)
|--- dir4/ (ino 263)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
|--- dir3 (ino 260)
|--- file3/ (ino 262)
|--- dir4/ (ino 263)
|--- file11 (ino 269)
|--- file33 (ino 261)
When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:
receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
utimes
utimes dir1
utimes dir1/dir2
link dir1/dir3/dir4/file11 -> dir1/dir2/file1
unlink dir1/dir2/file1
utimes dir1/dir2
truncate dir1/dir3/dir4/file11 size=0
utimes dir1/dir3/dir4/file11
rename dir1/dir3 -> o262-7-0
link dir1/dir3 -> o262-7-0/file22
unlink dir1/dir3/file22
ERROR: unlink dir1/dir3/file22 failed. Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) Before we start processing the new and deleted references for inode
260, we compute the full path of the deleted reference
("dir1/dir3/file22") and cache it in the list of deleted references
for our inode.
2) We then start processing the new references for inode 260, for which
there is only one new, located at "dir1/dir3". When processing this
new reference, we check that inode 262, which was not yet processed,
collides with the new reference and because of that we orphanize
inode 262 so its new full path becomes "o262-7-0".
3) After the orphanization of inode 262, we create the new reference for
inode 260 by issuing a link command with a target path of "dir1/dir3"
and a source path of "o262-7-0/file22".
4) We then start processing the deleted references for inode 260, for
which there is only one with the base name of "file22", and issue
an unlink operation containing the target path computed at step 1,
which is wrong because that path no longer exists and should be
replaced with "o262-7-0/file22".
So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-13 21:13:11 +08:00
|
|
|
static void set_ref_path(struct recorded_ref *ref, struct fs_path *path)
|
|
|
|
{
|
|
|
|
ref->full_path = path;
|
|
|
|
ref->name = (char *)kbasename(ref->full_path->start);
|
|
|
|
ref->name_len = ref->full_path->end - ref->name;
|
|
|
|
}
|
|
|
|
|
2013-08-17 04:52:55 +08:00
|
|
|
static int dup_ref(struct recorded_ref *ref, struct list_head *list)
|
|
|
|
{
|
|
|
|
struct recorded_ref *new;
|
|
|
|
|
2022-07-12 09:36:31 +08:00
|
|
|
new = recorded_ref_alloc();
|
2013-08-17 04:52:55 +08:00
|
|
|
if (!new)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
new->dir = ref->dir;
|
|
|
|
new->dir_gen = ref->dir_gen;
|
|
|
|
list_add_tail(&new->list, list);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
static void __free_recorded_refs(struct list_head *head)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
struct recorded_ref *cur;
|
|
|
|
|
2012-07-28 22:33:49 +08:00
|
|
|
while (!list_empty(head)) {
|
|
|
|
cur = list_entry(head->next, struct recorded_ref, list);
|
2022-07-12 09:36:31 +08:00
|
|
|
recorded_ref_free(cur);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_recorded_refs(struct send_ctx *sctx)
|
|
|
|
{
|
2013-05-08 15:51:52 +08:00
|
|
|
__free_recorded_refs(&sctx->new_refs);
|
|
|
|
__free_recorded_refs(&sctx->deleted_refs);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2012-07-28 20:11:31 +08:00
|
|
|
* Renames/moves a file/dir to its orphan name. Used when the first
|
2012-07-26 05:19:24 +08:00
|
|
|
* ref of an unprocessed inode gets overwritten and for all non empty
|
|
|
|
* directories.
|
|
|
|
*/
|
|
|
|
static int orphanize_inode(struct send_ctx *sctx, u64 ino, u64 gen,
|
|
|
|
struct fs_path *path)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct fs_path *orphan;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
orphan = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!orphan)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = gen_unique_name(sctx, ino, gen, orphan);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = send_rename(sctx, path, orphan);
|
|
|
|
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(orphan);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-12-10 20:09:02 +08:00
|
|
|
static struct orphan_dir_info *add_orphan_dir_info(struct send_ctx *sctx,
|
|
|
|
u64 dir_ino, u64 dir_gen)
|
2014-02-19 22:31:44 +08:00
|
|
|
{
|
|
|
|
struct rb_node **p = &sctx->orphan_dirs.rb_node;
|
|
|
|
struct rb_node *parent = NULL;
|
|
|
|
struct orphan_dir_info *entry, *odi;
|
|
|
|
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
|
|
|
entry = rb_entry(parent, struct orphan_dir_info, node);
|
2020-12-10 20:09:02 +08:00
|
|
|
if (dir_ino < entry->ino)
|
2014-02-19 22:31:44 +08:00
|
|
|
p = &(*p)->rb_left;
|
2020-12-10 20:09:02 +08:00
|
|
|
else if (dir_ino > entry->ino)
|
2014-02-19 22:31:44 +08:00
|
|
|
p = &(*p)->rb_right;
|
2020-12-10 20:09:02 +08:00
|
|
|
else if (dir_gen < entry->gen)
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
else if (dir_gen > entry->gen)
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
else
|
2014-02-19 22:31:44 +08:00
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
|
2018-05-08 18:11:37 +08:00
|
|
|
odi = kmalloc(sizeof(*odi), GFP_KERNEL);
|
|
|
|
if (!odi)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
odi->ino = dir_ino;
|
2020-12-10 20:09:02 +08:00
|
|
|
odi->gen = dir_gen;
|
2018-05-08 18:11:38 +08:00
|
|
|
odi->last_dir_index_offset = 0;
|
2023-01-11 19:36:09 +08:00
|
|
|
odi->dir_high_seq_ino = 0;
|
2018-05-08 18:11:37 +08:00
|
|
|
|
2014-02-19 22:31:44 +08:00
|
|
|
rb_link_node(&odi->node, parent, p);
|
|
|
|
rb_insert_color(&odi->node, &sctx->orphan_dirs);
|
|
|
|
return odi;
|
|
|
|
}
|
|
|
|
|
2020-12-10 20:09:02 +08:00
|
|
|
static struct orphan_dir_info *get_orphan_dir_info(struct send_ctx *sctx,
|
|
|
|
u64 dir_ino, u64 gen)
|
2014-02-19 22:31:44 +08:00
|
|
|
{
|
|
|
|
struct rb_node *n = sctx->orphan_dirs.rb_node;
|
|
|
|
struct orphan_dir_info *entry;
|
|
|
|
|
|
|
|
while (n) {
|
|
|
|
entry = rb_entry(n, struct orphan_dir_info, node);
|
|
|
|
if (dir_ino < entry->ino)
|
|
|
|
n = n->rb_left;
|
|
|
|
else if (dir_ino > entry->ino)
|
|
|
|
n = n->rb_right;
|
2020-12-10 20:09:02 +08:00
|
|
|
else if (gen < entry->gen)
|
|
|
|
n = n->rb_left;
|
|
|
|
else if (gen > entry->gen)
|
|
|
|
n = n->rb_right;
|
2014-02-19 22:31:44 +08:00
|
|
|
else
|
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2020-12-10 20:09:02 +08:00
|
|
|
static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 gen)
|
2014-02-19 22:31:44 +08:00
|
|
|
{
|
2020-12-10 20:09:02 +08:00
|
|
|
struct orphan_dir_info *odi = get_orphan_dir_info(sctx, dir_ino, gen);
|
2014-02-19 22:31:44 +08:00
|
|
|
|
|
|
|
return odi != NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_orphan_dir_info(struct send_ctx *sctx,
|
|
|
|
struct orphan_dir_info *odi)
|
|
|
|
{
|
|
|
|
if (!odi)
|
|
|
|
return;
|
|
|
|
rb_erase(&odi->node, &sctx->orphan_dirs);
|
|
|
|
kfree(odi);
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* Returns 1 if a directory can be removed at this point in time.
|
|
|
|
* We check this by iterating all dir items and checking if the inode behind
|
|
|
|
* the dir item was already processed.
|
|
|
|
*/
|
2023-01-11 19:36:06 +08:00
|
|
|
static int can_rmdir(struct send_ctx *sctx, u64 dir, u64 dir_gen)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
2022-03-09 21:50:44 +08:00
|
|
|
int iter_ret = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_root *root = sctx->parent_root;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct btrfs_key loc;
|
|
|
|
struct btrfs_dir_item *di;
|
2018-05-08 18:11:38 +08:00
|
|
|
struct orphan_dir_info *odi = NULL;
|
2023-01-11 19:36:09 +08:00
|
|
|
u64 dir_high_seq_ino = 0;
|
|
|
|
u64 last_dir_index_offset = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-08-01 20:48:59 +08:00
|
|
|
/*
|
|
|
|
* Don't try to rmdir the top/root subvolume dir.
|
|
|
|
*/
|
|
|
|
if (dir == BTRFS_FIRST_FREE_OBJECTID)
|
|
|
|
return 0;
|
|
|
|
|
2023-01-11 19:36:09 +08:00
|
|
|
odi = get_orphan_dir_info(sctx, dir, dir_gen);
|
|
|
|
if (odi && sctx->cur_ino < odi->dir_high_seq_ino)
|
|
|
|
return 0;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2023-01-11 19:36:09 +08:00
|
|
|
if (!odi) {
|
|
|
|
/*
|
|
|
|
* Find the inode number associated with the last dir index
|
|
|
|
* entry. This is very likely the inode with the highest number
|
|
|
|
* of all inodes that have an entry in the directory. We can
|
|
|
|
* then use it to avoid future calls to can_rmdir(), when
|
|
|
|
* processing inodes with a lower number, from having to search
|
|
|
|
* the parent root b+tree for dir index keys.
|
|
|
|
*/
|
|
|
|
key.objectid = dir;
|
|
|
|
key.type = BTRFS_DIR_INDEX_KEY;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (ret > 0) {
|
|
|
|
/* Can't happen, the root is never empty. */
|
|
|
|
ASSERT(path->slots[0] > 0);
|
|
|
|
if (WARN_ON(path->slots[0] == 0)) {
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
path->slots[0]--;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
|
|
|
if (key.objectid != dir || key.type != BTRFS_DIR_INDEX_KEY) {
|
|
|
|
/* No index keys, dir can be removed. */
|
|
|
|
ret = 1;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
di = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_dir_item);
|
|
|
|
btrfs_dir_item_key_to_cpu(path->nodes[0], di, &loc);
|
|
|
|
dir_high_seq_ino = loc.objectid;
|
|
|
|
if (sctx->cur_ino < dir_high_seq_ino) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_release_path(path);
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
key.objectid = dir;
|
|
|
|
key.type = BTRFS_DIR_INDEX_KEY;
|
2023-01-11 19:36:09 +08:00
|
|
|
key.offset = (odi ? odi->last_dir_index_offset : 0);
|
2018-05-08 18:11:38 +08:00
|
|
|
|
2022-03-09 21:50:44 +08:00
|
|
|
btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) {
|
2014-02-19 22:31:44 +08:00
|
|
|
struct waiting_dir_move *dm;
|
|
|
|
|
2014-02-06 00:48:56 +08:00
|
|
|
if (found_key.objectid != key.objectid ||
|
|
|
|
found_key.type != key.type)
|
2012-07-26 05:19:24 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
di = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_dir_item);
|
|
|
|
btrfs_dir_item_key_to_cpu(path->nodes[0], di, &loc);
|
|
|
|
|
2023-01-11 19:36:09 +08:00
|
|
|
dir_high_seq_ino = max(dir_high_seq_ino, loc.objectid);
|
|
|
|
last_dir_index_offset = found_key.offset;
|
|
|
|
|
2014-02-19 22:31:44 +08:00
|
|
|
dm = get_waiting_dir_move(sctx, loc.objectid);
|
|
|
|
if (dm) {
|
|
|
|
dm->rmdir_ino = dir;
|
2020-12-10 20:09:02 +08:00
|
|
|
dm->rmdir_gen = dir_gen;
|
2014-02-19 22:31:44 +08:00
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2023-01-11 19:36:06 +08:00
|
|
|
if (loc.objectid > sctx->cur_ino) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
2022-03-09 21:50:44 +08:00
|
|
|
}
|
|
|
|
if (iter_ret < 0) {
|
|
|
|
ret = iter_ret;
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
2018-05-08 18:11:38 +08:00
|
|
|
free_orphan_dir_info(sctx, odi);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = 1;
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
2023-01-11 19:36:07 +08:00
|
|
|
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2023-01-11 19:36:08 +08:00
|
|
|
if (!odi) {
|
|
|
|
odi = add_orphan_dir_info(sctx, dir, dir_gen);
|
|
|
|
if (IS_ERR(odi))
|
|
|
|
return PTR_ERR(odi);
|
|
|
|
|
|
|
|
odi->gen = dir_gen;
|
|
|
|
}
|
2023-01-11 19:36:07 +08:00
|
|
|
|
2023-01-11 19:36:09 +08:00
|
|
|
odi->last_dir_index_offset = last_dir_index_offset;
|
|
|
|
odi->dir_high_seq_ino = max(odi->dir_high_seq_ino, dir_high_seq_ino);
|
2023-01-11 19:36:07 +08:00
|
|
|
|
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
static int is_waiting_for_move(struct send_ctx *sctx, u64 ino)
|
|
|
|
{
|
2014-02-19 22:31:44 +08:00
|
|
|
struct waiting_dir_move *entry = get_waiting_dir_move(sctx, ino);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
|
2014-02-19 22:31:44 +08:00
|
|
|
return entry != NULL;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
static int add_waiting_dir_move(struct send_ctx *sctx, u64 ino, bool orphanized)
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
{
|
|
|
|
struct rb_node **p = &sctx->waiting_dir_moves.rb_node;
|
|
|
|
struct rb_node *parent = NULL;
|
|
|
|
struct waiting_dir_move *entry, *dm;
|
|
|
|
|
2016-01-19 01:42:13 +08:00
|
|
|
dm = kmalloc(sizeof(*dm), GFP_KERNEL);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
if (!dm)
|
|
|
|
return -ENOMEM;
|
|
|
|
dm->ino = ino;
|
2014-02-19 22:31:44 +08:00
|
|
|
dm->rmdir_ino = 0;
|
2020-12-10 20:09:02 +08:00
|
|
|
dm->rmdir_gen = 0;
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
dm->orphanized = orphanized;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
|
|
|
entry = rb_entry(parent, struct waiting_dir_move, node);
|
|
|
|
if (ino < entry->ino) {
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
} else if (ino > entry->ino) {
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
} else {
|
|
|
|
kfree(dm);
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
rb_link_node(&dm->node, parent, p);
|
|
|
|
rb_insert_color(&dm->node, &sctx->waiting_dir_moves);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-02-19 22:31:44 +08:00
|
|
|
static struct waiting_dir_move *
|
|
|
|
get_waiting_dir_move(struct send_ctx *sctx, u64 ino)
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
{
|
|
|
|
struct rb_node *n = sctx->waiting_dir_moves.rb_node;
|
|
|
|
struct waiting_dir_move *entry;
|
|
|
|
|
|
|
|
while (n) {
|
|
|
|
entry = rb_entry(n, struct waiting_dir_move, node);
|
2014-02-19 22:31:44 +08:00
|
|
|
if (ino < entry->ino)
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
n = n->rb_left;
|
2014-02-19 22:31:44 +08:00
|
|
|
else if (ino > entry->ino)
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
n = n->rb_right;
|
2014-02-19 22:31:44 +08:00
|
|
|
else
|
|
|
|
return entry;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
}
|
2014-02-19 22:31:44 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_waiting_dir_move(struct send_ctx *sctx,
|
|
|
|
struct waiting_dir_move *dm)
|
|
|
|
{
|
|
|
|
if (!dm)
|
|
|
|
return;
|
|
|
|
rb_erase(&dm->node, &sctx->waiting_dir_moves);
|
|
|
|
kfree(dm);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
}
|
|
|
|
|
2014-03-19 22:20:54 +08:00
|
|
|
static int add_pending_dir_move(struct send_ctx *sctx,
|
|
|
|
u64 ino,
|
|
|
|
u64 ino_gen,
|
2014-03-28 04:14:01 +08:00
|
|
|
u64 parent_ino,
|
|
|
|
struct list_head *new_refs,
|
2015-03-01 06:29:22 +08:00
|
|
|
struct list_head *deleted_refs,
|
|
|
|
const bool is_orphan)
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
{
|
|
|
|
struct rb_node **p = &sctx->pending_dir_moves.rb_node;
|
|
|
|
struct rb_node *parent = NULL;
|
2014-03-22 06:30:44 +08:00
|
|
|
struct pending_dir_move *entry = NULL, *pm;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
struct recorded_ref *cur;
|
|
|
|
int exists = 0;
|
|
|
|
int ret;
|
|
|
|
|
2016-01-19 01:42:13 +08:00
|
|
|
pm = kmalloc(sizeof(*pm), GFP_KERNEL);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
if (!pm)
|
|
|
|
return -ENOMEM;
|
|
|
|
pm->parent_ino = parent_ino;
|
2014-03-19 22:20:54 +08:00
|
|
|
pm->ino = ino;
|
|
|
|
pm->gen = ino_gen;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
INIT_LIST_HEAD(&pm->list);
|
|
|
|
INIT_LIST_HEAD(&pm->update_refs);
|
|
|
|
RB_CLEAR_NODE(&pm->node);
|
|
|
|
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
|
|
|
entry = rb_entry(parent, struct pending_dir_move, node);
|
|
|
|
if (parent_ino < entry->parent_ino) {
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
} else if (parent_ino > entry->parent_ino) {
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
} else {
|
|
|
|
exists = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-03-28 04:14:01 +08:00
|
|
|
list_for_each_entry(cur, deleted_refs, list) {
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
ret = dup_ref(cur, &pm->update_refs);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2014-03-28 04:14:01 +08:00
|
|
|
list_for_each_entry(cur, new_refs, list) {
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
ret = dup_ref(cur, &pm->update_refs);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
ret = add_waiting_dir_move(sctx, pm->ino, is_orphan);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (exists) {
|
|
|
|
list_add_tail(&pm->list, &entry->list);
|
|
|
|
} else {
|
|
|
|
rb_link_node(&pm->node, parent, p);
|
|
|
|
rb_insert_color(&pm->node, &sctx->pending_dir_moves);
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
out:
|
|
|
|
if (ret) {
|
|
|
|
__free_recorded_refs(&pm->update_refs);
|
|
|
|
kfree(pm);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct pending_dir_move *get_pending_dir_moves(struct send_ctx *sctx,
|
|
|
|
u64 parent_ino)
|
|
|
|
{
|
|
|
|
struct rb_node *n = sctx->pending_dir_moves.rb_node;
|
|
|
|
struct pending_dir_move *entry;
|
|
|
|
|
|
|
|
while (n) {
|
|
|
|
entry = rb_entry(n, struct pending_dir_move, node);
|
|
|
|
if (parent_ino < entry->parent_ino)
|
|
|
|
n = n->rb_left;
|
|
|
|
else if (parent_ino > entry->parent_ino)
|
|
|
|
n = n->rb_right;
|
|
|
|
else
|
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
static int path_loop(struct send_ctx *sctx, struct fs_path *name,
|
|
|
|
u64 ino, u64 gen, u64 *ancestor_ino)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
u64 parent_inode = 0;
|
|
|
|
u64 parent_gen = 0;
|
|
|
|
u64 start_ino = ino;
|
|
|
|
|
|
|
|
*ancestor_ino = 0;
|
|
|
|
while (ino != BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
fs_path_reset(name);
|
|
|
|
|
2020-12-10 20:09:02 +08:00
|
|
|
if (is_waiting_for_rm(sctx, ino, gen))
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
break;
|
|
|
|
if (is_waiting_for_move(sctx, ino)) {
|
|
|
|
if (*ancestor_ino == 0)
|
|
|
|
*ancestor_ino = ino;
|
|
|
|
ret = get_first_ref(sctx->parent_root, ino,
|
|
|
|
&parent_inode, &parent_gen, name);
|
|
|
|
} else {
|
|
|
|
ret = __get_cur_name_and_parent(sctx, ino, gen,
|
|
|
|
&parent_inode,
|
|
|
|
&parent_gen, name);
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
if (parent_inode == start_ino) {
|
|
|
|
ret = 1;
|
|
|
|
if (*ancestor_ino == 0)
|
|
|
|
*ancestor_ino = ino;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ino = parent_inode;
|
|
|
|
gen = parent_gen;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
|
|
|
|
{
|
|
|
|
struct fs_path *from_path = NULL;
|
|
|
|
struct fs_path *to_path = NULL;
|
2014-02-16 21:43:11 +08:00
|
|
|
struct fs_path *name = NULL;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
u64 orig_progress = sctx->send_progress;
|
|
|
|
struct recorded_ref *cur;
|
2014-02-16 21:43:11 +08:00
|
|
|
u64 parent_ino, parent_gen;
|
2014-02-19 22:31:44 +08:00
|
|
|
struct waiting_dir_move *dm = NULL;
|
|
|
|
u64 rmdir_ino = 0;
|
2020-12-10 20:09:02 +08:00
|
|
|
u64 rmdir_gen;
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
u64 ancestor;
|
|
|
|
bool is_orphan;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
int ret;
|
|
|
|
|
2014-02-16 21:43:11 +08:00
|
|
|
name = fs_path_alloc();
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
from_path = fs_path_alloc();
|
2014-02-16 21:43:11 +08:00
|
|
|
if (!name || !from_path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
|
2014-02-19 22:31:44 +08:00
|
|
|
dm = get_waiting_dir_move(sctx, pm->ino);
|
|
|
|
ASSERT(dm);
|
|
|
|
rmdir_ino = dm->rmdir_ino;
|
2020-12-10 20:09:02 +08:00
|
|
|
rmdir_gen = dm->rmdir_gen;
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
is_orphan = dm->orphanized;
|
2014-02-19 22:31:44 +08:00
|
|
|
free_waiting_dir_move(sctx, dm);
|
2014-02-16 21:43:11 +08:00
|
|
|
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
if (is_orphan) {
|
2015-03-01 06:29:22 +08:00
|
|
|
ret = gen_unique_name(sctx, pm->ino,
|
|
|
|
pm->gen, from_path);
|
|
|
|
} else {
|
|
|
|
ret = get_first_ref(sctx->parent_root, pm->ino,
|
|
|
|
&parent_ino, &parent_gen, name);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = get_cur_path(sctx, parent_ino, parent_gen,
|
|
|
|
from_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = fs_path_add_path(from_path, name);
|
|
|
|
}
|
2014-03-23 01:15:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2014-02-16 21:43:11 +08:00
|
|
|
|
2014-03-28 04:14:01 +08:00
|
|
|
sctx->send_progress = sctx->cur_ino + 1;
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
ret = path_loop(sctx, name, pm->ino, pm->gen, &ancestor);
|
2016-06-18 00:13:36 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
if (ret) {
|
|
|
|
LIST_HEAD(deleted_refs);
|
|
|
|
ASSERT(ancestor > BTRFS_FIRST_FREE_OBJECTID);
|
|
|
|
ret = add_pending_dir_move(sctx, pm->ino, pm->gen, ancestor,
|
|
|
|
&pm->update_refs, &deleted_refs,
|
|
|
|
is_orphan);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (rmdir_ino) {
|
|
|
|
dm = get_waiting_dir_move(sctx, pm->ino);
|
|
|
|
ASSERT(dm);
|
|
|
|
dm->rmdir_ino = rmdir_ino;
|
2020-12-10 20:09:02 +08:00
|
|
|
dm->rmdir_gen = rmdir_gen;
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
}
|
|
|
|
goto out;
|
|
|
|
}
|
2014-03-23 01:15:24 +08:00
|
|
|
fs_path_reset(name);
|
|
|
|
to_path = name;
|
2014-02-16 21:43:11 +08:00
|
|
|
name = NULL;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
ret = get_cur_path(sctx, pm->ino, pm->gen, to_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = send_rename(sctx, from_path, to_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2014-02-19 22:31:44 +08:00
|
|
|
if (rmdir_ino) {
|
|
|
|
struct orphan_dir_info *odi;
|
2018-05-08 18:11:38 +08:00
|
|
|
u64 gen;
|
2014-02-19 22:31:44 +08:00
|
|
|
|
2020-12-10 20:09:02 +08:00
|
|
|
odi = get_orphan_dir_info(sctx, rmdir_ino, rmdir_gen);
|
2014-02-19 22:31:44 +08:00
|
|
|
if (!odi) {
|
|
|
|
/* already deleted */
|
|
|
|
goto finish;
|
|
|
|
}
|
2018-05-08 18:11:38 +08:00
|
|
|
gen = odi->gen;
|
|
|
|
|
2023-01-11 19:36:06 +08:00
|
|
|
ret = can_rmdir(sctx, rmdir_ino, gen);
|
2014-02-19 22:31:44 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (!ret)
|
|
|
|
goto finish;
|
|
|
|
|
|
|
|
name = fs_path_alloc();
|
|
|
|
if (!name) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2018-05-08 18:11:38 +08:00
|
|
|
ret = get_cur_path(sctx, rmdir_ino, gen, name);
|
2014-02-19 22:31:44 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = send_rmdir(sctx, name);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
finish:
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
ret = cache_dir_utimes(sctx, pm->ino, pm->gen);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After rename/move, need to update the utimes of both new parent(s)
|
|
|
|
* and old parent(s).
|
|
|
|
*/
|
|
|
|
list_for_each_entry(cur, &pm->update_refs, list) {
|
Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
During an incremental send, if we have delayed rename operations for inodes
that were children of directories which were removed in the send snapshot,
we can end up accessing incorrect items in a leaf or accessing beyond the
last item of the leaf due to issuing utimes operations for the removed
inodes. Consider the following example:
Parent snapshot:
. (ino 256)
|--- a/ (ino 257)
| |--- c/ (ino 262)
|
|--- b/ (ino 258)
| |--- d/ (ino 263)
|
|--- del/ (ino 261)
|--- x/ (ino 259)
|--- y/ (ino 260)
Send snapshot:
. (ino 256)
|--- a/ (ino 257)
|
|--- b/ (ino 258)
|
|--- c/ (ino 262)
| |--- y/ (ino 260)
|
|--- d/ (ino 263)
|--- x/ (ino 259)
1) When processing inodes 259 and 260, we end up delaying their rename
operations because their parents, inodes 263 and 262 respectively, were
not yet processed and therefore not yet renamed;
2) When processing inode 262, its rename operation is issued and right
after the rename operation for inode 260 is issued. However right after
issuing the rename operation for inode 260, at send.c:apply_dir_move(),
we issue utimes operations for all current and past parents of inode
260. This means we try to send a utimes operation for its old parent,
inode 261 (deleted in the send snapshot), which does not cause any
immediate and deterministic failure, because when the target inode is
not found in the send snapshot, the send.c:send_utimes() function
ignores it and uses the leaf region pointed to by path->slots[0],
which can be any unrelated item (belonging to other inode) or it can
be a region outside the leaf boundaries, if the leaf is full and
path->slots[0] matches the number of items in the leaf. So we end
up either successfully sending a utimes operation, which is fine
and irrelevant because the old parent (inode 261) will end up being
deleted later, or we end up doing an invalid memory access tha
crashes the kernel.
So fix this by making apply_dir_move() issue utimes operations only for
parents that still exist in the send snapshot. In a separate patch we
will make send_utimes() return an error (-ENOENT) if the given inode
does not exists in the send snapshot.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and better organized]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:50 +08:00
|
|
|
/*
|
|
|
|
* The parent inode might have been deleted in the send snapshot
|
|
|
|
*/
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_info(sctx->send_root, cur->dir, NULL);
|
Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
During an incremental send, if we have delayed rename operations for inodes
that were children of directories which were removed in the send snapshot,
we can end up accessing incorrect items in a leaf or accessing beyond the
last item of the leaf due to issuing utimes operations for the removed
inodes. Consider the following example:
Parent snapshot:
. (ino 256)
|--- a/ (ino 257)
| |--- c/ (ino 262)
|
|--- b/ (ino 258)
| |--- d/ (ino 263)
|
|--- del/ (ino 261)
|--- x/ (ino 259)
|--- y/ (ino 260)
Send snapshot:
. (ino 256)
|--- a/ (ino 257)
|
|--- b/ (ino 258)
|
|--- c/ (ino 262)
| |--- y/ (ino 260)
|
|--- d/ (ino 263)
|--- x/ (ino 259)
1) When processing inodes 259 and 260, we end up delaying their rename
operations because their parents, inodes 263 and 262 respectively, were
not yet processed and therefore not yet renamed;
2) When processing inode 262, its rename operation is issued and right
after the rename operation for inode 260 is issued. However right after
issuing the rename operation for inode 260, at send.c:apply_dir_move(),
we issue utimes operations for all current and past parents of inode
260. This means we try to send a utimes operation for its old parent,
inode 261 (deleted in the send snapshot), which does not cause any
immediate and deterministic failure, because when the target inode is
not found in the send snapshot, the send.c:send_utimes() function
ignores it and uses the leaf region pointed to by path->slots[0],
which can be any unrelated item (belonging to other inode) or it can
be a region outside the leaf boundaries, if the leaf is full and
path->slots[0] matches the number of items in the leaf. So we end
up either successfully sending a utimes operation, which is fine
and irrelevant because the old parent (inode 261) will end up being
deleted later, or we end up doing an invalid memory access tha
crashes the kernel.
So fix this by making apply_dir_move() issue utimes operations only for
parents that still exist in the send snapshot. In a separate patch we
will make send_utimes() return an error (-ENOENT) if the given inode
does not exists in the send snapshot.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and better organized]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:50 +08:00
|
|
|
if (ret == -ENOENT) {
|
|
|
|
ret = 0;
|
2014-02-19 22:31:44 +08:00
|
|
|
continue;
|
Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
During an incremental send, if we have delayed rename operations for inodes
that were children of directories which were removed in the send snapshot,
we can end up accessing incorrect items in a leaf or accessing beyond the
last item of the leaf due to issuing utimes operations for the removed
inodes. Consider the following example:
Parent snapshot:
. (ino 256)
|--- a/ (ino 257)
| |--- c/ (ino 262)
|
|--- b/ (ino 258)
| |--- d/ (ino 263)
|
|--- del/ (ino 261)
|--- x/ (ino 259)
|--- y/ (ino 260)
Send snapshot:
. (ino 256)
|--- a/ (ino 257)
|
|--- b/ (ino 258)
|
|--- c/ (ino 262)
| |--- y/ (ino 260)
|
|--- d/ (ino 263)
|--- x/ (ino 259)
1) When processing inodes 259 and 260, we end up delaying their rename
operations because their parents, inodes 263 and 262 respectively, were
not yet processed and therefore not yet renamed;
2) When processing inode 262, its rename operation is issued and right
after the rename operation for inode 260 is issued. However right after
issuing the rename operation for inode 260, at send.c:apply_dir_move(),
we issue utimes operations for all current and past parents of inode
260. This means we try to send a utimes operation for its old parent,
inode 261 (deleted in the send snapshot), which does not cause any
immediate and deterministic failure, because when the target inode is
not found in the send snapshot, the send.c:send_utimes() function
ignores it and uses the leaf region pointed to by path->slots[0],
which can be any unrelated item (belonging to other inode) or it can
be a region outside the leaf boundaries, if the leaf is full and
path->slots[0] matches the number of items in the leaf. So we end
up either successfully sending a utimes operation, which is fine
and irrelevant because the old parent (inode 261) will end up being
deleted later, or we end up doing an invalid memory access tha
crashes the kernel.
So fix this by making apply_dir_move() issue utimes operations only for
parents that still exist in the send snapshot. In a separate patch we
will make send_utimes() return an error (-ENOENT) if the given inode
does not exists in the send snapshot.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and better organized]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:50 +08:00
|
|
|
}
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
ret = cache_dir_utimes(sctx, cur->dir, cur->dir_gen);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
2014-02-16 21:43:11 +08:00
|
|
|
fs_path_free(name);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
fs_path_free(from_path);
|
|
|
|
fs_path_free(to_path);
|
|
|
|
sctx->send_progress = orig_progress;
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_pending_move(struct send_ctx *sctx, struct pending_dir_move *m)
|
|
|
|
{
|
|
|
|
if (!list_empty(&m->list))
|
|
|
|
list_del(&m->list);
|
|
|
|
if (!RB_EMPTY_NODE(&m->node))
|
|
|
|
rb_erase(&m->node, &sctx->pending_dir_moves);
|
|
|
|
__free_recorded_refs(&m->update_refs);
|
|
|
|
kfree(m);
|
|
|
|
}
|
|
|
|
|
Btrfs: send, fix infinite loop due to directory rename dependencies
When doing an incremental send, due to the need of delaying directory move
(rename) operations we can end up in infinite loop at
apply_children_dir_moves().
An example scenario that triggers this problem is described below, where
directory names correspond to the numbers of their respective inodes.
Parent snapshot:
.
|--- 261/
|--- 271/
|--- 266/
|--- 259/
|--- 260/
| |--- 267
|
|--- 264/
| |--- 258/
| |--- 257/
|
|--- 265/
|--- 268/
|--- 269/
| |--- 262/
|
|--- 270/
|--- 272/
| |--- 263/
| |--- 275/
|
|--- 274/
|--- 273/
Send snapshot:
.
|-- 275/
|-- 274/
|-- 273/
|-- 262/
|-- 269/
|-- 258/
|-- 271/
|-- 268/
|-- 267/
|-- 270/
|-- 259/
| |-- 265/
|
|-- 272/
|-- 257/
|-- 260/
|-- 264/
|-- 263/
|-- 261/
|-- 266/
When processing inode 257 we delay its move (rename) operation because its
new parent in the send snapshot, inode 272, was not yet processed. Then
when processing inode 272, we delay the move operation for that inode
because inode 274 is its ancestor in the send snapshot. Finally we delay
the move operation for inode 274 when processing it because inode 275 is
its new parent in the send snapshot and was not yet moved.
When finishing processing inode 275, we start to do the move operations
that were previously delayed (at apply_children_dir_moves()), resulting in
the following iterations:
1) We issue the move operation for inode 274;
2) Because inode 262 depended on the move operation of inode 274 (it was
delayed because 274 is its ancestor in the send snapshot), we issue the
move operation for inode 262;
3) We issue the move operation for inode 272, because it was delayed by
inode 274 too (ancestor of 272 in the send snapshot);
4) We issue the move operation for inode 269 (it was delayed by 262);
5) We issue the move operation for inode 257 (it was delayed by 272);
6) We issue the move operation for inode 260 (it was delayed by 272);
7) We issue the move operation for inode 258 (it was delayed by 269);
8) We issue the move operation for inode 264 (it was delayed by 257);
9) We issue the move operation for inode 271 (it was delayed by 258);
10) We issue the move operation for inode 263 (it was delayed by 264);
11) We issue the move operation for inode 268 (it was delayed by 271);
12) We verify if we can issue the move operation for inode 270 (it was
delayed by 271). We detect a path loop in the current state, because
inode 267 needs to be moved first before we can issue the move
operation for inode 270. So we delay again the move operation for
inode 270, this time we will attempt to do it after inode 267 is
moved;
13) We issue the move operation for inode 261 (it was delayed by 263);
14) We verify if we can issue the move operation for inode 266 (it was
delayed by 263). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12);
15) We issue the move operation for inode 267 (it was delayed by 268);
16) We verify if we can issue the move operation for inode 266 (it was
delayed by 270). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12). So here we added
again the same delayed move operation that we added in step 14;
17) We attempt again to see if we can issue the move operation for inode
266, and as in step 16, we realize we can not due to a path loop in
the current state due to a dependency on inode 270. Again we delay
inode's 266 rename to happen after inode's 270 move operation, adding
the same dependency to the empty stack that we did in steps 14 and 16.
The next iteration will pick the same move dependency on the stack
(the only entry) and realize again there is still a path loop and then
again the same dependency to the stack, over and over, resulting in
an infinite loop.
So fix this by preventing adding the same move dependency entries to the
stack by removing each pending move record from the red black tree of
pending moves. This way the next call to get_pending_dir_moves() will
not return anything for the current parent inode.
A test case for fstests, with this reproducer, follows soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Wrote changelog with example and more clear explanation]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-15 02:32:37 +08:00
|
|
|
static void tail_append_pending_moves(struct send_ctx *sctx,
|
|
|
|
struct pending_dir_move *moves,
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
struct list_head *stack)
|
|
|
|
{
|
|
|
|
if (list_empty(&moves->list)) {
|
|
|
|
list_add_tail(&moves->list, stack);
|
|
|
|
} else {
|
|
|
|
LIST_HEAD(list);
|
|
|
|
list_splice_init(&moves->list, &list);
|
|
|
|
list_add_tail(&moves->list, stack);
|
|
|
|
list_splice_tail(&list, stack);
|
|
|
|
}
|
Btrfs: send, fix infinite loop due to directory rename dependencies
When doing an incremental send, due to the need of delaying directory move
(rename) operations we can end up in infinite loop at
apply_children_dir_moves().
An example scenario that triggers this problem is described below, where
directory names correspond to the numbers of their respective inodes.
Parent snapshot:
.
|--- 261/
|--- 271/
|--- 266/
|--- 259/
|--- 260/
| |--- 267
|
|--- 264/
| |--- 258/
| |--- 257/
|
|--- 265/
|--- 268/
|--- 269/
| |--- 262/
|
|--- 270/
|--- 272/
| |--- 263/
| |--- 275/
|
|--- 274/
|--- 273/
Send snapshot:
.
|-- 275/
|-- 274/
|-- 273/
|-- 262/
|-- 269/
|-- 258/
|-- 271/
|-- 268/
|-- 267/
|-- 270/
|-- 259/
| |-- 265/
|
|-- 272/
|-- 257/
|-- 260/
|-- 264/
|-- 263/
|-- 261/
|-- 266/
When processing inode 257 we delay its move (rename) operation because its
new parent in the send snapshot, inode 272, was not yet processed. Then
when processing inode 272, we delay the move operation for that inode
because inode 274 is its ancestor in the send snapshot. Finally we delay
the move operation for inode 274 when processing it because inode 275 is
its new parent in the send snapshot and was not yet moved.
When finishing processing inode 275, we start to do the move operations
that were previously delayed (at apply_children_dir_moves()), resulting in
the following iterations:
1) We issue the move operation for inode 274;
2) Because inode 262 depended on the move operation of inode 274 (it was
delayed because 274 is its ancestor in the send snapshot), we issue the
move operation for inode 262;
3) We issue the move operation for inode 272, because it was delayed by
inode 274 too (ancestor of 272 in the send snapshot);
4) We issue the move operation for inode 269 (it was delayed by 262);
5) We issue the move operation for inode 257 (it was delayed by 272);
6) We issue the move operation for inode 260 (it was delayed by 272);
7) We issue the move operation for inode 258 (it was delayed by 269);
8) We issue the move operation for inode 264 (it was delayed by 257);
9) We issue the move operation for inode 271 (it was delayed by 258);
10) We issue the move operation for inode 263 (it was delayed by 264);
11) We issue the move operation for inode 268 (it was delayed by 271);
12) We verify if we can issue the move operation for inode 270 (it was
delayed by 271). We detect a path loop in the current state, because
inode 267 needs to be moved first before we can issue the move
operation for inode 270. So we delay again the move operation for
inode 270, this time we will attempt to do it after inode 267 is
moved;
13) We issue the move operation for inode 261 (it was delayed by 263);
14) We verify if we can issue the move operation for inode 266 (it was
delayed by 263). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12);
15) We issue the move operation for inode 267 (it was delayed by 268);
16) We verify if we can issue the move operation for inode 266 (it was
delayed by 270). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12). So here we added
again the same delayed move operation that we added in step 14;
17) We attempt again to see if we can issue the move operation for inode
266, and as in step 16, we realize we can not due to a path loop in
the current state due to a dependency on inode 270. Again we delay
inode's 266 rename to happen after inode's 270 move operation, adding
the same dependency to the empty stack that we did in steps 14 and 16.
The next iteration will pick the same move dependency on the stack
(the only entry) and realize again there is still a path loop and then
again the same dependency to the stack, over and over, resulting in
an infinite loop.
So fix this by preventing adding the same move dependency entries to the
stack by removing each pending move record from the red black tree of
pending moves. This way the next call to get_pending_dir_moves() will
not return anything for the current parent inode.
A test case for fstests, with this reproducer, follows soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Wrote changelog with example and more clear explanation]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-15 02:32:37 +08:00
|
|
|
if (!RB_EMPTY_NODE(&moves->node)) {
|
|
|
|
rb_erase(&moves->node, &sctx->pending_dir_moves);
|
|
|
|
RB_CLEAR_NODE(&moves->node);
|
|
|
|
}
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int apply_children_dir_moves(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
struct pending_dir_move *pm;
|
2023-08-10 11:00:22 +08:00
|
|
|
LIST_HEAD(stack);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
u64 parent_ino = sctx->cur_ino;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
pm = get_pending_dir_moves(sctx, parent_ino);
|
|
|
|
if (!pm)
|
|
|
|
return 0;
|
|
|
|
|
Btrfs: send, fix infinite loop due to directory rename dependencies
When doing an incremental send, due to the need of delaying directory move
(rename) operations we can end up in infinite loop at
apply_children_dir_moves().
An example scenario that triggers this problem is described below, where
directory names correspond to the numbers of their respective inodes.
Parent snapshot:
.
|--- 261/
|--- 271/
|--- 266/
|--- 259/
|--- 260/
| |--- 267
|
|--- 264/
| |--- 258/
| |--- 257/
|
|--- 265/
|--- 268/
|--- 269/
| |--- 262/
|
|--- 270/
|--- 272/
| |--- 263/
| |--- 275/
|
|--- 274/
|--- 273/
Send snapshot:
.
|-- 275/
|-- 274/
|-- 273/
|-- 262/
|-- 269/
|-- 258/
|-- 271/
|-- 268/
|-- 267/
|-- 270/
|-- 259/
| |-- 265/
|
|-- 272/
|-- 257/
|-- 260/
|-- 264/
|-- 263/
|-- 261/
|-- 266/
When processing inode 257 we delay its move (rename) operation because its
new parent in the send snapshot, inode 272, was not yet processed. Then
when processing inode 272, we delay the move operation for that inode
because inode 274 is its ancestor in the send snapshot. Finally we delay
the move operation for inode 274 when processing it because inode 275 is
its new parent in the send snapshot and was not yet moved.
When finishing processing inode 275, we start to do the move operations
that were previously delayed (at apply_children_dir_moves()), resulting in
the following iterations:
1) We issue the move operation for inode 274;
2) Because inode 262 depended on the move operation of inode 274 (it was
delayed because 274 is its ancestor in the send snapshot), we issue the
move operation for inode 262;
3) We issue the move operation for inode 272, because it was delayed by
inode 274 too (ancestor of 272 in the send snapshot);
4) We issue the move operation for inode 269 (it was delayed by 262);
5) We issue the move operation for inode 257 (it was delayed by 272);
6) We issue the move operation for inode 260 (it was delayed by 272);
7) We issue the move operation for inode 258 (it was delayed by 269);
8) We issue the move operation for inode 264 (it was delayed by 257);
9) We issue the move operation for inode 271 (it was delayed by 258);
10) We issue the move operation for inode 263 (it was delayed by 264);
11) We issue the move operation for inode 268 (it was delayed by 271);
12) We verify if we can issue the move operation for inode 270 (it was
delayed by 271). We detect a path loop in the current state, because
inode 267 needs to be moved first before we can issue the move
operation for inode 270. So we delay again the move operation for
inode 270, this time we will attempt to do it after inode 267 is
moved;
13) We issue the move operation for inode 261 (it was delayed by 263);
14) We verify if we can issue the move operation for inode 266 (it was
delayed by 263). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12);
15) We issue the move operation for inode 267 (it was delayed by 268);
16) We verify if we can issue the move operation for inode 266 (it was
delayed by 270). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12). So here we added
again the same delayed move operation that we added in step 14;
17) We attempt again to see if we can issue the move operation for inode
266, and as in step 16, we realize we can not due to a path loop in
the current state due to a dependency on inode 270. Again we delay
inode's 266 rename to happen after inode's 270 move operation, adding
the same dependency to the empty stack that we did in steps 14 and 16.
The next iteration will pick the same move dependency on the stack
(the only entry) and realize again there is still a path loop and then
again the same dependency to the stack, over and over, resulting in
an infinite loop.
So fix this by preventing adding the same move dependency entries to the
stack by removing each pending move record from the red black tree of
pending moves. This way the next call to get_pending_dir_moves() will
not return anything for the current parent inode.
A test case for fstests, with this reproducer, follows soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Wrote changelog with example and more clear explanation]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-15 02:32:37 +08:00
|
|
|
tail_append_pending_moves(sctx, pm, &stack);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
|
|
|
|
while (!list_empty(&stack)) {
|
|
|
|
pm = list_first_entry(&stack, struct pending_dir_move, list);
|
|
|
|
parent_ino = pm->ino;
|
|
|
|
ret = apply_dir_move(sctx, pm);
|
|
|
|
free_pending_move(sctx, pm);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
pm = get_pending_dir_moves(sctx, parent_ino);
|
|
|
|
if (pm)
|
Btrfs: send, fix infinite loop due to directory rename dependencies
When doing an incremental send, due to the need of delaying directory move
(rename) operations we can end up in infinite loop at
apply_children_dir_moves().
An example scenario that triggers this problem is described below, where
directory names correspond to the numbers of their respective inodes.
Parent snapshot:
.
|--- 261/
|--- 271/
|--- 266/
|--- 259/
|--- 260/
| |--- 267
|
|--- 264/
| |--- 258/
| |--- 257/
|
|--- 265/
|--- 268/
|--- 269/
| |--- 262/
|
|--- 270/
|--- 272/
| |--- 263/
| |--- 275/
|
|--- 274/
|--- 273/
Send snapshot:
.
|-- 275/
|-- 274/
|-- 273/
|-- 262/
|-- 269/
|-- 258/
|-- 271/
|-- 268/
|-- 267/
|-- 270/
|-- 259/
| |-- 265/
|
|-- 272/
|-- 257/
|-- 260/
|-- 264/
|-- 263/
|-- 261/
|-- 266/
When processing inode 257 we delay its move (rename) operation because its
new parent in the send snapshot, inode 272, was not yet processed. Then
when processing inode 272, we delay the move operation for that inode
because inode 274 is its ancestor in the send snapshot. Finally we delay
the move operation for inode 274 when processing it because inode 275 is
its new parent in the send snapshot and was not yet moved.
When finishing processing inode 275, we start to do the move operations
that were previously delayed (at apply_children_dir_moves()), resulting in
the following iterations:
1) We issue the move operation for inode 274;
2) Because inode 262 depended on the move operation of inode 274 (it was
delayed because 274 is its ancestor in the send snapshot), we issue the
move operation for inode 262;
3) We issue the move operation for inode 272, because it was delayed by
inode 274 too (ancestor of 272 in the send snapshot);
4) We issue the move operation for inode 269 (it was delayed by 262);
5) We issue the move operation for inode 257 (it was delayed by 272);
6) We issue the move operation for inode 260 (it was delayed by 272);
7) We issue the move operation for inode 258 (it was delayed by 269);
8) We issue the move operation for inode 264 (it was delayed by 257);
9) We issue the move operation for inode 271 (it was delayed by 258);
10) We issue the move operation for inode 263 (it was delayed by 264);
11) We issue the move operation for inode 268 (it was delayed by 271);
12) We verify if we can issue the move operation for inode 270 (it was
delayed by 271). We detect a path loop in the current state, because
inode 267 needs to be moved first before we can issue the move
operation for inode 270. So we delay again the move operation for
inode 270, this time we will attempt to do it after inode 267 is
moved;
13) We issue the move operation for inode 261 (it was delayed by 263);
14) We verify if we can issue the move operation for inode 266 (it was
delayed by 263). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12);
15) We issue the move operation for inode 267 (it was delayed by 268);
16) We verify if we can issue the move operation for inode 266 (it was
delayed by 270). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12). So here we added
again the same delayed move operation that we added in step 14;
17) We attempt again to see if we can issue the move operation for inode
266, and as in step 16, we realize we can not due to a path loop in
the current state due to a dependency on inode 270. Again we delay
inode's 266 rename to happen after inode's 270 move operation, adding
the same dependency to the empty stack that we did in steps 14 and 16.
The next iteration will pick the same move dependency on the stack
(the only entry) and realize again there is still a path loop and then
again the same dependency to the stack, over and over, resulting in
an infinite loop.
So fix this by preventing adding the same move dependency entries to the
stack by removing each pending move record from the red black tree of
pending moves. This way the next call to get_pending_dir_moves() will
not return anything for the current parent inode.
A test case for fstests, with this reproducer, follows soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Wrote changelog with example and more clear explanation]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-15 02:32:37 +08:00
|
|
|
tail_append_pending_moves(sctx, pm, &stack);
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out:
|
|
|
|
while (!list_empty(&stack)) {
|
|
|
|
pm = list_first_entry(&stack, struct pending_dir_move, list);
|
|
|
|
free_pending_move(sctx, pm);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2015-03-01 06:29:22 +08:00
|
|
|
/*
|
|
|
|
* We might need to delay a directory rename even when no ancestor directory
|
|
|
|
* (in the send root) with a higher inode number than ours (sctx->cur_ino) was
|
|
|
|
* renamed. This happens when we rename a directory to the old name (the name
|
|
|
|
* in the parent root) of some other unrelated directory that got its rename
|
|
|
|
* delayed due to some ancestor with higher number that got renamed.
|
|
|
|
*
|
|
|
|
* Example:
|
|
|
|
*
|
|
|
|
* Parent snapshot:
|
|
|
|
* . (ino 256)
|
|
|
|
* |---- a/ (ino 257)
|
|
|
|
* | |---- file (ino 260)
|
|
|
|
* |
|
|
|
|
* |---- b/ (ino 258)
|
|
|
|
* |---- c/ (ino 259)
|
|
|
|
*
|
|
|
|
* Send snapshot:
|
|
|
|
* . (ino 256)
|
|
|
|
* |---- a/ (ino 258)
|
|
|
|
* |---- x/ (ino 259)
|
|
|
|
* |---- y/ (ino 257)
|
|
|
|
* |----- file (ino 260)
|
|
|
|
*
|
|
|
|
* Here we can not rename 258 from 'b' to 'a' without the rename of inode 257
|
|
|
|
* from 'a' to 'x/y' happening first, which in turn depends on the rename of
|
|
|
|
* inode 259 from 'c' to 'x'. So the order of rename commands the send stream
|
|
|
|
* must issue is:
|
|
|
|
*
|
|
|
|
* 1 - rename 259 from 'c' to 'x'
|
|
|
|
* 2 - rename 257 from 'a' to 'x/y'
|
|
|
|
* 3 - rename 258 from 'b' to 'a'
|
|
|
|
*
|
|
|
|
* Returns 1 if the rename of sctx->cur_ino needs to be delayed, 0 if it can
|
|
|
|
* be done right away and < 0 on error.
|
|
|
|
*/
|
|
|
|
static int wait_for_dest_dir_move(struct send_ctx *sctx,
|
|
|
|
struct recorded_ref *parent_ref,
|
|
|
|
const bool is_orphan)
|
|
|
|
{
|
2016-06-23 06:54:24 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->parent_root->fs_info;
|
2015-03-01 06:29:22 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key di_key;
|
|
|
|
struct btrfs_dir_item *di;
|
|
|
|
u64 left_gen;
|
|
|
|
u64 right_gen;
|
|
|
|
int ret = 0;
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
struct waiting_dir_move *wdm;
|
2015-03-01 06:29:22 +08:00
|
|
|
|
|
|
|
if (RB_EMPTY_ROOT(&sctx->waiting_dir_moves))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = parent_ref->dir;
|
|
|
|
key.type = BTRFS_DIR_ITEM_KEY;
|
|
|
|
key.offset = btrfs_name_hash(parent_ref->name, parent_ref->name_len);
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, sctx->parent_root, &key, path, 0, 0);
|
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (ret > 0) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
di = btrfs_match_dir_item_name(fs_info, path, parent_ref->name,
|
|
|
|
parent_ref->name_len);
|
2015-03-01 06:29:22 +08:00
|
|
|
if (!di) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* di_key.objectid has the number of the inode that has a dentry in the
|
|
|
|
* parent directory with the same name that sctx->cur_ino is being
|
|
|
|
* renamed to. We need to check if that inode is in the send root as
|
|
|
|
* well and if it is currently marked as an inode with a pending rename,
|
|
|
|
* if it is, we need to delay the rename of sctx->cur_ino as well, so
|
|
|
|
* that it happens after that other inode is renamed.
|
|
|
|
*/
|
|
|
|
btrfs_dir_item_key_to_cpu(path->nodes[0], di, &di_key);
|
|
|
|
if (di_key.type != BTRFS_INODE_ITEM_KEY) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(sctx->parent_root, di_key.objectid, &left_gen);
|
2015-03-01 06:29:22 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(sctx->send_root, di_key.objectid, &right_gen);
|
2015-03-01 06:29:22 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
if (ret == -ENOENT)
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Different inode, no need to delay the rename of sctx->cur_ino */
|
|
|
|
if (right_gen != left_gen) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
wdm = get_waiting_dir_move(sctx, di_key.objectid);
|
|
|
|
if (wdm && !wdm->orphanized) {
|
2015-03-01 06:29:22 +08:00
|
|
|
ret = add_pending_dir_move(sctx,
|
|
|
|
sctx->cur_ino,
|
|
|
|
sctx->cur_inode_gen,
|
|
|
|
di_key.objectid,
|
|
|
|
&sctx->new_refs,
|
|
|
|
&sctx->deleted_refs,
|
|
|
|
is_orphan);
|
|
|
|
if (!ret)
|
|
|
|
ret = 1;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2015-03-28 01:50:45 +08:00
|
|
|
/*
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
* Check if inode ino2, or any of its ancestors, is inode ino1.
|
|
|
|
* Return 1 if true, 0 if false and < 0 on error.
|
|
|
|
*/
|
|
|
|
static int check_ino_in_path(struct btrfs_root *root,
|
|
|
|
const u64 ino1,
|
|
|
|
const u64 ino1_gen,
|
|
|
|
const u64 ino2,
|
|
|
|
const u64 ino2_gen,
|
|
|
|
struct fs_path *fs_path)
|
|
|
|
{
|
|
|
|
u64 ino = ino2;
|
|
|
|
|
|
|
|
if (ino1 == ino2)
|
|
|
|
return ino1_gen == ino2_gen;
|
|
|
|
|
|
|
|
while (ino > BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
u64 parent;
|
|
|
|
u64 parent_gen;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
fs_path_reset(fs_path);
|
|
|
|
ret = get_first_ref(root, ino, &parent, &parent_gen, fs_path);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
if (parent == ino1)
|
|
|
|
return parent_gen == ino1_gen;
|
|
|
|
ino = parent;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2022-03-09 21:50:45 +08:00
|
|
|
* Check if inode ino1 is an ancestor of inode ino2 in the given root for any
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
* possible path (in case ino2 is not a directory and has multiple hard links).
|
2015-03-28 01:50:45 +08:00
|
|
|
* Return 1 if true, 0 if false and < 0 on error.
|
|
|
|
*/
|
|
|
|
static int is_ancestor(struct btrfs_root *root,
|
|
|
|
const u64 ino1,
|
|
|
|
const u64 ino1_gen,
|
|
|
|
const u64 ino2,
|
|
|
|
struct fs_path *fs_path)
|
|
|
|
{
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
bool free_fs_path = false;
|
Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:
Parent snapshot
. (ino 256)
|
|--- f1 (ino 257)
|--- f2 (ino 258)
|--- f3 (ino 259)
Send snapshot
. (ino 256)
|
|--- f2 (ino 257)
|--- f3 (ino 258)
|--- f4 (ino 259)
|--- f5 (ino 258)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, inode 258 is orphanized (renamed to
"o258-7-0"), because its current reference has the same name as the
new reference for inode 257;
2) When processing inode 258, we iterate over all its new references,
which have the names "f3" and "f5". The first iteration sees name
"f5" and renames the inode from its orphan name ("o258-7-0") to
"f5", while the second iteration sees the name "f3" and, incorrectly,
issues a link operation with a target name matching the orphan name,
which no longer exists. The first iteration had reset the current
valid path of the inode to "f5", but in the second iteration we lost
it because we found another inode, with a higher number of 259, which
has a reference named "f3" as well, so we orphanized inode 259 and
recomputed the current valid path of inode 258 to its old orphan
name because inode 259 could be an ancestor of inode 258 and therefore
the current valid path could contain the pre-orphanization name of
inode 259. However in this case inode 259 is not an ancestor of inode
258 so the current valid path should not be recomputed.
This makes the receiver fail with the following error:
ERROR: link f3 -> o258-7-0 failed: No such file or directory
So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-07 18:41:29 +08:00
|
|
|
int ret = 0;
|
2022-03-09 21:50:45 +08:00
|
|
|
int iter_ret = 0;
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
struct btrfs_path *path = NULL;
|
|
|
|
struct btrfs_key key;
|
Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:
Parent snapshot
. (ino 256)
|
|--- f1 (ino 257)
|--- f2 (ino 258)
|--- f3 (ino 259)
Send snapshot
. (ino 256)
|
|--- f2 (ino 257)
|--- f3 (ino 258)
|--- f4 (ino 259)
|--- f5 (ino 258)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, inode 258 is orphanized (renamed to
"o258-7-0"), because its current reference has the same name as the
new reference for inode 257;
2) When processing inode 258, we iterate over all its new references,
which have the names "f3" and "f5". The first iteration sees name
"f5" and renames the inode from its orphan name ("o258-7-0") to
"f5", while the second iteration sees the name "f3" and, incorrectly,
issues a link operation with a target name matching the orphan name,
which no longer exists. The first iteration had reset the current
valid path of the inode to "f5", but in the second iteration we lost
it because we found another inode, with a higher number of 259, which
has a reference named "f3" as well, so we orphanized inode 259 and
recomputed the current valid path of inode 258 to its old orphan
name because inode 259 could be an ancestor of inode 258 and therefore
the current valid path could contain the pre-orphanization name of
inode 259. However in this case inode 259 is not an ancestor of inode
258 so the current valid path should not be recomputed.
This makes the receiver fail with the following error:
ERROR: link f3 -> o258-7-0 failed: No such file or directory
So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-07 18:41:29 +08:00
|
|
|
|
|
|
|
if (!fs_path) {
|
|
|
|
fs_path = fs_path_alloc();
|
|
|
|
if (!fs_path)
|
|
|
|
return -ENOMEM;
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
free_fs_path = true;
|
Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:
Parent snapshot
. (ino 256)
|
|--- f1 (ino 257)
|--- f2 (ino 258)
|--- f3 (ino 259)
Send snapshot
. (ino 256)
|
|--- f2 (ino 257)
|--- f3 (ino 258)
|--- f4 (ino 259)
|--- f5 (ino 258)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, inode 258 is orphanized (renamed to
"o258-7-0"), because its current reference has the same name as the
new reference for inode 257;
2) When processing inode 258, we iterate over all its new references,
which have the names "f3" and "f5". The first iteration sees name
"f5" and renames the inode from its orphan name ("o258-7-0") to
"f5", while the second iteration sees the name "f3" and, incorrectly,
issues a link operation with a target name matching the orphan name,
which no longer exists. The first iteration had reset the current
valid path of the inode to "f5", but in the second iteration we lost
it because we found another inode, with a higher number of 259, which
has a reference named "f3" as well, so we orphanized inode 259 and
recomputed the current valid path of inode 258 to its old orphan
name because inode 259 could be an ancestor of inode 258 and therefore
the current valid path could contain the pre-orphanization name of
inode 259. However in this case inode 259 is not an ancestor of inode
258 so the current valid path should not be recomputed.
This makes the receiver fail with the following error:
ERROR: link f3 -> o258-7-0 failed: No such file or directory
So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-07 18:41:29 +08:00
|
|
|
}
|
2015-03-28 01:50:45 +08:00
|
|
|
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2015-03-28 01:50:45 +08:00
|
|
|
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
key.objectid = ino2;
|
|
|
|
key.type = BTRFS_INODE_REF_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
|
2022-03-09 21:50:45 +08:00
|
|
|
btrfs_for_each_slot(root, &key, &key, path, iter_ret) {
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
int slot = path->slots[0];
|
|
|
|
u32 cur_offset = 0;
|
|
|
|
u32 item_size;
|
|
|
|
|
|
|
|
if (key.objectid != ino2)
|
|
|
|
break;
|
|
|
|
if (key.type != BTRFS_INODE_REF_KEY &&
|
|
|
|
key.type != BTRFS_INODE_EXTREF_KEY)
|
|
|
|
break;
|
|
|
|
|
2021-10-22 02:58:35 +08:00
|
|
|
item_size = btrfs_item_size(leaf, slot);
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
while (cur_offset < item_size) {
|
|
|
|
u64 parent;
|
|
|
|
u64 parent_gen;
|
|
|
|
|
|
|
|
if (key.type == BTRFS_INODE_EXTREF_KEY) {
|
|
|
|
unsigned long ptr;
|
|
|
|
struct btrfs_inode_extref *extref;
|
|
|
|
|
|
|
|
ptr = btrfs_item_ptr_offset(leaf, slot);
|
|
|
|
extref = (struct btrfs_inode_extref *)
|
|
|
|
(ptr + cur_offset);
|
|
|
|
parent = btrfs_inode_extref_parent(leaf,
|
|
|
|
extref);
|
|
|
|
cur_offset += sizeof(*extref);
|
|
|
|
cur_offset += btrfs_inode_extref_name_len(leaf,
|
|
|
|
extref);
|
|
|
|
} else {
|
|
|
|
parent = key.offset;
|
|
|
|
cur_offset = item_size;
|
|
|
|
}
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(root, parent, &parent_gen);
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = check_ino_in_path(root, ino1, ino1_gen,
|
|
|
|
parent, parent_gen, fs_path);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
2015-03-28 01:50:45 +08:00
|
|
|
}
|
|
|
|
}
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
ret = 0;
|
2022-03-09 21:50:45 +08:00
|
|
|
if (iter_ret < 0)
|
|
|
|
ret = iter_ret;
|
|
|
|
|
|
|
|
out:
|
Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-17 09:54:00 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
if (free_fs_path)
|
Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:
Parent snapshot
. (ino 256)
|
|--- f1 (ino 257)
|--- f2 (ino 258)
|--- f3 (ino 259)
Send snapshot
. (ino 256)
|
|--- f2 (ino 257)
|--- f3 (ino 258)
|--- f4 (ino 259)
|--- f5 (ino 258)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, inode 258 is orphanized (renamed to
"o258-7-0"), because its current reference has the same name as the
new reference for inode 257;
2) When processing inode 258, we iterate over all its new references,
which have the names "f3" and "f5". The first iteration sees name
"f5" and renames the inode from its orphan name ("o258-7-0") to
"f5", while the second iteration sees the name "f3" and, incorrectly,
issues a link operation with a target name matching the orphan name,
which no longer exists. The first iteration had reset the current
valid path of the inode to "f5", but in the second iteration we lost
it because we found another inode, with a higher number of 259, which
has a reference named "f3" as well, so we orphanized inode 259 and
recomputed the current valid path of inode 258 to its old orphan
name because inode 259 could be an ancestor of inode 258 and therefore
the current valid path could contain the pre-orphanization name of
inode 259. However in this case inode 259 is not an ancestor of inode
258 so the current valid path should not be recomputed.
This makes the receiver fail with the following error:
ERROR: link f3 -> o258-7-0 failed: No such file or directory
So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-07 18:41:29 +08:00
|
|
|
fs_path_free(fs_path);
|
|
|
|
return ret;
|
2015-03-28 01:50:45 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
static int wait_for_parent_move(struct send_ctx *sctx,
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
struct recorded_ref *parent_ref,
|
|
|
|
const bool is_orphan)
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
{
|
2014-03-28 04:14:01 +08:00
|
|
|
int ret = 0;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
u64 ino = parent_ref->dir;
|
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.
Here follows an example where this happens.
Parent snapshot:
. (ino 256, gen 3)
|--- dir258/ (ino 258, gen 7)
| |--- dir257/ (ino 257, gen 7)
|
|--- dir259/ (ino 259, gen 7)
Send snapshot:
. (ino 256, gen 3)
|--- file258 (ino 258, gen 10)
|
|--- new_dir259/ (ino 259, gen 10)
|--- dir257/ (ino 257, gen 7)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, its new parent is created using its orphan
name (o257-21-0), and the rename operation for inode 257 is delayed
because its new parent (inode 259) was not yet processed - this
decision to delay the rename operation does not make much sense
because the inode 259 in the send snapshot is a new inode, it's not
the same as inode 259 in the parent snapshot.
2) When processing inode 258 we end up delaying its rmdir operation,
because inode 257 was not yet renamed (moved away from the directory
inode 258 represents). We also create the new inode 258 using its
orphan name "o258-10-0", then rename it to its final name of "file258"
and then issue a truncate operation for it. However this truncate
operation contains an incorrect name, which corresponds to the orphan
name and not to the final name, which makes the receiver fail. This
happens because when we attempt to compute the inode's current name
we verify that there's another inode with the same number (258) that
has its rmdir operation pending and because of that we generate an
orphan name for the new inode 258 (we do this in the function
get_cur_path()).
Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.
The following steps reproduce this example scenario.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir257
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/dir257 /mnt/dir258/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/dir258/dir257 /mnt/dir257
$ rmdir /mnt/dir258
$ rmdir /mnt/dir259
# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after very inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.
$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmo/1.snap
$ btrfs receive /mnt -f /tmo/2.snap -vv
receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
utimes
mkdir o259-10-0
rename dir258 -> o258-7-0
utimes
mkfile o258-10-0
rename o258-10-0 -> file258
utimes
truncate o258-10-0 size=0
ERROR: truncate o258-10-0 failed: No such file or directory
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-01-11 10:15:39 +08:00
|
|
|
u64 ino_gen = parent_ref->dir_gen;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
u64 parent_ino_before, parent_ino_after;
|
|
|
|
struct fs_path *path_before = NULL;
|
|
|
|
struct fs_path *path_after = NULL;
|
|
|
|
int len1, len2;
|
|
|
|
|
|
|
|
path_after = fs_path_alloc();
|
2014-03-28 04:14:01 +08:00
|
|
|
path_before = fs_path_alloc();
|
|
|
|
if (!path_after || !path_before) {
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2014-03-19 22:20:54 +08:00
|
|
|
/*
|
2014-03-28 04:14:01 +08:00
|
|
|
* Our current directory inode may not yet be renamed/moved because some
|
|
|
|
* ancestor (immediate or not) has to be renamed/moved first. So find if
|
|
|
|
* such ancestor exists and make sure our own rename/move happens after
|
2015-03-28 01:50:45 +08:00
|
|
|
* that ancestor is processed to avoid path build infinite loops (done
|
|
|
|
* at get_cur_path()).
|
2014-03-19 22:20:54 +08:00
|
|
|
*/
|
2014-03-28 04:14:01 +08:00
|
|
|
while (ino > BTRFS_FIRST_FREE_OBJECTID) {
|
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.
Here follows an example where this happens.
Parent snapshot:
. (ino 256, gen 3)
|--- dir258/ (ino 258, gen 7)
| |--- dir257/ (ino 257, gen 7)
|
|--- dir259/ (ino 259, gen 7)
Send snapshot:
. (ino 256, gen 3)
|--- file258 (ino 258, gen 10)
|
|--- new_dir259/ (ino 259, gen 10)
|--- dir257/ (ino 257, gen 7)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, its new parent is created using its orphan
name (o257-21-0), and the rename operation for inode 257 is delayed
because its new parent (inode 259) was not yet processed - this
decision to delay the rename operation does not make much sense
because the inode 259 in the send snapshot is a new inode, it's not
the same as inode 259 in the parent snapshot.
2) When processing inode 258 we end up delaying its rmdir operation,
because inode 257 was not yet renamed (moved away from the directory
inode 258 represents). We also create the new inode 258 using its
orphan name "o258-10-0", then rename it to its final name of "file258"
and then issue a truncate operation for it. However this truncate
operation contains an incorrect name, which corresponds to the orphan
name and not to the final name, which makes the receiver fail. This
happens because when we attempt to compute the inode's current name
we verify that there's another inode with the same number (258) that
has its rmdir operation pending and because of that we generate an
orphan name for the new inode 258 (we do this in the function
get_cur_path()).
Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.
The following steps reproduce this example scenario.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir257
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/dir257 /mnt/dir258/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/dir258/dir257 /mnt/dir257
$ rmdir /mnt/dir258
$ rmdir /mnt/dir259
# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after very inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.
$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmo/1.snap
$ btrfs receive /mnt -f /tmo/2.snap -vv
receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
utimes
mkdir o259-10-0
rename dir258 -> o258-7-0
utimes
mkfile o258-10-0
rename o258-10-0 -> file258
utimes
truncate o258-10-0 size=0
ERROR: truncate o258-10-0 failed: No such file or directory
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-01-11 10:15:39 +08:00
|
|
|
u64 parent_ino_after_gen;
|
|
|
|
|
2014-03-28 04:14:01 +08:00
|
|
|
if (is_waiting_for_move(sctx, ino)) {
|
2015-03-28 01:50:45 +08:00
|
|
|
/*
|
|
|
|
* If the current inode is an ancestor of ino in the
|
|
|
|
* parent root, we need to delay the rename of the
|
|
|
|
* current inode, otherwise don't delayed the rename
|
|
|
|
* because we can end up with a circular dependency
|
|
|
|
* of renames, resulting in some directories never
|
|
|
|
* getting the respective rename operations issued in
|
|
|
|
* the send stream or getting into infinite path build
|
|
|
|
* loops.
|
|
|
|
*/
|
|
|
|
ret = is_ancestor(sctx->parent_root,
|
|
|
|
sctx->cur_ino, sctx->cur_inode_gen,
|
|
|
|
ino, path_before);
|
Btrfs: incremental send, fix invalid paths for rename operations
Example scenario:
Parent snapshot:
. (ino 277)
|---- tmp/ (ino 278)
|---- pre/ (ino 280)
| |---- wait_dir/ (ino 281)
|
|---- desc/ (ino 282)
|---- ance/ (ino 283)
| |---- below_ance/ (ino 279)
|
|---- other_dir/ (ino 284)
Send snapshot:
. (ino 277)
|---- tmp/ (ino 278)
|---- other_dir/ (ino 284)
|---- below_ance/ (ino 279)
| |---- pre/ (ino 280)
|
|---- wait_dir/ (ino 281)
|---- desc/ (ino 282)
|---- ance/ (ino 283)
While computing the send stream the following steps happen:
1) While processing inode 279 we end up delaying its rename operation
because its new parent in the send snapshot, inode 284, was not
yet processed and therefore not yet renamed;
2) Later when processing inode 280 we end up renaming it immediately to
"ance/below_once/pre" and not delay its rename operation because its
new parent (inode 279 in the send snapshot) has its rename operation
delayed and inode 280 is not an encestor of inode 279 (its parent in
the send snapshot) in the parent snapshot;
3) When processing inode 281 we end up delaying its rename operation
because its new parent in the send snapshot, inode 284, was not yet
processed and therefore not yet renamed;
4) When processing inode 282 we do not delay its rename operation because
its parent in the send snapshot, inode 281, already has its own rename
operation delayed and our current inode (282) is not an ancestor of
inode 281 in the parent snapshot. Therefore inode 282 is renamed to
"ance/below_ance/pre/wait_dir";
5) When processing inode 283 we realize that we can rename it because one
of its ancestors in the send snapshot, inode 281, has its rename
operation delayed and inode 283 is not an ancestor of inode 281 in the
parent snapshot. So a rename operation to rename inode 283 to
"ance/below_ance/pre/wait_dir/desc/ance" is issued. This path is
invalid due to a missing path building loop that was undetected by
the incremental send implementation, as inode 283 ends up getting
included twice in the path (once with its path in the parent snapshot).
Therefore its rename operation must wait before the ancestor inode 284
is renamed.
Fix this by not terminating the rename dependency checks when we find an
ancestor, in the send snapshot, that has its rename operation delayed. So
that we continue doing the same checks if the current inode is not an
ancestor, in the parent snapshot, of an ancestor in the send snapshot we
are processing in the loop.
The problem and reproducer were reported by Robbie Ko, as part of a patch
titled "Btrfs: incremental send, avoid ancestor rename to descendant".
However the fix was unnecessarily complicated and can be addressed with
much less code and effort.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-06-27 23:54:44 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
2014-03-28 04:14:01 +08:00
|
|
|
}
|
2014-03-19 22:20:54 +08:00
|
|
|
|
|
|
|
fs_path_reset(path_before);
|
|
|
|
fs_path_reset(path_after);
|
|
|
|
|
|
|
|
ret = get_first_ref(sctx->send_root, ino, &parent_ino_after,
|
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.
Here follows an example where this happens.
Parent snapshot:
. (ino 256, gen 3)
|--- dir258/ (ino 258, gen 7)
| |--- dir257/ (ino 257, gen 7)
|
|--- dir259/ (ino 259, gen 7)
Send snapshot:
. (ino 256, gen 3)
|--- file258 (ino 258, gen 10)
|
|--- new_dir259/ (ino 259, gen 10)
|--- dir257/ (ino 257, gen 7)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, its new parent is created using its orphan
name (o257-21-0), and the rename operation for inode 257 is delayed
because its new parent (inode 259) was not yet processed - this
decision to delay the rename operation does not make much sense
because the inode 259 in the send snapshot is a new inode, it's not
the same as inode 259 in the parent snapshot.
2) When processing inode 258 we end up delaying its rmdir operation,
because inode 257 was not yet renamed (moved away from the directory
inode 258 represents). We also create the new inode 258 using its
orphan name "o258-10-0", then rename it to its final name of "file258"
and then issue a truncate operation for it. However this truncate
operation contains an incorrect name, which corresponds to the orphan
name and not to the final name, which makes the receiver fail. This
happens because when we attempt to compute the inode's current name
we verify that there's another inode with the same number (258) that
has its rmdir operation pending and because of that we generate an
orphan name for the new inode 258 (we do this in the function
get_cur_path()).
Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.
The following steps reproduce this example scenario.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir257
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/dir257 /mnt/dir258/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/dir258/dir257 /mnt/dir257
$ rmdir /mnt/dir258
$ rmdir /mnt/dir259
# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after very inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.
$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmo/1.snap
$ btrfs receive /mnt -f /tmo/2.snap -vv
receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
utimes
mkdir o259-10-0
rename dir258 -> o258-7-0
utimes
mkfile o258-10-0
rename o258-10-0 -> file258
utimes
truncate o258-10-0 size=0
ERROR: truncate o258-10-0 failed: No such file or directory
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-01-11 10:15:39 +08:00
|
|
|
&parent_ino_after_gen, path_after);
|
2014-03-19 22:20:54 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = get_first_ref(sctx->parent_root, ino, &parent_ino_before,
|
|
|
|
NULL, path_before);
|
2014-03-28 04:14:01 +08:00
|
|
|
if (ret < 0 && ret != -ENOENT) {
|
2014-03-19 22:20:54 +08:00
|
|
|
goto out;
|
2014-03-28 04:14:01 +08:00
|
|
|
} else if (ret == -ENOENT) {
|
2014-10-03 02:17:32 +08:00
|
|
|
ret = 0;
|
2014-03-28 04:14:01 +08:00
|
|
|
break;
|
2014-03-19 22:20:54 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
len1 = fs_path_len(path_before);
|
|
|
|
len2 = fs_path_len(path_after);
|
2014-03-28 04:14:01 +08:00
|
|
|
if (ino > sctx->cur_ino &&
|
|
|
|
(parent_ino_before != parent_ino_after || len1 != len2 ||
|
|
|
|
memcmp(path_before->start, path_after->start, len1))) {
|
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.
Here follows an example where this happens.
Parent snapshot:
. (ino 256, gen 3)
|--- dir258/ (ino 258, gen 7)
| |--- dir257/ (ino 257, gen 7)
|
|--- dir259/ (ino 259, gen 7)
Send snapshot:
. (ino 256, gen 3)
|--- file258 (ino 258, gen 10)
|
|--- new_dir259/ (ino 259, gen 10)
|--- dir257/ (ino 257, gen 7)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, its new parent is created using its orphan
name (o257-21-0), and the rename operation for inode 257 is delayed
because its new parent (inode 259) was not yet processed - this
decision to delay the rename operation does not make much sense
because the inode 259 in the send snapshot is a new inode, it's not
the same as inode 259 in the parent snapshot.
2) When processing inode 258 we end up delaying its rmdir operation,
because inode 257 was not yet renamed (moved away from the directory
inode 258 represents). We also create the new inode 258 using its
orphan name "o258-10-0", then rename it to its final name of "file258"
and then issue a truncate operation for it. However this truncate
operation contains an incorrect name, which corresponds to the orphan
name and not to the final name, which makes the receiver fail. This
happens because when we attempt to compute the inode's current name
we verify that there's another inode with the same number (258) that
has its rmdir operation pending and because of that we generate an
orphan name for the new inode 258 (we do this in the function
get_cur_path()).
Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.
The following steps reproduce this example scenario.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir257
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/dir257 /mnt/dir258/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/dir258/dir257 /mnt/dir257
$ rmdir /mnt/dir258
$ rmdir /mnt/dir259
# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after very inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.
$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmo/1.snap
$ btrfs receive /mnt -f /tmo/2.snap -vv
receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
utimes
mkdir o259-10-0
rename dir258 -> o258-7-0
utimes
mkfile o258-10-0
rename o258-10-0 -> file258
utimes
truncate o258-10-0 size=0
ERROR: truncate o258-10-0 failed: No such file or directory
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-01-11 10:15:39 +08:00
|
|
|
u64 parent_ino_gen;
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(sctx->parent_root, ino, &parent_ino_gen);
|
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.
Here follows an example where this happens.
Parent snapshot:
. (ino 256, gen 3)
|--- dir258/ (ino 258, gen 7)
| |--- dir257/ (ino 257, gen 7)
|
|--- dir259/ (ino 259, gen 7)
Send snapshot:
. (ino 256, gen 3)
|--- file258 (ino 258, gen 10)
|
|--- new_dir259/ (ino 259, gen 10)
|--- dir257/ (ino 257, gen 7)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, its new parent is created using its orphan
name (o257-21-0), and the rename operation for inode 257 is delayed
because its new parent (inode 259) was not yet processed - this
decision to delay the rename operation does not make much sense
because the inode 259 in the send snapshot is a new inode, it's not
the same as inode 259 in the parent snapshot.
2) When processing inode 258 we end up delaying its rmdir operation,
because inode 257 was not yet renamed (moved away from the directory
inode 258 represents). We also create the new inode 258 using its
orphan name "o258-10-0", then rename it to its final name of "file258"
and then issue a truncate operation for it. However this truncate
operation contains an incorrect name, which corresponds to the orphan
name and not to the final name, which makes the receiver fail. This
happens because when we attempt to compute the inode's current name
we verify that there's another inode with the same number (258) that
has its rmdir operation pending and because of that we generate an
orphan name for the new inode 258 (we do this in the function
get_cur_path()).
Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.
The following steps reproduce this example scenario.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir257
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/dir257 /mnt/dir258/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/dir258/dir257 /mnt/dir257
$ rmdir /mnt/dir258
$ rmdir /mnt/dir259
# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after very inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.
$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmo/1.snap
$ btrfs receive /mnt -f /tmo/2.snap -vv
receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
utimes
mkdir o259-10-0
rename dir258 -> o258-7-0
utimes
mkfile o258-10-0
rename o258-10-0 -> file258
utimes
truncate o258-10-0 size=0
ERROR: truncate o258-10-0 failed: No such file or directory
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-01-11 10:15:39 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ino_gen == parent_ino_gen) {
|
|
|
|
ret = 1;
|
|
|
|
break;
|
|
|
|
}
|
2014-03-19 22:20:54 +08:00
|
|
|
}
|
|
|
|
ino = parent_ino_after;
|
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.
Here follows an example where this happens.
Parent snapshot:
. (ino 256, gen 3)
|--- dir258/ (ino 258, gen 7)
| |--- dir257/ (ino 257, gen 7)
|
|--- dir259/ (ino 259, gen 7)
Send snapshot:
. (ino 256, gen 3)
|--- file258 (ino 258, gen 10)
|
|--- new_dir259/ (ino 259, gen 10)
|--- dir257/ (ino 257, gen 7)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, its new parent is created using its orphan
name (o257-21-0), and the rename operation for inode 257 is delayed
because its new parent (inode 259) was not yet processed - this
decision to delay the rename operation does not make much sense
because the inode 259 in the send snapshot is a new inode, it's not
the same as inode 259 in the parent snapshot.
2) When processing inode 258 we end up delaying its rmdir operation,
because inode 257 was not yet renamed (moved away from the directory
inode 258 represents). We also create the new inode 258 using its
orphan name "o258-10-0", then rename it to its final name of "file258"
and then issue a truncate operation for it. However this truncate
operation contains an incorrect name, which corresponds to the orphan
name and not to the final name, which makes the receiver fail. This
happens because when we attempt to compute the inode's current name
we verify that there's another inode with the same number (258) that
has its rmdir operation pending and because of that we generate an
orphan name for the new inode 258 (we do this in the function
get_cur_path()).
Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.
The following steps reproduce this example scenario.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir257
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/dir257 /mnt/dir258/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/dir258/dir257 /mnt/dir257
$ rmdir /mnt/dir258
$ rmdir /mnt/dir259
# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after very inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.
$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmo/1.snap
$ btrfs receive /mnt -f /tmo/2.snap -vv
receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
utimes
mkdir o259-10-0
rename dir258 -> o258-7-0
utimes
mkfile o258-10-0
rename o258-10-0 -> file258
utimes
truncate o258-10-0 size=0
ERROR: truncate o258-10-0 failed: No such file or directory
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-01-11 10:15:39 +08:00
|
|
|
ino_gen = parent_ino_after_gen;
|
2014-03-19 22:20:54 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
out:
|
|
|
|
fs_path_free(path_before);
|
|
|
|
fs_path_free(path_after);
|
|
|
|
|
2014-03-28 04:14:01 +08:00
|
|
|
if (ret == 1) {
|
|
|
|
ret = add_pending_dir_move(sctx,
|
|
|
|
sctx->cur_ino,
|
|
|
|
sctx->cur_inode_gen,
|
|
|
|
ino,
|
|
|
|
&sctx->new_refs,
|
2015-03-01 06:29:22 +08:00
|
|
|
&sctx->deleted_refs,
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
is_orphan);
|
2014-03-28 04:14:01 +08:00
|
|
|
if (!ret)
|
|
|
|
ret = 1;
|
|
|
|
}
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-06-23 03:03:51 +08:00
|
|
|
static int update_ref_path(struct send_ctx *sctx, struct recorded_ref *ref)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct fs_path *new_path;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Our reference's name member points to its full_path member string, so
|
|
|
|
* we use here a new path.
|
|
|
|
*/
|
|
|
|
new_path = fs_path_alloc();
|
|
|
|
if (!new_path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, ref->dir, ref->dir_gen, new_path);
|
|
|
|
if (ret < 0) {
|
|
|
|
fs_path_free(new_path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
ret = fs_path_add(new_path, ref->name, ref->name_len);
|
|
|
|
if (ret < 0) {
|
|
|
|
fs_path_free(new_path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
fs_path_free(ref->full_path);
|
|
|
|
set_ref_path(ref, new_path);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
btrfs: send, recompute reference path after orphanization of a directory
During an incremental send, when an inode has multiple new references we
might end up emitting rename operations for orphanizations that have a
source path that is no longer valid due to a previous orphanization of
some directory inode. This causes the receiver to fail since it tries
to rename a path that does not exists.
Example reproducer:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/f1
touch /mnt/sdi/f2
mkdir /mnt/sdi/d1
mkdir /mnt/sdi/d1/d2
# Filesystem looks like:
#
# . (ino 256)
# |----- f1 (ino 257)
# |----- f2 (ino 258)
# |----- d1/ (ino 259)
# |----- d2/ (ino 260)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now do a series of changes such that:
#
# *) inode 258 has one new hardlink and the previous name changed
#
# *) both names conflict with the old names of two other inodes:
#
# 1) the new name "d1" conflicts with the old name of inode 259,
# under directory inode 256 (root)
#
# 2) the new name "d2" conflicts with the old name of inode 260
# under directory inode 259
#
# *) inodes 259 and 260 now have the old names of inode 258
#
# *) inode 257 is now located under inode 260 - an inode with a number
# smaller than the inode (258) for which we created a second hard
# link and swapped its names with inodes 259 and 260
#
ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
# Swap d1 and f2.
mv /mnt/sdi/d1 /mnt/sdi/tmp
mv /mnt/sdi/f2 /mnt/sdi/d1
mv /mnt/sdi/tmp /mnt/sdi/f2
# Swap d2 and f2_link
mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
# Filesystem now looks like:
#
# . (ino 256)
# |----- d1 (ino 258)
# |----- f2/ (ino 259)
# |----- f2_link/ (ino 260)
# | |----- f1 (ino 257)
# |
# |----- d2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When executed the receive of the incremental stream fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
This happens because:
1) When processing inode 257 we end up computing the name for inode 259
because it is an ancestor in the send snapshot, and at that point it
still has its old name, "d1", from the parent snapshot because inode
259 was not yet processed. We then cache that name, which is valid
until we start processing inode 259 (or set the progress to 260 after
processing its references);
2) Later we start processing inode 258 and collecting all its new
references into the list sctx->new_refs. The first reference in the
list happens to be the reference for name "d1" while the reference for
name "d2" is next (the last element of the list).
We compute the full path "d1/d2" for this second reference and store
it in the reference (its ->full_path member). The path used for the
new parent directory was "d1" and not "f2" because inode 259, the
new parent, was not yet processed;
3) When we start processing the new references at process_recorded_refs()
we start with the first reference in the list, for the new name "d1".
Because there is a conflicting inode that was not yet processed, which
is directory inode 259, we orphanize it, renaming it from "d1" to
"o259-6-0";
4) Then we start processing the new reference for name "d2", and we
realize it conflicts with the reference of inode 260 in the parent
snapshot. So we issue an orphanization operation for inode 260 by
emitting a rename operation with a destination path of "o260-6-0"
and a source path of "d1/d2" - this source path is the value we
stored in the reference earlier at step 2), corresponding to the
->full_path member of the reference, however that path is no longer
valid due to the orphanization of the directory inode 259 in step 3).
This makes the receiver fail since the path does not exists, it should
have been "o259-6-0/d2".
Fix this by recomputing the full path of a reference before emitting an
orphanization if we previously orphanized any directory, since that
directory could be a parent in the new path. This is a rare scenario so
keeping it simple and not checking if that previously orphanized directory
is in fact an ancestor of the inode we are trying to orphanize.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 21:13:30 +08:00
|
|
|
/*
|
|
|
|
* When processing the new references for an inode we may orphanize an existing
|
|
|
|
* directory inode because its old name conflicts with one of the new references
|
|
|
|
* of the current inode. Later, when processing another new reference of our
|
|
|
|
* inode, we might need to orphanize another inode, but the path we have in the
|
|
|
|
* reference reflects the pre-orphanization name of the directory we previously
|
|
|
|
* orphanized. For example:
|
|
|
|
*
|
|
|
|
* parent snapshot looks like:
|
|
|
|
*
|
|
|
|
* . (ino 256)
|
|
|
|
* |----- f1 (ino 257)
|
|
|
|
* |----- f2 (ino 258)
|
|
|
|
* |----- d1/ (ino 259)
|
|
|
|
* |----- d2/ (ino 260)
|
|
|
|
*
|
|
|
|
* send snapshot looks like:
|
|
|
|
*
|
|
|
|
* . (ino 256)
|
|
|
|
* |----- d1 (ino 258)
|
|
|
|
* |----- f2/ (ino 259)
|
|
|
|
* |----- f2_link/ (ino 260)
|
|
|
|
* | |----- f1 (ino 257)
|
|
|
|
* |
|
|
|
|
* |----- d2 (ino 258)
|
|
|
|
*
|
|
|
|
* When processing inode 257 we compute the name for inode 259 as "d1", and we
|
|
|
|
* cache it in the name cache. Later when we start processing inode 258, when
|
|
|
|
* collecting all its new references we set a full path of "d1/d2" for its new
|
|
|
|
* reference with name "d2". When we start processing the new references we
|
|
|
|
* start by processing the new reference with name "d1", and this results in
|
|
|
|
* orphanizing inode 259, since its old reference causes a conflict. Then we
|
|
|
|
* move on the next new reference, with name "d2", and we find out we must
|
|
|
|
* orphanize inode 260, as its old reference conflicts with ours - but for the
|
|
|
|
* orphanization we use a source path corresponding to the path we stored in the
|
|
|
|
* new reference, which is "d1/d2" and not "o259-6-0/d2" - this makes the
|
|
|
|
* receiver fail since the path component "d1/" no longer exists, it was renamed
|
|
|
|
* to "o259-6-0/" when processing the previous new reference. So in this case we
|
|
|
|
* must recompute the path in the new reference and use it for the new
|
|
|
|
* orphanization operation.
|
|
|
|
*/
|
|
|
|
static int refresh_ref_path(struct send_ctx *sctx, struct recorded_ref *ref)
|
|
|
|
{
|
|
|
|
char *name;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
name = kmemdup(ref->name, ref->name_len, GFP_KERNEL);
|
|
|
|
if (!name)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
fs_path_reset(ref->full_path);
|
|
|
|
ret = get_cur_path(sctx, ref->dir, ref->dir_gen, ref->full_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = fs_path_add(ref->full_path, name, ref->name_len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Update the reference's base name pointer. */
|
|
|
|
set_ref_path(ref, ref->full_path);
|
|
|
|
out:
|
|
|
|
kfree(name);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* This does all the move/link/unlink/rmdir magic.
|
|
|
|
*/
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
static int process_recorded_refs(struct send_ctx *sctx, int *pending_move)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct recorded_ref *cur;
|
2012-07-28 16:42:24 +08:00
|
|
|
struct recorded_ref *cur2;
|
2023-08-10 11:00:22 +08:00
|
|
|
LIST_HEAD(check_dirs);
|
2012-07-26 05:19:24 +08:00
|
|
|
struct fs_path *valid_path = NULL;
|
2012-07-26 07:21:10 +08:00
|
|
|
u64 ow_inode = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 ow_gen;
|
Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-06-23 03:03:51 +08:00
|
|
|
u64 ow_mode;
|
2012-07-26 05:19:24 +08:00
|
|
|
int did_overwrite = 0;
|
|
|
|
int is_orphan = 0;
|
2014-02-17 05:01:39 +08:00
|
|
|
u64 last_dir_ino_rm = 0;
|
2015-03-01 06:29:22 +08:00
|
|
|
bool can_rename = true;
|
Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-06-23 03:03:51 +08:00
|
|
|
bool orphanized_dir = false;
|
Btrfs: incremental send, fix invalid path for unlink commands
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- file1 (ino 259)
| |--- file3 (ino 261)
|
|--- dir3/ (ino 262)
|--- file22 (ino 260)
|--- dir4/ (ino 263)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
|--- dir3 (ino 260)
|--- file3/ (ino 262)
|--- dir4/ (ino 263)
|--- file11 (ino 269)
|--- file33 (ino 261)
When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:
receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
utimes
utimes dir1
utimes dir1/dir2
link dir1/dir3/dir4/file11 -> dir1/dir2/file1
unlink dir1/dir2/file1
utimes dir1/dir2
truncate dir1/dir3/dir4/file11 size=0
utimes dir1/dir3/dir4/file11
rename dir1/dir3 -> o262-7-0
link dir1/dir3 -> o262-7-0/file22
unlink dir1/dir3/file22
ERROR: unlink dir1/dir3/file22 failed. Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) Before we start processing the new and deleted references for inode
260, we compute the full path of the deleted reference
("dir1/dir3/file22") and cache it in the list of deleted references
for our inode.
2) We then start processing the new references for inode 260, for which
there is only one new, located at "dir1/dir3". When processing this
new reference, we check that inode 262, which was not yet processed,
collides with the new reference and because of that we orphanize
inode 262 so its new full path becomes "o262-7-0".
3) After the orphanization of inode 262, we create the new reference for
inode 260 by issuing a link command with a target path of "dir1/dir3"
and a source path of "o262-7-0/file22".
4) We then start processing the deleted references for inode 260, for
which there is only one with the base name of "file22", and issue
an unlink operation containing the target path computed at step 1,
which is wrong because that path no longer exists and should be
replaced with "o262-7-0/file22".
So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-13 21:13:11 +08:00
|
|
|
bool orphanized_ancestor = false;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "process_recorded_refs %llu", sctx->cur_ino);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-08-01 20:48:59 +08:00
|
|
|
/*
|
|
|
|
* This should never happen as the root dir always has the same ref
|
|
|
|
* which is always '..'
|
|
|
|
*/
|
|
|
|
BUG_ON(sctx->cur_ino <= BTRFS_FIRST_FREE_OBJECTID);
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
valid_path = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!valid_path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First, check if the first ref of the current inode was overwritten
|
|
|
|
* before. If yes, we know that the current inode was already orphanized
|
|
|
|
* and thus use the orphan name. If not, we can use get_cur_path to
|
|
|
|
* get the path of the first ref as it would like while receiving at
|
|
|
|
* this point in time.
|
|
|
|
* New inodes are always orphan at the beginning, so force to use the
|
|
|
|
* orphan name in this case.
|
|
|
|
* The first ref is stored in valid_path and will be updated if it
|
|
|
|
* gets moved around.
|
|
|
|
*/
|
|
|
|
if (!sctx->cur_inode_new) {
|
|
|
|
ret = did_overwrite_first_ref(sctx, sctx->cur_ino,
|
|
|
|
sctx->cur_inode_gen);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret)
|
|
|
|
did_overwrite = 1;
|
|
|
|
}
|
|
|
|
if (sctx->cur_inode_new || did_overwrite) {
|
|
|
|
ret = gen_unique_name(sctx, sctx->cur_ino,
|
|
|
|
sctx->cur_inode_gen, valid_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
is_orphan = 1;
|
|
|
|
} else {
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen,
|
|
|
|
valid_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.
The following reproducer triggers such scenario:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/a
touch /mnt/sdi/b
mkdir /mnt/sdi/testdir
# We want "a" to have a lower inode number then "testdir" (257 vs 259).
mv /mnt/sdi/a /mnt/sdi/testdir/a
# Filesystem looks like:
#
# . (ino 256)
# |----- testdir/ (ino 259)
# | |----- a (ino 257)
# |
# |----- b (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now rename 259 to "testdir_2", then change the name of 257 to
# "testdir" and make it a direct descendant of the root inode (256).
# Also create a new link for inode 257 with the old name of inode 258.
# By swapping the names and location of several inodes and create a
# nasty dependency chain of rename and link operations.
mv /mnt/sdi/testdir/a /mnt/sdi/a2
touch /mnt/sdi/testdir/a
mv /mnt/sdi/b /mnt/sdi/b2
ln /mnt/sdi/a2 /mnt/sdi/b
mv /mnt/sdi/testdir /mnt/sdi/testdir_2
mv /mnt/sdi/a2 /mnt/sdi/testdir
# Filesystem now looks like:
#
# . (ino 256)
# |----- testdir_2/ (ino 259)
# | |----- a (ino 260)
# |
# |----- testdir (ino 257)
# |----- b (ino 257)
# |----- b2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When running the reproducer, the receive of the incremental send stream
fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: link b -> o259-6-0/a failed: No such file or directory
The problem happens because of the following:
1) Before we start iterating the list of new references for inode 257,
we generate its current path and store it at @valid_path, done at
the very beginning of process_recorded_refs(). The generated path
is "o259-6-0/a", containing the orphanized name for inode 259;
2) Then we iterate over the list of new references, which has the
references "b" and "testdir" in that specific order;
3) We process reference "b" first, because it is in the list before
reference "testdir". We then issue a link operation to create
the new reference "b" using a target path corresponding to the
content at @valid_path, which corresponds to "o259-6-0/a".
However we haven't yet orphanized inode 259, its name is still
"testdir", and not "o259-6-0". The orphanization of 259 did not
happen yet because we will process the reference named "testdir"
for inode 257 only in the next iteration of the loop that goes
over the list of new references.
Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 21:13:29 +08:00
|
|
|
/*
|
|
|
|
* Before doing any rename and link operations, do a first pass on the
|
|
|
|
* new references to orphanize any unprocessed inodes that may have a
|
|
|
|
* reference that conflicts with one of the new references of the current
|
|
|
|
* inode. This needs to happen first because a new reference may conflict
|
|
|
|
* with the old reference of a parent directory, so we must make sure
|
|
|
|
* that the path used for link and rename commands don't use an
|
|
|
|
* orphanized name when an ancestor was not yet orphanized.
|
|
|
|
*
|
|
|
|
* Example:
|
|
|
|
*
|
|
|
|
* Parent snapshot:
|
|
|
|
*
|
|
|
|
* . (ino 256)
|
|
|
|
* |----- testdir/ (ino 259)
|
|
|
|
* | |----- a (ino 257)
|
|
|
|
* |
|
|
|
|
* |----- b (ino 258)
|
|
|
|
*
|
|
|
|
* Send snapshot:
|
|
|
|
*
|
|
|
|
* . (ino 256)
|
|
|
|
* |----- testdir_2/ (ino 259)
|
|
|
|
* | |----- a (ino 260)
|
|
|
|
* |
|
|
|
|
* |----- testdir (ino 257)
|
|
|
|
* |----- b (ino 257)
|
|
|
|
* |----- b2 (ino 258)
|
|
|
|
*
|
|
|
|
* Processing the new reference for inode 257 with name "b" may happen
|
|
|
|
* before processing the new reference with name "testdir". If so, we
|
|
|
|
* must make sure that by the time we send a link command to create the
|
|
|
|
* hard link "b", inode 259 was already orphanized, since the generated
|
|
|
|
* path in "valid_path" already contains the orphanized name for 259.
|
|
|
|
* We are processing inode 257, so only later when processing 259 we do
|
|
|
|
* the rename operation to change its temporary (orphanized) name to
|
|
|
|
* "testdir_2".
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
list_for_each_entry(cur, &sctx->new_refs, list) {
|
2023-01-11 19:36:05 +08:00
|
|
|
ret = get_cur_inode_state(sctx, cur->dir, cur->dir_gen, NULL, NULL);
|
2012-07-28 16:42:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.
The following reproducer triggers such scenario:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/a
touch /mnt/sdi/b
mkdir /mnt/sdi/testdir
# We want "a" to have a lower inode number then "testdir" (257 vs 259).
mv /mnt/sdi/a /mnt/sdi/testdir/a
# Filesystem looks like:
#
# . (ino 256)
# |----- testdir/ (ino 259)
# | |----- a (ino 257)
# |
# |----- b (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now rename 259 to "testdir_2", then change the name of 257 to
# "testdir" and make it a direct descendant of the root inode (256).
# Also create a new link for inode 257 with the old name of inode 258.
# By swapping the names and location of several inodes and create a
# nasty dependency chain of rename and link operations.
mv /mnt/sdi/testdir/a /mnt/sdi/a2
touch /mnt/sdi/testdir/a
mv /mnt/sdi/b /mnt/sdi/b2
ln /mnt/sdi/a2 /mnt/sdi/b
mv /mnt/sdi/testdir /mnt/sdi/testdir_2
mv /mnt/sdi/a2 /mnt/sdi/testdir
# Filesystem now looks like:
#
# . (ino 256)
# |----- testdir_2/ (ino 259)
# | |----- a (ino 260)
# |
# |----- testdir (ino 257)
# |----- b (ino 257)
# |----- b2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When running the reproducer, the receive of the incremental send stream
fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: link b -> o259-6-0/a failed: No such file or directory
The problem happens because of the following:
1) Before we start iterating the list of new references for inode 257,
we generate its current path and store it at @valid_path, done at
the very beginning of process_recorded_refs(). The generated path
is "o259-6-0/a", containing the orphanized name for inode 259;
2) Then we iterate over the list of new references, which has the
references "b" and "testdir" in that specific order;
3) We process reference "b" first, because it is in the list before
reference "testdir". We then issue a link operation to create
the new reference "b" using a target path corresponding to the
content at @valid_path, which corresponds to "o259-6-0/a".
However we haven't yet orphanized inode 259, its name is still
"testdir", and not "o259-6-0". The orphanization of 259 did not
happen yet because we will process the reference named "testdir"
for inode 257 only in the next iteration of the loop that goes
over the list of new references.
Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 21:13:29 +08:00
|
|
|
if (ret == inode_state_will_create)
|
|
|
|
continue;
|
2012-07-28 16:42:24 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.
The following reproducer triggers such scenario:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/a
touch /mnt/sdi/b
mkdir /mnt/sdi/testdir
# We want "a" to have a lower inode number then "testdir" (257 vs 259).
mv /mnt/sdi/a /mnt/sdi/testdir/a
# Filesystem looks like:
#
# . (ino 256)
# |----- testdir/ (ino 259)
# | |----- a (ino 257)
# |
# |----- b (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now rename 259 to "testdir_2", then change the name of 257 to
# "testdir" and make it a direct descendant of the root inode (256).
# Also create a new link for inode 257 with the old name of inode 258.
# By swapping the names and location of several inodes and create a
# nasty dependency chain of rename and link operations.
mv /mnt/sdi/testdir/a /mnt/sdi/a2
touch /mnt/sdi/testdir/a
mv /mnt/sdi/b /mnt/sdi/b2
ln /mnt/sdi/a2 /mnt/sdi/b
mv /mnt/sdi/testdir /mnt/sdi/testdir_2
mv /mnt/sdi/a2 /mnt/sdi/testdir
# Filesystem now looks like:
#
# . (ino 256)
# |----- testdir_2/ (ino 259)
# | |----- a (ino 260)
# |
# |----- testdir (ino 257)
# |----- b (ino 257)
# |----- b2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When running the reproducer, the receive of the incremental send stream
fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: link b -> o259-6-0/a failed: No such file or directory
The problem happens because of the following:
1) Before we start iterating the list of new references for inode 257,
we generate its current path and store it at @valid_path, done at
the very beginning of process_recorded_refs(). The generated path
is "o259-6-0/a", containing the orphanized name for inode 259;
2) Then we iterate over the list of new references, which has the
references "b" and "testdir" in that specific order;
3) We process reference "b" first, because it is in the list before
reference "testdir". We then issue a link operation to create
the new reference "b" using a target path corresponding to the
content at @valid_path, which corresponds to "o259-6-0/a".
However we haven't yet orphanized inode 259, its name is still
"testdir", and not "o259-6-0". The orphanization of 259 did not
happen yet because we will process the reference named "testdir"
for inode 257 only in the next iteration of the loop that goes
over the list of new references.
Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 21:13:29 +08:00
|
|
|
* Check if this new ref would overwrite the first ref of another
|
|
|
|
* unprocessed inode. If yes, orphanize the overwritten inode.
|
|
|
|
* If we find an overwritten ref that is not the first ref,
|
|
|
|
* simply unlink it.
|
2012-07-26 05:19:24 +08:00
|
|
|
*/
|
|
|
|
ret = will_overwrite_ref(sctx, cur->dir, cur->dir_gen,
|
|
|
|
cur->name, cur->name_len,
|
Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-06-23 03:03:51 +08:00
|
|
|
&ow_inode, &ow_gen, &ow_mode);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = is_first_ref(sctx->parent_root,
|
|
|
|
ow_inode, cur->dir, cur->name,
|
|
|
|
cur->name_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
2015-03-12 23:16:20 +08:00
|
|
|
struct name_cache_entry *nce;
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
struct waiting_dir_move *wdm;
|
2015-03-12 23:16:20 +08:00
|
|
|
|
btrfs: send, recompute reference path after orphanization of a directory
During an incremental send, when an inode has multiple new references we
might end up emitting rename operations for orphanizations that have a
source path that is no longer valid due to a previous orphanization of
some directory inode. This causes the receiver to fail since it tries
to rename a path that does not exists.
Example reproducer:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/f1
touch /mnt/sdi/f2
mkdir /mnt/sdi/d1
mkdir /mnt/sdi/d1/d2
# Filesystem looks like:
#
# . (ino 256)
# |----- f1 (ino 257)
# |----- f2 (ino 258)
# |----- d1/ (ino 259)
# |----- d2/ (ino 260)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now do a series of changes such that:
#
# *) inode 258 has one new hardlink and the previous name changed
#
# *) both names conflict with the old names of two other inodes:
#
# 1) the new name "d1" conflicts with the old name of inode 259,
# under directory inode 256 (root)
#
# 2) the new name "d2" conflicts with the old name of inode 260
# under directory inode 259
#
# *) inodes 259 and 260 now have the old names of inode 258
#
# *) inode 257 is now located under inode 260 - an inode with a number
# smaller than the inode (258) for which we created a second hard
# link and swapped its names with inodes 259 and 260
#
ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
# Swap d1 and f2.
mv /mnt/sdi/d1 /mnt/sdi/tmp
mv /mnt/sdi/f2 /mnt/sdi/d1
mv /mnt/sdi/tmp /mnt/sdi/f2
# Swap d2 and f2_link
mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
# Filesystem now looks like:
#
# . (ino 256)
# |----- d1 (ino 258)
# |----- f2/ (ino 259)
# |----- f2_link/ (ino 260)
# | |----- f1 (ino 257)
# |
# |----- d2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When executed the receive of the incremental stream fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
This happens because:
1) When processing inode 257 we end up computing the name for inode 259
because it is an ancestor in the send snapshot, and at that point it
still has its old name, "d1", from the parent snapshot because inode
259 was not yet processed. We then cache that name, which is valid
until we start processing inode 259 (or set the progress to 260 after
processing its references);
2) Later we start processing inode 258 and collecting all its new
references into the list sctx->new_refs. The first reference in the
list happens to be the reference for name "d1" while the reference for
name "d2" is next (the last element of the list).
We compute the full path "d1/d2" for this second reference and store
it in the reference (its ->full_path member). The path used for the
new parent directory was "d1" and not "f2" because inode 259, the
new parent, was not yet processed;
3) When we start processing the new references at process_recorded_refs()
we start with the first reference in the list, for the new name "d1".
Because there is a conflicting inode that was not yet processed, which
is directory inode 259, we orphanize it, renaming it from "d1" to
"o259-6-0";
4) Then we start processing the new reference for name "d2", and we
realize it conflicts with the reference of inode 260 in the parent
snapshot. So we issue an orphanization operation for inode 260 by
emitting a rename operation with a destination path of "o260-6-0"
and a source path of "d1/d2" - this source path is the value we
stored in the reference earlier at step 2), corresponding to the
->full_path member of the reference, however that path is no longer
valid due to the orphanization of the directory inode 259 in step 3).
This makes the receiver fail since the path does not exists, it should
have been "o259-6-0/d2".
Fix this by recomputing the full path of a reference before emitting an
orphanization if we previously orphanized any directory, since that
directory could be a parent in the new path. This is a rare scenario so
keeping it simple and not checking if that previously orphanized directory
is in fact an ancestor of the inode we are trying to orphanize.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 21:13:30 +08:00
|
|
|
if (orphanized_dir) {
|
|
|
|
ret = refresh_ref_path(sctx, cur);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = orphanize_inode(sctx, ow_inode, ow_gen,
|
|
|
|
cur->full_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-06-23 03:03:51 +08:00
|
|
|
if (S_ISDIR(ow_mode))
|
|
|
|
orphanized_dir = true;
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If ow_inode has its rename operation delayed
|
|
|
|
* make sure that its orphanized name is used in
|
|
|
|
* the source path when performing its rename
|
|
|
|
* operation.
|
|
|
|
*/
|
2023-01-11 19:36:10 +08:00
|
|
|
wdm = get_waiting_dir_move(sctx, ow_inode);
|
|
|
|
if (wdm)
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
wdm->orphanized = true;
|
|
|
|
|
2015-03-12 23:16:20 +08:00
|
|
|
/*
|
|
|
|
* Make sure we clear our orphanized inode's
|
|
|
|
* name from the name cache. This is because the
|
|
|
|
* inode ow_inode might be an ancestor of some
|
|
|
|
* other inode that will be orphanized as well
|
|
|
|
* later and has an inode number greater than
|
|
|
|
* sctx->send_progress. We need to prevent
|
|
|
|
* future name lookups from using the old name
|
|
|
|
* and get instead the orphan name.
|
|
|
|
*/
|
|
|
|
nce = name_cache_search(sctx, ow_inode, ow_gen);
|
2023-01-11 19:36:18 +08:00
|
|
|
if (nce)
|
|
|
|
btrfs_lru_cache_remove(&sctx->name_cache,
|
|
|
|
&nce->entry);
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* ow_inode might currently be an ancestor of
|
|
|
|
* cur_ino, therefore compute valid_path (the
|
|
|
|
* current path of cur_ino) again because it
|
|
|
|
* might contain the pre-orphanization name of
|
|
|
|
* ow_inode, which is no longer valid.
|
|
|
|
*/
|
Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:
Parent snapshot
. (ino 256)
|
|--- f1 (ino 257)
|--- f2 (ino 258)
|--- f3 (ino 259)
Send snapshot
. (ino 256)
|
|--- f2 (ino 257)
|--- f3 (ino 258)
|--- f4 (ino 259)
|--- f5 (ino 258)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, inode 258 is orphanized (renamed to
"o258-7-0"), because its current reference has the same name as the
new reference for inode 257;
2) When processing inode 258, we iterate over all its new references,
which have the names "f3" and "f5". The first iteration sees name
"f5" and renames the inode from its orphan name ("o258-7-0") to
"f5", while the second iteration sees the name "f3" and, incorrectly,
issues a link operation with a target name matching the orphan name,
which no longer exists. The first iteration had reset the current
valid path of the inode to "f5", but in the second iteration we lost
it because we found another inode, with a higher number of 259, which
has a reference named "f3" as well, so we orphanized inode 259 and
recomputed the current valid path of inode 258 to its old orphan
name because inode 259 could be an ancestor of inode 258 and therefore
the current valid path could contain the pre-orphanization name of
inode 259. However in this case inode 259 is not an ancestor of inode
258 so the current valid path should not be recomputed.
This makes the receiver fail with the following error:
ERROR: link f3 -> o258-7-0 failed: No such file or directory
So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-07 18:41:29 +08:00
|
|
|
ret = is_ancestor(sctx->parent_root,
|
|
|
|
ow_inode, ow_gen,
|
|
|
|
sctx->cur_ino, NULL);
|
|
|
|
if (ret > 0) {
|
Btrfs: incremental send, fix invalid path for unlink commands
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- file1 (ino 259)
| |--- file3 (ino 261)
|
|--- dir3/ (ino 262)
|--- file22 (ino 260)
|--- dir4/ (ino 263)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
|--- dir3 (ino 260)
|--- file3/ (ino 262)
|--- dir4/ (ino 263)
|--- file11 (ino 269)
|--- file33 (ino 261)
When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:
receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
utimes
utimes dir1
utimes dir1/dir2
link dir1/dir3/dir4/file11 -> dir1/dir2/file1
unlink dir1/dir2/file1
utimes dir1/dir2
truncate dir1/dir3/dir4/file11 size=0
utimes dir1/dir3/dir4/file11
rename dir1/dir3 -> o262-7-0
link dir1/dir3 -> o262-7-0/file22
unlink dir1/dir3/file22
ERROR: unlink dir1/dir3/file22 failed. Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) Before we start processing the new and deleted references for inode
260, we compute the full path of the deleted reference
("dir1/dir3/file22") and cache it in the list of deleted references
for our inode.
2) We then start processing the new references for inode 260, for which
there is only one new, located at "dir1/dir3". When processing this
new reference, we check that inode 262, which was not yet processed,
collides with the new reference and because of that we orphanize
inode 262 so its new full path becomes "o262-7-0".
3) After the orphanization of inode 262, we create the new reference for
inode 260 by issuing a link command with a target path of "dir1/dir3"
and a source path of "o262-7-0/file22".
4) We then start processing the deleted references for inode 260, for
which there is only one with the base name of "file22", and issue
an unlink operation containing the target path computed at step 1,
which is wrong because that path no longer exists and should be
replaced with "o262-7-0/file22".
So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-13 21:13:11 +08:00
|
|
|
orphanized_ancestor = true;
|
Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:
Parent snapshot
. (ino 256)
|
|--- f1 (ino 257)
|--- f2 (ino 258)
|--- f3 (ino 259)
Send snapshot
. (ino 256)
|
|--- f2 (ino 257)
|--- f3 (ino 258)
|--- f4 (ino 259)
|--- f5 (ino 258)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, inode 258 is orphanized (renamed to
"o258-7-0"), because its current reference has the same name as the
new reference for inode 257;
2) When processing inode 258, we iterate over all its new references,
which have the names "f3" and "f5". The first iteration sees name
"f5" and renames the inode from its orphan name ("o258-7-0") to
"f5", while the second iteration sees the name "f3" and, incorrectly,
issues a link operation with a target name matching the orphan name,
which no longer exists. The first iteration had reset the current
valid path of the inode to "f5", but in the second iteration we lost
it because we found another inode, with a higher number of 259, which
has a reference named "f3" as well, so we orphanized inode 259 and
recomputed the current valid path of inode 258 to its old orphan
name because inode 259 could be an ancestor of inode 258 and therefore
the current valid path could contain the pre-orphanization name of
inode 259. However in this case inode 259 is not an ancestor of inode
258 so the current valid path should not be recomputed.
This makes the receiver fail with the following error:
ERROR: link f3 -> o258-7-0 failed: No such file or directory
So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-07 18:41:29 +08:00
|
|
|
fs_path_reset(valid_path);
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino,
|
|
|
|
sctx->cur_inode_gen,
|
|
|
|
valid_path);
|
|
|
|
}
|
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that
have the same name. This happens when the same parent directory has
different child directories with the same name in the parent and send
snapshots.
For example, consider the following scenario:
Parent snapshot:
. (ino 256)
|---- d/ (ino 257)
| |--- p1/ (ino 258)
|
|---- p1/ (ino 259)
Send snapshot:
. (ino 256)
|--- d/ (ino 257)
|--- p1/ (ino 259)
|--- p1/ (ino 258)
The directory named "d" (inode 257) has in both snapshots an entry with
the name "p1" but it refers to different inodes in both snapshots (inode
258 in the parent snapshot and inode 259 in the send snapshot). When
attempting to move inode 258, the operation is delayed because its new
parent, inode 259, was not yet moved/renamed (as the stream is currently
processing inode 258). Then when processing inode 259, we also end up
delaying its move/rename operation so that it happens after inode 258 is
moved/renamed. This decision to delay the move/rename rename operation
of inode 259 is due to the fact that the new parent inode (257) still
has inode 258 as its child, which has the same name has inode 259. So
we end up with inode 258 move/rename operation waiting for inode's 259
move/rename operation, which in turn it waiting for inode's 258
move/rename. This results in ending the send stream without issuing
move/rename operations for inodes 258 and 259 and generating the
following warnings in syslog/dmesg:
[148402.979747] ------------[ cut here ]------------
[148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
[148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148402.988136] Call Trace:
[148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
[148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.011373] ---[ end trace a4539270c8056f8b ]---
[148403.012296] ------------[ cut here ]------------
[148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1
[148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
[148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
[148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
[148403.020104] Call Trace:
[148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd
[148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
[148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
[148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
[148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
[148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
[148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
[148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77
[148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
[148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
[148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
[148403.038981] ---[ end trace a4539270c8056f8c ]---
There's another issue caused by similar (but more complex) changes in the
directory hierarchy that makes move/rename operations fail, described with
the following example:
Parent snapshot:
.
|---- a/ (ino 262)
| |---- c/ (ino 268)
|
|---- d/ (ino 263)
|---- ance/ (ino 267)
|---- e/ (ino 264)
|---- f/ (ino 265)
|---- ance/ (ino 266)
Send snapshot:
.
|---- a/ (ino 262)
|---- c/ (ino 268)
| |---- ance/ (ino 267)
|
|---- d/ (ino 263)
| |---- ance/ (ino 266)
|
|---- f/ (ino 265)
|---- e/ (ino 264)
When the inode 265 is processed, the path for inode 267 is computed, which
at that time corresponds to "d/ance", and it's stored in the names cache.
Later on when processing inode 266, we end up orphanizing (renaming to a
name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
the same name as inode 266 and it's currently a child of the new parent
directory (inode 263) for inode 266. After the orphanization and while we
are still processing inode 266, a rename operation for inode 266 is
generated. However the source path for that rename operation is incorrect
because it ends up using the old, pre-orphanization, name of inode 267.
The no longer valid name for inode 267 was previously cached when
processing inode 265 and it remains usable and considered valid until
the inode currently being processed has a number greater than 267.
This resulted in the receiving side failing with the following error:
ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
So fix these issues by detecting such circular dependencies for rename
operations and by clearing the cached name of an inode once the inode
is orphanized.
A test case for fstests will follow soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Rewrote change log to be more detailed and organized, and improved
comments]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-23 18:39:46 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
} else {
|
btrfs: send: fix invalid path for unlink operations after parent orphanization
During an incremental send operation, when processing the new references
for the current inode, we might send an unlink operation for another inode
that has a conflicting path and has more than one hard link. However this
path was computed and cached before we processed previous new references
for the current inode. We may have orphanized a directory of that path
while processing a previous new reference, in which case the path will
be invalid and cause the receiver process to fail.
The following reproducer triggers the problem and explains how/why it
happens in its comments:
$ cat test-send-unlink.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
# Create our test files and directory. Inode 259 (file3) has two hard
# links.
touch $MNT/file1
touch $MNT/file2
touch $MNT/file3
mkdir $MNT/A
ln $MNT/file3 $MNT/A/hard_link
# Filesystem looks like:
#
# . (ino 256)
# |----- file1 (ino 257)
# |----- file2 (ino 258)
# |----- file3 (ino 259)
# |----- A/ (ino 260)
# |---- hard_link (ino 259)
#
# Now create the base snapshot, which is going to be the parent snapshot
# for a later incremental send.
btrfs subvolume snapshot -r $MNT $MNT/snap1
btrfs send -f /tmp/snap1.send $MNT/snap1
# Move inode 257 into directory inode 260. This results in computing the
# path for inode 260 as "/A" and caching it.
mv $MNT/file1 $MNT/A/file1
# Move inode 258 (file2) into directory inode 260, with a name of
# "hard_link", moving first inode 259 away since it currently has that
# location and name.
mv $MNT/A/hard_link $MNT/tmp
mv $MNT/file2 $MNT/A/hard_link
# Now rename inode 260 to something else (B for example) and then create
# a hard link for inode 258 that has the old name and location of inode
# 260 ("/A").
mv $MNT/A $MNT/B
ln $MNT/B/hard_link $MNT/A
# Filesystem now looks like:
#
# . (ino 256)
# |----- tmp (ino 259)
# |----- file3 (ino 259)
# |----- B/ (ino 260)
# | |---- file1 (ino 257)
# | |---- hard_link (ino 258)
# |
# |----- A (ino 258)
# Create another snapshot of our subvolume and use it for an incremental
# send.
btrfs subvolume snapshot -r $MNT $MNT/snap2
btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
# Now unmount the filesystem, create a new one, mount it and try to
# apply both send streams to recreate both snapshots.
umount $DEV
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
# First add the first snapshot to the new filesystem by applying the
# first send stream.
btrfs receive -f /tmp/snap1.send $MNT
# The incremental receive operation below used to fail with the
# following error:
#
# ERROR: unlink A/hard_link failed: No such file or directory
#
# This is because when send is processing inode 257, it generates the
# path for inode 260 as "/A", since that inode is its parent in the send
# snapshot, and caches that path.
#
# Later when processing inode 258, it first processes its new reference
# that has the path of "/A", which results in orphanizing inode 260
# because there is a a path collision. This results in issuing a rename
# operation from "/A" to "/o260-6-0".
#
# Finally when processing the new reference "B/hard_link" for inode 258,
# it notices that it collides with inode 259 (not yet processed, because
# it has a higher inode number), since that inode has the name
# "hard_link" under the directory inode 260. It also checks that inode
# 259 has two hardlinks, so it decides to issue a unlink operation for
# the name "hard_link" for inode 259. However the path passed to the
# unlink operation is "/A/hard_link", which is incorrect since currently
# "/A" does not exists, due to the orphanization of inode 260 mentioned
# before. The path is incorrect because it was computed and cached
# before the orphanization. This results in the receiver to fail with
# the above error.
btrfs receive -f /tmp/snap2.send $MNT
umount $MNT
When running the test, it fails like this:
$ ./test-send-unlink.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: unlink A/hard_link failed: No such file or directory
Fix this by recomputing a path before issuing an unlink operation when
processing the new references for the current inode if we previously
have orphanized a directory.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-09 18:25:03 +08:00
|
|
|
/*
|
|
|
|
* If we previously orphanized a directory that
|
|
|
|
* collided with a new reference that we already
|
|
|
|
* processed, recompute the current path because
|
|
|
|
* that directory may be part of the path.
|
|
|
|
*/
|
|
|
|
if (orphanized_dir) {
|
|
|
|
ret = refresh_ref_path(sctx, cur);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = send_unlink(sctx, cur->full_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.
The following reproducer triggers such scenario:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/a
touch /mnt/sdi/b
mkdir /mnt/sdi/testdir
# We want "a" to have a lower inode number then "testdir" (257 vs 259).
mv /mnt/sdi/a /mnt/sdi/testdir/a
# Filesystem looks like:
#
# . (ino 256)
# |----- testdir/ (ino 259)
# | |----- a (ino 257)
# |
# |----- b (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now rename 259 to "testdir_2", then change the name of 257 to
# "testdir" and make it a direct descendant of the root inode (256).
# Also create a new link for inode 257 with the old name of inode 258.
# By swapping the names and location of several inodes and create a
# nasty dependency chain of rename and link operations.
mv /mnt/sdi/testdir/a /mnt/sdi/a2
touch /mnt/sdi/testdir/a
mv /mnt/sdi/b /mnt/sdi/b2
ln /mnt/sdi/a2 /mnt/sdi/b
mv /mnt/sdi/testdir /mnt/sdi/testdir_2
mv /mnt/sdi/a2 /mnt/sdi/testdir
# Filesystem now looks like:
#
# . (ino 256)
# |----- testdir_2/ (ino 259)
# | |----- a (ino 260)
# |
# |----- testdir (ino 257)
# |----- b (ino 257)
# |----- b2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When running the reproducer, the receive of the incremental send stream
fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: link b -> o259-6-0/a failed: No such file or directory
The problem happens because of the following:
1) Before we start iterating the list of new references for inode 257,
we generate its current path and store it at @valid_path, done at
the very beginning of process_recorded_refs(). The generated path
is "o259-6-0/a", containing the orphanized name for inode 259;
2) Then we iterate over the list of new references, which has the
references "b" and "testdir" in that specific order;
3) We process reference "b" first, because it is in the list before
reference "testdir". We then issue a link operation to create
the new reference "b" using a target path corresponding to the
content at @valid_path, which corresponds to "o259-6-0/a".
However we haven't yet orphanized inode 259, its name is still
"testdir", and not "o259-6-0". The orphanization of 259 did not
happen yet because we will process the reference named "testdir"
for inode 257 only in the next iteration of the loop that goes
over the list of new references.
Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 21:13:29 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry(cur, &sctx->new_refs, list) {
|
|
|
|
/*
|
|
|
|
* We may have refs where the parent directory does not exist
|
|
|
|
* yet. This happens if the parent directories inum is higher
|
|
|
|
* than the current inum. To handle this case, we create the
|
|
|
|
* parent directory out of order. But we need to check if this
|
|
|
|
* did already happen before due to other refs in the same dir.
|
|
|
|
*/
|
2023-01-11 19:36:05 +08:00
|
|
|
ret = get_cur_inode_state(sctx, cur->dir, cur->dir_gen, NULL, NULL);
|
btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.
The following reproducer triggers such scenario:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/a
touch /mnt/sdi/b
mkdir /mnt/sdi/testdir
# We want "a" to have a lower inode number then "testdir" (257 vs 259).
mv /mnt/sdi/a /mnt/sdi/testdir/a
# Filesystem looks like:
#
# . (ino 256)
# |----- testdir/ (ino 259)
# | |----- a (ino 257)
# |
# |----- b (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now rename 259 to "testdir_2", then change the name of 257 to
# "testdir" and make it a direct descendant of the root inode (256).
# Also create a new link for inode 257 with the old name of inode 258.
# By swapping the names and location of several inodes and create a
# nasty dependency chain of rename and link operations.
mv /mnt/sdi/testdir/a /mnt/sdi/a2
touch /mnt/sdi/testdir/a
mv /mnt/sdi/b /mnt/sdi/b2
ln /mnt/sdi/a2 /mnt/sdi/b
mv /mnt/sdi/testdir /mnt/sdi/testdir_2
mv /mnt/sdi/a2 /mnt/sdi/testdir
# Filesystem now looks like:
#
# . (ino 256)
# |----- testdir_2/ (ino 259)
# | |----- a (ino 260)
# |
# |----- testdir (ino 257)
# |----- b (ino 257)
# |----- b2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When running the reproducer, the receive of the incremental send stream
fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: link b -> o259-6-0/a failed: No such file or directory
The problem happens because of the following:
1) Before we start iterating the list of new references for inode 257,
we generate its current path and store it at @valid_path, done at
the very beginning of process_recorded_refs(). The generated path
is "o259-6-0/a", containing the orphanized name for inode 259;
2) Then we iterate over the list of new references, which has the
references "b" and "testdir" in that specific order;
3) We process reference "b" first, because it is in the list before
reference "testdir". We then issue a link operation to create
the new reference "b" using a target path corresponding to the
content at @valid_path, which corresponds to "o259-6-0/a".
However we haven't yet orphanized inode 259, its name is still
"testdir", and not "o259-6-0". The orphanization of 259 did not
happen yet because we will process the reference named "testdir"
for inode 257 only in the next iteration of the loop that goes
over the list of new references.
Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 21:13:29 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret == inode_state_will_create) {
|
|
|
|
ret = 0;
|
|
|
|
/*
|
|
|
|
* First check if any of the current inodes refs did
|
|
|
|
* already create the dir.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(cur2, &sctx->new_refs, list) {
|
|
|
|
if (cur == cur2)
|
|
|
|
break;
|
|
|
|
if (cur2->dir == cur->dir) {
|
|
|
|
ret = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If that did not happen, check if a previous inode
|
|
|
|
* did already create the dir.
|
|
|
|
*/
|
|
|
|
if (!ret)
|
|
|
|
ret = did_create_dir(sctx, cur->dir);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (!ret) {
|
|
|
|
ret = send_create_inode(sctx, cur->dir);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2023-01-11 19:36:15 +08:00
|
|
|
cache_dir_created(sctx, cur->dir);
|
btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.
The following reproducer triggers such scenario:
$ cat reproducer.sh
#!/bin/bash
mkfs.btrfs -f /dev/sdi >/dev/null
mount /dev/sdi /mnt/sdi
touch /mnt/sdi/a
touch /mnt/sdi/b
mkdir /mnt/sdi/testdir
# We want "a" to have a lower inode number then "testdir" (257 vs 259).
mv /mnt/sdi/a /mnt/sdi/testdir/a
# Filesystem looks like:
#
# . (ino 256)
# |----- testdir/ (ino 259)
# | |----- a (ino 257)
# |
# |----- b (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
# Now rename 259 to "testdir_2", then change the name of 257 to
# "testdir" and make it a direct descendant of the root inode (256).
# Also create a new link for inode 257 with the old name of inode 258.
# By swapping the names and location of several inodes and create a
# nasty dependency chain of rename and link operations.
mv /mnt/sdi/testdir/a /mnt/sdi/a2
touch /mnt/sdi/testdir/a
mv /mnt/sdi/b /mnt/sdi/b2
ln /mnt/sdi/a2 /mnt/sdi/b
mv /mnt/sdi/testdir /mnt/sdi/testdir_2
mv /mnt/sdi/a2 /mnt/sdi/testdir
# Filesystem now looks like:
#
# . (ino 256)
# |----- testdir_2/ (ino 259)
# | |----- a (ino 260)
# |
# |----- testdir (ino 257)
# |----- b (ino 257)
# |----- b2 (ino 258)
btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
mkfs.btrfs -f /dev/sdj >/dev/null
mount /dev/sdj /mnt/sdj
btrfs receive -f /tmp/snap1.send /mnt/sdj
btrfs receive -f /tmp/snap2.send /mnt/sdj
umount /mnt/sdi
umount /mnt/sdj
When running the reproducer, the receive of the incremental send stream
fails:
$ ./reproducer.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: link b -> o259-6-0/a failed: No such file or directory
The problem happens because of the following:
1) Before we start iterating the list of new references for inode 257,
we generate its current path and store it at @valid_path, done at
the very beginning of process_recorded_refs(). The generated path
is "o259-6-0/a", containing the orphanized name for inode 259;
2) Then we iterate over the list of new references, which has the
references "b" and "testdir" in that specific order;
3) We process reference "b" first, because it is in the list before
reference "testdir". We then issue a link operation to create
the new reference "b" using a target path corresponding to the
content at @valid_path, which corresponds to "o259-6-0/a".
However we haven't yet orphanized inode 259, its name is still
"testdir", and not "o259-6-0". The orphanization of 259 did not
happen yet because we will process the reference named "testdir"
for inode 257 only in the next iteration of the loop that goes
over the list of new references.
Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 21:13:29 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-03-01 06:29:22 +08:00
|
|
|
if (S_ISDIR(sctx->cur_inode_mode) && sctx->parent_root) {
|
|
|
|
ret = wait_for_dest_dir_move(sctx, cur, is_orphan);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret == 1) {
|
|
|
|
can_rename = false;
|
|
|
|
*pending_move = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
if (S_ISDIR(sctx->cur_inode_mode) && sctx->parent_root &&
|
|
|
|
can_rename) {
|
|
|
|
ret = wait_for_parent_move(sctx, cur, is_orphan);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret == 1) {
|
|
|
|
can_rename = false;
|
|
|
|
*pending_move = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* link/move the ref to the new place. If we have an orphan
|
|
|
|
* inode, move it and update valid_path. If not, link or move
|
|
|
|
* it depending on the inode mode.
|
|
|
|
*/
|
2015-03-01 06:29:22 +08:00
|
|
|
if (is_orphan && can_rename) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = send_rename(sctx, valid_path, cur->full_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
is_orphan = 0;
|
|
|
|
ret = fs_path_copy(valid_path, cur->full_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2015-03-01 06:29:22 +08:00
|
|
|
} else if (can_rename) {
|
2012-07-26 05:19:24 +08:00
|
|
|
if (S_ISDIR(sctx->cur_inode_mode)) {
|
|
|
|
/*
|
|
|
|
* Dirs can't be linked, so move it. For moved
|
|
|
|
* dirs, we always have one new and one deleted
|
|
|
|
* ref. The deleted ref is ignored later.
|
|
|
|
*/
|
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-04-09 21:09:14 +08:00
|
|
|
ret = send_rename(sctx, valid_path,
|
|
|
|
cur->full_path);
|
|
|
|
if (!ret)
|
|
|
|
ret = fs_path_copy(valid_path,
|
|
|
|
cur->full_path);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
} else {
|
Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-06-23 03:03:51 +08:00
|
|
|
/*
|
|
|
|
* We might have previously orphanized an inode
|
|
|
|
* which is an ancestor of our current inode,
|
|
|
|
* so our reference's full path, which was
|
|
|
|
* computed before any such orphanizations, must
|
|
|
|
* be updated.
|
|
|
|
*/
|
|
|
|
if (orphanized_dir) {
|
|
|
|
ret = update_ref_path(sctx, cur);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = send_link(sctx, cur->full_path,
|
|
|
|
valid_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
2013-08-17 04:52:55 +08:00
|
|
|
ret = dup_ref(cur, &check_dirs);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (S_ISDIR(sctx->cur_inode_mode) && sctx->cur_inode_deleted) {
|
|
|
|
/*
|
|
|
|
* Check if we can already rmdir the directory. If not,
|
|
|
|
* orphanize it. For every dir item inside that gets deleted
|
|
|
|
* later, we do this check again and rmdir it then if possible.
|
|
|
|
* See the use of check_dirs for more details.
|
|
|
|
*/
|
2023-01-11 19:36:06 +08:00
|
|
|
ret = can_rmdir(sctx, sctx->cur_ino, sctx->cur_inode_gen);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
|
|
|
ret = send_rmdir(sctx, valid_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
} else if (!is_orphan) {
|
|
|
|
ret = orphanize_inode(sctx, sctx->cur_ino,
|
|
|
|
sctx->cur_inode_gen, valid_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
is_orphan = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry(cur, &sctx->deleted_refs, list) {
|
2013-08-17 04:52:55 +08:00
|
|
|
ret = dup_ref(cur, &check_dirs);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-28 17:46:29 +08:00
|
|
|
} else if (S_ISDIR(sctx->cur_inode_mode) &&
|
|
|
|
!list_empty(&sctx->deleted_refs)) {
|
|
|
|
/*
|
|
|
|
* We have a moved dir. Add the old parent to check_dirs
|
|
|
|
*/
|
|
|
|
cur = list_entry(sctx->deleted_refs.next, struct recorded_ref,
|
|
|
|
list);
|
2013-08-17 04:52:55 +08:00
|
|
|
ret = dup_ref(cur, &check_dirs);
|
2012-07-28 17:46:29 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
} else if (!S_ISDIR(sctx->cur_inode_mode)) {
|
|
|
|
/*
|
|
|
|
* We have a non dir inode. Go through all deleted refs and
|
|
|
|
* unlink them if they were not already overwritten by other
|
|
|
|
* inodes.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(cur, &sctx->deleted_refs, list) {
|
|
|
|
ret = did_overwrite_ref(sctx, cur->dir, cur->dir_gen,
|
|
|
|
sctx->cur_ino, sctx->cur_inode_gen,
|
|
|
|
cur->name, cur->name_len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (!ret) {
|
Btrfs: incremental send, fix invalid path for unlink commands
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- file1 (ino 259)
| |--- file3 (ino 261)
|
|--- dir3/ (ino 262)
|--- file22 (ino 260)
|--- dir4/ (ino 263)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
|--- dir3 (ino 260)
|--- file3/ (ino 262)
|--- dir4/ (ino 263)
|--- file11 (ino 269)
|--- file33 (ino 261)
When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:
receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
utimes
utimes dir1
utimes dir1/dir2
link dir1/dir3/dir4/file11 -> dir1/dir2/file1
unlink dir1/dir2/file1
utimes dir1/dir2
truncate dir1/dir3/dir4/file11 size=0
utimes dir1/dir3/dir4/file11
rename dir1/dir3 -> o262-7-0
link dir1/dir3 -> o262-7-0/file22
unlink dir1/dir3/file22
ERROR: unlink dir1/dir3/file22 failed. Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) Before we start processing the new and deleted references for inode
260, we compute the full path of the deleted reference
("dir1/dir3/file22") and cache it in the list of deleted references
for our inode.
2) We then start processing the new references for inode 260, for which
there is only one new, located at "dir1/dir3". When processing this
new reference, we check that inode 262, which was not yet processed,
collides with the new reference and because of that we orphanize
inode 262 so its new full path becomes "o262-7-0".
3) After the orphanization of inode 262, we create the new reference for
inode 260 by issuing a link command with a target path of "dir1/dir3"
and a source path of "o262-7-0/file22".
4) We then start processing the deleted references for inode 260, for
which there is only one with the base name of "file22", and issue
an unlink operation containing the target path computed at step 1,
which is wrong because that path no longer exists and should be
replaced with "o262-7-0/file22".
So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-13 21:13:11 +08:00
|
|
|
/*
|
|
|
|
* If we orphanized any ancestor before, we need
|
|
|
|
* to recompute the full path for deleted names,
|
|
|
|
* since any such path was computed before we
|
|
|
|
* processed any references and orphanized any
|
|
|
|
* ancestor inode.
|
|
|
|
*/
|
|
|
|
if (orphanized_ancestor) {
|
Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-06-23 03:03:51 +08:00
|
|
|
ret = update_ref_path(sctx, cur);
|
|
|
|
if (ret < 0)
|
Btrfs: incremental send, fix invalid path for unlink commands
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- file1 (ino 259)
| |--- file3 (ino 261)
|
|--- dir3/ (ino 262)
|--- file22 (ino 260)
|--- dir4/ (ino 263)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
|--- dir3 (ino 260)
|--- file3/ (ino 262)
|--- dir4/ (ino 263)
|--- file11 (ino 269)
|--- file33 (ino 261)
When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:
receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
utimes
utimes dir1
utimes dir1/dir2
link dir1/dir3/dir4/file11 -> dir1/dir2/file1
unlink dir1/dir2/file1
utimes dir1/dir2
truncate dir1/dir3/dir4/file11 size=0
utimes dir1/dir3/dir4/file11
rename dir1/dir3 -> o262-7-0
link dir1/dir3 -> o262-7-0/file22
unlink dir1/dir3/file22
ERROR: unlink dir1/dir3/file22 failed. Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) Before we start processing the new and deleted references for inode
260, we compute the full path of the deleted reference
("dir1/dir3/file22") and cache it in the list of deleted references
for our inode.
2) We then start processing the new references for inode 260, for which
there is only one new, located at "dir1/dir3". When processing this
new reference, we check that inode 262, which was not yet processed,
collides with the new reference and because of that we orphanize
inode 262 so its new full path becomes "o262-7-0".
3) After the orphanization of inode 262, we create the new reference for
inode 260 by issuing a link command with a target path of "dir1/dir3"
and a source path of "o262-7-0/file22".
4) We then start processing the deleted references for inode 260, for
which there is only one with the base name of "file22", and issue
an unlink operation containing the target path computed at step 1,
which is wrong because that path no longer exists and should be
replaced with "o262-7-0/file22".
So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-13 21:13:11 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2012-07-28 16:42:24 +08:00
|
|
|
ret = send_unlink(sctx, cur->full_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
2013-08-17 04:52:55 +08:00
|
|
|
ret = dup_ref(cur, &check_dirs);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If the inode is still orphan, unlink the orphan. This may
|
|
|
|
* happen when a previous inode did overwrite the first ref
|
|
|
|
* of this inode and no new refs were added for the current
|
2012-07-28 20:11:31 +08:00
|
|
|
* inode. Unlinking does not mean that the inode is deleted in
|
|
|
|
* all cases. There may still be links to this inode in other
|
|
|
|
* places.
|
2012-07-26 05:19:24 +08:00
|
|
|
*/
|
2012-07-28 16:42:24 +08:00
|
|
|
if (is_orphan) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = send_unlink(sctx, valid_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We did collect all parent dirs where cur_inode was once located. We
|
|
|
|
* now go through all these dirs and check if they are pending for
|
|
|
|
* deletion and if it's finally possible to perform the rmdir now.
|
|
|
|
* We also update the inode stats of the parent dirs here.
|
|
|
|
*/
|
2013-08-17 04:52:55 +08:00
|
|
|
list_for_each_entry(cur, &check_dirs, list) {
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* In case we had refs into dirs that were not processed yet,
|
|
|
|
* we don't need to do the utime and rmdir logic for these dirs.
|
|
|
|
* The dir will be processed later.
|
|
|
|
*/
|
2013-08-17 04:52:55 +08:00
|
|
|
if (cur->dir > sctx->cur_ino)
|
2012-07-26 05:19:24 +08:00
|
|
|
continue;
|
|
|
|
|
2023-01-11 19:36:05 +08:00
|
|
|
ret = get_cur_inode_state(sctx, cur->dir, cur->dir_gen, NULL, NULL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (ret == inode_state_did_create ||
|
|
|
|
ret == inode_state_no_change) {
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
ret = cache_dir_utimes(sctx, cur->dir, cur->dir_gen);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2014-02-17 05:01:39 +08:00
|
|
|
} else if (ret == inode_state_did_delete &&
|
|
|
|
cur->dir != last_dir_ino_rm) {
|
2023-01-11 19:36:06 +08:00
|
|
|
ret = can_rmdir(sctx, cur->dir, cur->dir_gen);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
2013-08-17 04:52:55 +08:00
|
|
|
ret = get_cur_path(sctx, cur->dir,
|
|
|
|
cur->dir_gen, valid_path);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = send_rmdir(sctx, valid_path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2014-02-17 05:01:39 +08:00
|
|
|
last_dir_ino_rm = cur->dir;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
2013-08-17 04:52:55 +08:00
|
|
|
__free_recorded_refs(&check_dirs);
|
2012-07-26 05:19:24 +08:00
|
|
|
free_recorded_refs(sctx);
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(valid_path);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
static int rbtree_ref_comp(const void *k, const struct rb_node *node)
|
|
|
|
{
|
|
|
|
const struct recorded_ref *data = k;
|
|
|
|
const struct recorded_ref *ref = rb_entry(node, struct recorded_ref, node);
|
|
|
|
int result;
|
|
|
|
|
|
|
|
if (data->dir > ref->dir)
|
|
|
|
return 1;
|
|
|
|
if (data->dir < ref->dir)
|
|
|
|
return -1;
|
|
|
|
if (data->dir_gen > ref->dir_gen)
|
|
|
|
return 1;
|
|
|
|
if (data->dir_gen < ref->dir_gen)
|
|
|
|
return -1;
|
|
|
|
if (data->name_len > ref->name_len)
|
|
|
|
return 1;
|
|
|
|
if (data->name_len < ref->name_len)
|
|
|
|
return -1;
|
|
|
|
result = strcmp(data->name, ref->name);
|
|
|
|
if (result > 0)
|
|
|
|
return 1;
|
|
|
|
if (result < 0)
|
|
|
|
return -1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool rbtree_ref_less(struct rb_node *node, const struct rb_node *parent)
|
|
|
|
{
|
|
|
|
const struct recorded_ref *entry = rb_entry(node, struct recorded_ref, node);
|
|
|
|
|
|
|
|
return rbtree_ref_comp(entry, parent) < 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int record_ref_in_tree(struct rb_root *root, struct list_head *refs,
|
|
|
|
struct fs_path *name, u64 dir, u64 dir_gen,
|
|
|
|
struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *path = NULL;
|
|
|
|
struct recorded_ref *ref = NULL;
|
|
|
|
|
|
|
|
path = fs_path_alloc();
|
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ref = recorded_ref_alloc();
|
|
|
|
if (!ref) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, dir, dir_gen, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = fs_path_add_path(path, name);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ref->dir = dir;
|
|
|
|
ref->dir_gen = dir_gen;
|
|
|
|
set_ref_path(ref, path);
|
|
|
|
list_add_tail(&ref->list, refs);
|
|
|
|
rb_add(&ref->node, root, rbtree_ref_less);
|
|
|
|
ref->root = root;
|
|
|
|
out:
|
|
|
|
if (ret) {
|
|
|
|
if (path && (!ref || !ref->full_path))
|
|
|
|
fs_path_free(path);
|
|
|
|
recorded_ref_free(ref);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int record_new_ref_if_needed(int num, u64 dir, int index,
|
|
|
|
struct fs_path *name, void *ctx)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct send_ctx *sctx = ctx;
|
|
|
|
struct rb_node *node = NULL;
|
|
|
|
struct recorded_ref data;
|
|
|
|
struct recorded_ref *ref;
|
|
|
|
u64 dir_gen;
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(sctx->send_root, dir, &dir_gen);
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
data.dir = dir;
|
|
|
|
data.dir_gen = dir_gen;
|
|
|
|
set_ref_path(&data, name);
|
|
|
|
node = rb_find(&data, &sctx->rbtree_deleted_refs, rbtree_ref_comp);
|
|
|
|
if (node) {
|
|
|
|
ref = rb_entry(node, struct recorded_ref, node);
|
|
|
|
recorded_ref_free(ref);
|
|
|
|
} else {
|
|
|
|
ret = record_ref_in_tree(&sctx->rbtree_new_refs,
|
|
|
|
&sctx->new_refs, name, dir, dir_gen,
|
|
|
|
sctx);
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int record_deleted_ref_if_needed(int num, u64 dir, int index,
|
|
|
|
struct fs_path *name, void *ctx)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct send_ctx *sctx = ctx;
|
|
|
|
struct rb_node *node = NULL;
|
|
|
|
struct recorded_ref data;
|
|
|
|
struct recorded_ref *ref;
|
|
|
|
u64 dir_gen;
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(sctx->parent_root, dir, &dir_gen);
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
data.dir = dir;
|
|
|
|
data.dir_gen = dir_gen;
|
|
|
|
set_ref_path(&data, name);
|
|
|
|
node = rb_find(&data, &sctx->rbtree_new_refs, rbtree_ref_comp);
|
|
|
|
if (node) {
|
|
|
|
ref = rb_entry(node, struct recorded_ref, node);
|
|
|
|
recorded_ref_free(ref);
|
|
|
|
} else {
|
|
|
|
ret = record_ref_in_tree(&sctx->rbtree_deleted_refs,
|
|
|
|
&sctx->deleted_refs, name, dir,
|
|
|
|
dir_gen, sctx);
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
static int record_new_ref(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_inode_ref(sctx->send_root, sctx->left_path,
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
sctx->cmp_key, 0, record_new_ref_if_needed, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int record_deleted_ref(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_inode_ref(sctx->parent_root, sctx->right_path,
|
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example for having duplicated file paths in both list:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, Inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And as the processing
order mentioned about before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one reason, if we do so, we may unlink all the existing paths
first, there's no valid path anymore for links. And it's inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes less sense to have duplications in both lists. A
reference path cannot not only be regarded as new but also has been seen in
the past, or we won't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-12 09:36:32 +08:00
|
|
|
sctx->cmp_key, 0, record_deleted_ref_if_needed,
|
|
|
|
sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int record_changed_ref(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_inode_ref(sctx->send_root, sctx->left_path,
|
2022-07-12 23:31:22 +08:00
|
|
|
sctx->cmp_key, 0, record_new_ref_if_needed, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_inode_ref(sctx->parent_root, sctx->right_path,
|
2022-07-12 23:31:22 +08:00
|
|
|
sctx->cmp_key, 0, record_deleted_ref_if_needed, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Record and process all refs at once. Needed when an inode changes the
|
|
|
|
* generation number, which means that it was deleted and recreated.
|
|
|
|
*/
|
|
|
|
static int process_all_refs(struct send_ctx *sctx,
|
|
|
|
enum btrfs_compare_tree_result cmd)
|
|
|
|
{
|
2022-03-09 21:50:46 +08:00
|
|
|
int ret = 0;
|
|
|
|
int iter_ret = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_root *root;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
iterate_inode_ref_t cb;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
int pending_move = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
if (cmd == BTRFS_COMPARE_TREE_NEW) {
|
|
|
|
root = sctx->send_root;
|
2022-07-12 23:31:22 +08:00
|
|
|
cb = record_new_ref_if_needed;
|
2012-07-26 05:19:24 +08:00
|
|
|
} else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
|
|
|
|
root = sctx->parent_root;
|
2022-07-12 23:31:22 +08:00
|
|
|
cb = record_deleted_ref_if_needed;
|
2012-07-26 05:19:24 +08:00
|
|
|
} else {
|
2014-02-04 02:24:19 +08:00
|
|
|
btrfs_err(sctx->send_root->fs_info,
|
|
|
|
"Wrong command %d in process_all_refs", cmd);
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = sctx->cmp_key->objectid;
|
|
|
|
key.type = BTRFS_INODE_REF_KEY;
|
|
|
|
key.offset = 0;
|
2022-03-09 21:50:46 +08:00
|
|
|
btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) {
|
2012-07-26 05:19:24 +08:00
|
|
|
if (found_key.objectid != key.objectid ||
|
2012-10-15 16:30:45 +08:00
|
|
|
(found_key.type != BTRFS_INODE_REF_KEY &&
|
|
|
|
found_key.type != BTRFS_INODE_EXTREF_KEY))
|
2012-07-26 05:19:24 +08:00
|
|
|
break;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_inode_ref(root, path, &found_key, 0, cb, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2022-03-09 21:50:46 +08:00
|
|
|
}
|
|
|
|
/* Catch error found during iteration */
|
|
|
|
if (iter_ret < 0) {
|
|
|
|
ret = iter_ret;
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
2012-07-28 22:33:49 +08:00
|
|
|
btrfs_release_path(path);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2016-08-24 23:57:52 +08:00
|
|
|
/*
|
|
|
|
* We don't actually care about pending_move as we are simply
|
|
|
|
* re-creating this inode and will be rename'ing it into place once we
|
|
|
|
* rename the parent directory.
|
|
|
|
*/
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
ret = process_recorded_refs(sctx, &pending_move);
|
2012-07-26 05:19:24 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int send_set_xattr(struct send_ctx *sctx,
|
|
|
|
struct fs_path *path,
|
|
|
|
const char *name, int name_len,
|
|
|
|
const char *data, int data_len)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_SET_XATTR);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
|
|
|
|
TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len);
|
|
|
|
TLV_PUT(sctx, BTRFS_SEND_A_XATTR_DATA, data, data_len);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int send_remove_xattr(struct send_ctx *sctx,
|
|
|
|
struct fs_path *path,
|
|
|
|
const char *name, int name_len)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_REMOVE_XATTR);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
|
|
|
|
TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __process_new_xattr(int num, struct btrfs_key *di_key,
|
2021-11-05 08:00:13 +08:00
|
|
|
const char *name, int name_len, const char *data,
|
|
|
|
int data_len, void *ctx)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct send_ctx *sctx = ctx;
|
|
|
|
struct fs_path *p;
|
2016-09-27 19:03:22 +08:00
|
|
|
struct posix_acl_xattr_header dummy_acl;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: send: emit file capabilities after chown
Whenever a chown is executed, all capabilities of the file being touched
are lost. When doing incremental send with a file with capabilities,
there is a situation where the capability can be lost on the receiving
side. The sequence of actions bellow shows the problem:
$ mount /dev/sda fs1
$ mount /dev/sdb fs2
$ touch fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_init
$ btrfs send fs1/snap_init | btrfs receive fs2
$ chgrp adm fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_complete
$ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
$ btrfs send fs1/snap_complete | btrfs receive fs2
$ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
At this point, only a chown was emitted by "btrfs send" since only the
group was changed. This makes the cap_sys_nice capability to be dropped
from fs2/snap_incremental/foo.bar
To fix that, only emit capabilities after chown is emitted. The current
code first checks for xattrs that are new/changed, emits them, and later
emit the chown. Now, __process_new_xattr skips capabilities, letting
only finish_inode_if_needed to emit them, if they exist, for the inode
being processed.
This behavior was being worked around in "btrfs receive" side by caching
the capability and only applying it after chown. Now, xattrs are only
emmited _after_ chown, making that workaround not needed anymore.
Link: https://github.com/kdave/btrfs-progs/issues/202
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-11 10:15:07 +08:00
|
|
|
/* Capabilities are emitted by finish_inode_if_needed */
|
|
|
|
if (!strncmp(name, XATTR_NAME_CAPS, name_len))
|
|
|
|
return 0;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
/*
|
2016-05-20 09:18:45 +08:00
|
|
|
* This hack is needed because empty acls are stored as zero byte
|
2012-07-26 05:19:24 +08:00
|
|
|
* data in xattrs. Problem with that is, that receiving these zero byte
|
2016-05-20 09:18:45 +08:00
|
|
|
* acls will fail later. To fix this, we send a dummy acl list that
|
2012-07-26 05:19:24 +08:00
|
|
|
* only contains the version number and no entries.
|
|
|
|
*/
|
|
|
|
if (!strncmp(name, XATTR_NAME_POSIX_ACL_ACCESS, name_len) ||
|
|
|
|
!strncmp(name, XATTR_NAME_POSIX_ACL_DEFAULT, name_len)) {
|
|
|
|
if (data_len == 0) {
|
|
|
|
dummy_acl.a_version =
|
|
|
|
cpu_to_le32(POSIX_ACL_XATTR_VERSION);
|
|
|
|
data = (char *)&dummy_acl;
|
|
|
|
data_len = sizeof(dummy_acl);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = send_set_xattr(sctx, p, name, name_len, data, data_len);
|
|
|
|
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __process_deleted_xattr(int num, struct btrfs_key *di_key,
|
|
|
|
const char *name, int name_len,
|
2021-11-05 08:00:13 +08:00
|
|
|
const char *data, int data_len, void *ctx)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct send_ctx *sctx = ctx;
|
|
|
|
struct fs_path *p;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = send_remove_xattr(sctx, p, name, name_len);
|
|
|
|
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int process_new_xattr(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_dir_item(sctx->send_root, sctx->left_path,
|
2017-08-21 17:43:44 +08:00
|
|
|
__process_new_xattr, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int process_deleted_xattr(struct send_ctx *sctx)
|
|
|
|
{
|
2016-09-13 03:35:52 +08:00
|
|
|
return iterate_dir_item(sctx->parent_root, sctx->right_path,
|
2017-08-21 17:43:44 +08:00
|
|
|
__process_deleted_xattr, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
struct find_xattr_ctx {
|
|
|
|
const char *name;
|
|
|
|
int name_len;
|
|
|
|
int found_idx;
|
|
|
|
char *found_data;
|
|
|
|
int found_data_len;
|
|
|
|
};
|
|
|
|
|
2021-11-05 08:00:13 +08:00
|
|
|
static int __find_xattr(int num, struct btrfs_key *di_key, const char *name,
|
|
|
|
int name_len, const char *data, int data_len, void *vctx)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
struct find_xattr_ctx *ctx = vctx;
|
|
|
|
|
|
|
|
if (name_len == ctx->name_len &&
|
|
|
|
strncmp(name, ctx->name, name_len) == 0) {
|
|
|
|
ctx->found_idx = num;
|
|
|
|
ctx->found_data_len = data_len;
|
2016-01-19 01:42:13 +08:00
|
|
|
ctx->found_data = kmemdup(data, data_len, GFP_KERNEL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!ctx->found_data)
|
|
|
|
return -ENOMEM;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
static int find_xattr(struct btrfs_root *root,
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
const char *name, int name_len,
|
|
|
|
char **data, int *data_len)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct find_xattr_ctx ctx;
|
|
|
|
|
|
|
|
ctx.name = name;
|
|
|
|
ctx.name_len = name_len;
|
|
|
|
ctx.found_idx = -1;
|
|
|
|
ctx.found_data = NULL;
|
|
|
|
ctx.found_data_len = 0;
|
|
|
|
|
2017-08-21 17:43:44 +08:00
|
|
|
ret = iterate_dir_item(root, path, __find_xattr, &ctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (ctx.found_idx == -1)
|
|
|
|
return -ENOENT;
|
|
|
|
if (data) {
|
|
|
|
*data = ctx.found_data;
|
|
|
|
*data_len = ctx.found_data_len;
|
|
|
|
} else {
|
|
|
|
kfree(ctx.found_data);
|
|
|
|
}
|
|
|
|
return ctx.found_idx;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
static int __process_changed_new_xattr(int num, struct btrfs_key *di_key,
|
|
|
|
const char *name, int name_len,
|
|
|
|
const char *data, int data_len,
|
2021-11-05 08:00:13 +08:00
|
|
|
void *ctx)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct send_ctx *sctx = ctx;
|
|
|
|
char *found_data = NULL;
|
|
|
|
int found_data_len = 0;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = find_xattr(sctx->parent_root, sctx->right_path,
|
|
|
|
sctx->cmp_key, name, name_len, &found_data,
|
|
|
|
&found_data_len);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret == -ENOENT) {
|
|
|
|
ret = __process_new_xattr(num, di_key, name, name_len, data,
|
2021-11-05 08:00:13 +08:00
|
|
|
data_len, ctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
} else if (ret >= 0) {
|
|
|
|
if (data_len != found_data_len ||
|
|
|
|
memcmp(data, found_data, data_len)) {
|
|
|
|
ret = __process_new_xattr(num, di_key, name, name_len,
|
2021-11-05 08:00:13 +08:00
|
|
|
data, data_len, ctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
} else {
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
kfree(found_data);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __process_changed_deleted_xattr(int num, struct btrfs_key *di_key,
|
|
|
|
const char *name, int name_len,
|
|
|
|
const char *data, int data_len,
|
2021-11-05 08:00:13 +08:00
|
|
|
void *ctx)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct send_ctx *sctx = ctx;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = find_xattr(sctx->send_root, sctx->left_path, sctx->cmp_key,
|
|
|
|
name, name_len, NULL, NULL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret == -ENOENT)
|
|
|
|
ret = __process_deleted_xattr(num, di_key, name, name_len, data,
|
2021-11-05 08:00:13 +08:00
|
|
|
data_len, ctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
else if (ret >= 0)
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int process_changed_xattr(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_dir_item(sctx->send_root, sctx->left_path,
|
2017-08-21 17:43:44 +08:00
|
|
|
__process_changed_new_xattr, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = iterate_dir_item(sctx->parent_root, sctx->right_path,
|
2017-08-21 17:43:44 +08:00
|
|
|
__process_changed_deleted_xattr, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int process_all_new_xattrs(struct send_ctx *sctx)
|
|
|
|
{
|
2022-03-09 21:50:47 +08:00
|
|
|
int ret = 0;
|
|
|
|
int iter_ret = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_root *root;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
root = sctx->send_root;
|
|
|
|
|
|
|
|
key.objectid = sctx->cmp_key->objectid;
|
|
|
|
key.type = BTRFS_XATTR_ITEM_KEY;
|
|
|
|
key.offset = 0;
|
2022-03-09 21:50:47 +08:00
|
|
|
btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) {
|
2012-07-26 05:19:24 +08:00
|
|
|
if (found_key.objectid != key.objectid ||
|
|
|
|
found_key.type != key.type) {
|
|
|
|
ret = 0;
|
2022-03-09 21:50:47 +08:00
|
|
|
break;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2017-08-21 17:43:44 +08:00
|
|
|
ret = iterate_dir_item(root, path, __process_new_xattr, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
2022-03-09 21:50:47 +08:00
|
|
|
break;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
2022-03-09 21:50:47 +08:00
|
|
|
/* Catch error found during iteration */
|
|
|
|
if (iter_ret < 0)
|
|
|
|
ret = iter_ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-08-16 04:54:28 +08:00
|
|
|
static int send_verity(struct send_ctx *sctx, struct fs_path *path,
|
|
|
|
struct fsverity_descriptor *desc)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_ENABLE_VERITY);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
|
|
|
|
TLV_PUT_U8(sctx, BTRFS_SEND_A_VERITY_ALGORITHM,
|
|
|
|
le8_to_cpu(desc->hash_algorithm));
|
|
|
|
TLV_PUT_U32(sctx, BTRFS_SEND_A_VERITY_BLOCK_SIZE,
|
|
|
|
1U << le8_to_cpu(desc->log_blocksize));
|
|
|
|
TLV_PUT(sctx, BTRFS_SEND_A_VERITY_SALT_DATA, desc->salt,
|
|
|
|
le8_to_cpu(desc->salt_size));
|
|
|
|
TLV_PUT(sctx, BTRFS_SEND_A_VERITY_SIG_DATA, desc->signature,
|
|
|
|
le32_to_cpu(desc->sig_size));
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int process_verity(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
|
|
|
struct inode *inode;
|
|
|
|
struct fs_path *p;
|
|
|
|
|
|
|
|
inode = btrfs_iget(fs_info->sb, sctx->cur_ino, sctx->send_root);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return PTR_ERR(inode);
|
|
|
|
|
|
|
|
ret = btrfs_get_verity_descriptor(inode, NULL, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto iput;
|
|
|
|
|
|
|
|
if (ret > FS_VERITY_MAX_DESCRIPTOR_SIZE) {
|
|
|
|
ret = -EMSGSIZE;
|
|
|
|
goto iput;
|
|
|
|
}
|
|
|
|
if (!sctx->verity_descriptor) {
|
|
|
|
sctx->verity_descriptor = kvmalloc(FS_VERITY_MAX_DESCRIPTOR_SIZE,
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!sctx->verity_descriptor) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto iput;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_get_verity_descriptor(inode, sctx->verity_descriptor, ret);
|
|
|
|
if (ret < 0)
|
|
|
|
goto iput;
|
|
|
|
|
|
|
|
p = fs_path_alloc();
|
|
|
|
if (!p) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto iput;
|
|
|
|
}
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto free_path;
|
|
|
|
|
|
|
|
ret = send_verity(sctx, p, sctx->verity_descriptor);
|
|
|
|
if (ret < 0)
|
|
|
|
goto free_path;
|
|
|
|
|
|
|
|
free_path:
|
|
|
|
fs_path_free(p);
|
|
|
|
iput:
|
|
|
|
iput(inode);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-08-21 15:39:52 +08:00
|
|
|
static inline u64 max_send_read_size(const struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
return sctx->send_max_size - SZ_16K;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int put_data_header(struct send_ctx *sctx, u32 len)
|
|
|
|
{
|
2022-03-18 01:25:40 +08:00
|
|
|
if (WARN_ON_ONCE(sctx->put_data))
|
|
|
|
return -EINVAL;
|
|
|
|
sctx->put_data = true;
|
|
|
|
if (sctx->proto >= 2) {
|
|
|
|
/*
|
|
|
|
* Since v2, the data attribute header doesn't include a length,
|
|
|
|
* it is implicitly to the end of the command.
|
|
|
|
*/
|
|
|
|
if (sctx->send_max_size - sctx->send_size < sizeof(__le16) + len)
|
|
|
|
return -EOVERFLOW;
|
|
|
|
put_unaligned_le16(BTRFS_SEND_A_DATA, sctx->send_buf + sctx->send_size);
|
|
|
|
sctx->send_size += sizeof(__le16);
|
|
|
|
} else {
|
|
|
|
struct btrfs_tlv_header *hdr;
|
2020-08-21 15:39:52 +08:00
|
|
|
|
2022-03-18 01:25:40 +08:00
|
|
|
if (sctx->send_max_size - sctx->send_size < sizeof(*hdr) + len)
|
|
|
|
return -EOVERFLOW;
|
|
|
|
hdr = (struct btrfs_tlv_header *)(sctx->send_buf + sctx->send_size);
|
|
|
|
put_unaligned_le16(BTRFS_SEND_A_DATA, &hdr->tlv_type);
|
|
|
|
put_unaligned_le16(len, &hdr->tlv_len);
|
|
|
|
sctx->send_size += sizeof(*hdr);
|
|
|
|
}
|
2020-08-21 15:39:52 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int put_file_data(struct send_ctx *sctx, u64 offset, u32 len)
|
2013-10-25 23:36:01 +08:00
|
|
|
{
|
|
|
|
struct btrfs_root *root = sctx->send_root;
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct page *page;
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
pgoff_t index = offset >> PAGE_SHIFT;
|
2013-10-25 23:36:01 +08:00
|
|
|
pgoff_t last_index;
|
2018-12-05 22:23:03 +08:00
|
|
|
unsigned pg_offset = offset_in_page(offset);
|
2020-08-21 15:39:52 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = put_data_header(sctx, len);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2013-10-25 23:36:01 +08:00
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
last_index = (offset + len - 1) >> PAGE_SHIFT;
|
2014-03-05 10:07:35 +08:00
|
|
|
|
2013-10-25 23:36:01 +08:00
|
|
|
while (index <= last_index) {
|
|
|
|
unsigned cur_len = min_t(unsigned, len,
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
PAGE_SIZE - pg_offset);
|
2017-09-15 16:47:45 +08:00
|
|
|
|
2022-05-06 01:16:14 +08:00
|
|
|
page = find_lock_page(sctx->cur_inode->i_mapping, index);
|
2013-10-25 23:36:01 +08:00
|
|
|
if (!page) {
|
2022-05-06 01:16:14 +08:00
|
|
|
page_cache_sync_readahead(sctx->cur_inode->i_mapping,
|
|
|
|
&sctx->ra, NULL, index,
|
|
|
|
last_index + 1 - index);
|
2017-09-15 16:47:45 +08:00
|
|
|
|
2022-05-06 01:16:14 +08:00
|
|
|
page = find_or_create_page(sctx->cur_inode->i_mapping,
|
|
|
|
index, GFP_KERNEL);
|
2017-09-15 16:47:45 +08:00
|
|
|
if (!page) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-05-06 01:16:14 +08:00
|
|
|
if (PageReadahead(page))
|
|
|
|
page_cache_async_readahead(sctx->cur_inode->i_mapping,
|
2022-05-25 10:55:07 +08:00
|
|
|
&sctx->ra, NULL, page_folio(page),
|
|
|
|
index, last_index + 1 - index);
|
2013-10-25 23:36:01 +08:00
|
|
|
|
|
|
|
if (!PageUptodate(page)) {
|
2022-04-29 23:12:16 +08:00
|
|
|
btrfs_read_folio(NULL, page_folio(page));
|
2013-10-25 23:36:01 +08:00
|
|
|
lock_page(page);
|
|
|
|
if (!PageUptodate(page)) {
|
|
|
|
unlock_page(page);
|
2022-02-06 02:48:23 +08:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"send: IO error at offset %llu for inode %llu root %llu",
|
|
|
|
page_offset(page), sctx->cur_ino,
|
|
|
|
sctx->send_root->root_key.objectid);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2013-10-25 23:36:01 +08:00
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: use memcpy_[to|from]_page() and kmap_local_page()
There are many places where the pattern kmap/memcpy/kunmap occurs.
This pattern was lifted to the core common functions
memcpy_[to|from]_page().
Use these new functions to reduce the code, eliminate direct uses of
kmap, and leverage the new core functions use of kmap_local_page().
Also, there is 1 place where a kmap/memcpy is followed by an
optional memset. Here we leave the kmap open coded to avoid remapping
the page but use kmap_local_page() directly.
Development of this patch was aided by the coccinelle script:
// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
//
// NOTE: Offsets and other expressions may be more complex than what the script
// will automatically generate. Therefore a catchall rule is provided to find
// the pattern which then must be evaluated by hand.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:
//
// simple memcpy version
//
@ memcpy_rule1 @
expression page, T, F, B, Off;
identifier ptr;
type VP;
@@
(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
-memcpy(ptr + Off, F, B);
+memcpy_to_page(page, Off, F, B);
|
-memcpy(ptr, F, B);
+memcpy_to_page(page, 0, F, B);
|
-memcpy(T, ptr + Off, B);
+memcpy_from_page(T, page, Off, B);
|
-memcpy(T, ptr, B);
+memcpy_from_page(T, page, 0, B);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)
// Remove any pointers left unused
@
depends on memcpy_rule1
@
identifier memcpy_rule1.ptr;
type VP, VP1;
@@
-VP ptr;
... when != ptr;
? VP1 ptr;
//
// Some callers kmap without a temp pointer
//
@ memcpy_rule2 @
expression page, T, Off, F, B;
@@
<+...
(
-memcpy(kmap(page) + Off, F, B);
+memcpy_to_page(page, Off, F, B);
|
-memcpy(kmap(page), F, B);
+memcpy_to_page(page, 0, F, B);
|
-memcpy(T, kmap(page) + Off, B);
+memcpy_from_page(T, page, Off, B);
|
-memcpy(T, kmap(page), B);
+memcpy_from_page(T, page, 0, B);
)
...+>
-kunmap(page);
// No need for the ptr variable removal
//
// Catch all
//
@ memcpy_rule3 @
expression page;
expression GenTo, GenFrom, GenSize;
identifier ptr;
type VP;
@@
(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
//
// Some call sites have complex expressions within the memcpy
// match a catch all to be evaluated by hand.
//
-memcpy(GenTo, GenFrom, GenSize);
+memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize);
+memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)
// Remove any pointers left unused
@
depends on memcpy_rule3
@
identifier memcpy_rule3.ptr;
type VP, VP1;
@@
-VP ptr;
... when != ptr;
? VP1 ptr;
// <smpl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-10 14:22:19 +08:00
|
|
|
memcpy_from_page(sctx->send_buf + sctx->send_size, page,
|
|
|
|
pg_offset, cur_len);
|
2013-10-25 23:36:01 +08:00
|
|
|
unlock_page(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
put_page(page);
|
2013-10-25 23:36:01 +08:00
|
|
|
index++;
|
|
|
|
pg_offset = 0;
|
|
|
|
len -= cur_len;
|
2020-08-21 15:39:52 +08:00
|
|
|
sctx->send_size += cur_len;
|
2013-10-25 23:36:01 +08:00
|
|
|
}
|
2022-05-06 01:16:14 +08:00
|
|
|
|
2013-10-25 23:36:01 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* Read some bytes from the current inode/file and send a write command to
|
|
|
|
* user space.
|
|
|
|
*/
|
|
|
|
static int send_write(struct send_ctx *sctx, u64 offset, u32 len)
|
|
|
|
{
|
2016-09-20 22:05:03 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(fs_info, "send_write offset=%llu, len=%d", offset, len);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_WRITE);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
|
2020-08-21 15:39:52 +08:00
|
|
|
ret = put_file_data(sctx, offset, len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2020-08-21 15:39:51 +08:00
|
|
|
return ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Send a clone command to user space.
|
|
|
|
*/
|
|
|
|
static int send_clone(struct send_ctx *sctx,
|
|
|
|
u64 offset, u32 len,
|
|
|
|
struct clone_root *clone_root)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p;
|
|
|
|
u64 gen;
|
|
|
|
|
2016-09-20 22:05:03 +08:00
|
|
|
btrfs_debug(sctx->send_root->fs_info,
|
|
|
|
"send_clone offset=%llu, len=%d, clone_root=%llu, clone_inode=%llu, clone_offset=%llu",
|
2018-08-06 13:25:24 +08:00
|
|
|
offset, len, clone_root->root->root_key.objectid,
|
|
|
|
clone_root->ino, clone_root->offset);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_CLONE);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_LEN, len);
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
|
|
|
|
2012-07-28 22:33:49 +08:00
|
|
|
if (clone_root->root == sctx->send_root) {
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(sctx->send_root, clone_root->ino, &gen);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = get_cur_path(sctx, clone_root->ino, gen, p);
|
|
|
|
} else {
|
2013-05-08 15:51:52 +08:00
|
|
|
ret = get_inode_path(clone_root->root, clone_root->ino, p);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2015-06-05 05:17:25 +08:00
|
|
|
/*
|
|
|
|
* If the parent we're using has a received_uuid set then use that as
|
|
|
|
* our clone source as that is what we will look for when doing a
|
|
|
|
* receive.
|
|
|
|
*
|
|
|
|
* This covers the case that we create a snapshot off of a received
|
|
|
|
* subvolume and then use that as the parent and try to receive on a
|
|
|
|
* different host.
|
|
|
|
*/
|
|
|
|
if (!btrfs_is_empty_uuid(clone_root->root->root_item.received_uuid))
|
|
|
|
TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
|
|
|
|
clone_root->root->root_item.received_uuid);
|
|
|
|
else
|
|
|
|
TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
|
|
|
|
clone_root->root->root_item.uuid);
|
2012-07-26 05:19:24 +08:00
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID,
|
2020-09-15 20:30:15 +08:00
|
|
|
btrfs_root_ctransid(&clone_root->root->root_item));
|
2012-07-26 05:19:24 +08:00
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_CLONE_PATH, p);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_OFFSET,
|
|
|
|
clone_root->offset);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-02-05 04:54:57 +08:00
|
|
|
/*
|
|
|
|
* Send an update extent command to user space.
|
|
|
|
*/
|
|
|
|
static int send_update_extent(struct send_ctx *sctx,
|
|
|
|
u64 offset, u32 len)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct fs_path *p;
|
|
|
|
|
2013-05-08 15:51:52 +08:00
|
|
|
p = fs_path_alloc();
|
2013-02-05 04:54:57 +08:00
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_UPDATE_EXTENT);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, len);
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
2013-05-08 15:51:52 +08:00
|
|
|
fs_path_free(p);
|
2013-02-05 04:54:57 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-10-23 00:18:51 +08:00
|
|
|
static int send_hole(struct send_ctx *sctx, u64 end)
|
|
|
|
{
|
|
|
|
struct fs_path *p = NULL;
|
2020-08-21 15:39:52 +08:00
|
|
|
u64 read_size = max_send_read_size(sctx);
|
2013-10-23 00:18:51 +08:00
|
|
|
u64 offset = sctx->cur_inode_last_extent;
|
|
|
|
int ret = 0;
|
|
|
|
|
Btrfs: send, fix incorrect file layout after hole punching beyond eof
When doing an incremental send, if we have a file in the parent snapshot
that has prealloc extents beyond EOF and in the send snapshot it got a
hole punch that partially covers the prealloc extents, the send stream,
when replayed by a receiver, can result in a file that has a size bigger
than it should and filled with zeroes past the correct EOF.
For example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "falloc -k 0 4M" /mnt/foobar
$ xfs_io -c "pwrite -S 0xea 0 1M" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send -f /tmp/1.send /mnt/snap1
$ xfs_io -c "fpunch 1M 2M" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -f /tmp/2.send -p /mnt/snap1 /mnt/snap2
$ stat --format %s /mnt/snap2/foobar
1048576
$ md5sum /mnt/snap2/foobar
d31659e82e87798acd4669a1e0a19d4f /mnt/snap2/foobar
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /mnt/1.snap /mnt
$ btrfs receive -f /mnt/2.snap /mnt
$ stat --format %s /mnt/snap2/foobar
3145728
# --> should be 1Mb and not 3Mb (which was the end offset of hole
# punch operation)
$ md5sum /mnt/snap2/foobar
117baf295297c2a995f92da725b0b651 /mnt/snap2/foobar
# --> should be d31659e82e87798acd4669a1e0a19d4f as in the original fs
This issue actually happens only since commit ffa7c4296e93 ("Btrfs: send,
do not issue unnecessary truncate operations"), but before that commit we
were issuing a write operation full of zeroes (to "punch" a hole) which
was extending the file size beyond the correct value and then immediately
issue a truncate operation to the correct size and undoing the previous
write operation. Since the send protocol does not support fallocate, for
extent preallocation and hole punching, fix this by not even attempting
to send a "hole" (regular write full of zeroes) if it starts at an offset
greater then or equals to the file's size. This approach, besides being
much more simple then making send issue the truncate operation, adds the
benefit of avoiding the useless pair of write of zeroes and truncate
operations, saving time and IO at the receiver and reducing the size of
the send stream.
A test case for fstests follows soon.
Fixes: ffa7c4296e93 ("Btrfs: send, do not issue unnecessary truncate operations")
CC: stable@vger.kernel.org # 4.17+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-30 19:39:58 +08:00
|
|
|
/*
|
|
|
|
* A hole that starts at EOF or beyond it. Since we do not yet support
|
|
|
|
* fallocate (for extent preallocation and hole punching), sending a
|
|
|
|
* write of zeroes starting at EOF or beyond would later require issuing
|
|
|
|
* a truncate operation which would undo the write and achieve nothing.
|
|
|
|
*/
|
|
|
|
if (offset >= sctx->cur_inode_size)
|
|
|
|
return 0;
|
|
|
|
|
2019-05-20 16:55:42 +08:00
|
|
|
/*
|
|
|
|
* Don't go beyond the inode's i_size due to prealloc extents that start
|
|
|
|
* after the i_size.
|
|
|
|
*/
|
|
|
|
end = min_t(u64, end, sctx->cur_inode_size);
|
|
|
|
|
2018-02-07 04:39:20 +08:00
|
|
|
if (sctx->flags & BTRFS_SEND_FLAG_NO_FILE_DATA)
|
|
|
|
return send_update_extent(sctx, offset, end - offset);
|
|
|
|
|
2013-10-23 00:18:51 +08:00
|
|
|
p = fs_path_alloc();
|
|
|
|
if (!p)
|
|
|
|
return -ENOMEM;
|
2014-03-31 21:52:14 +08:00
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
|
|
|
|
if (ret < 0)
|
|
|
|
goto tlv_put_failure;
|
2013-10-23 00:18:51 +08:00
|
|
|
while (offset < end) {
|
2020-08-21 15:39:52 +08:00
|
|
|
u64 len = min(end - offset, read_size);
|
2013-10-23 00:18:51 +08:00
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_WRITE);
|
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
|
2020-08-21 15:39:52 +08:00
|
|
|
ret = put_data_header(sctx, len);
|
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
memset(sctx->send_buf + sctx->send_size, 0, len);
|
|
|
|
sctx->send_size += len;
|
2013-10-23 00:18:51 +08:00
|
|
|
ret = send_cmd(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
offset += len;
|
|
|
|
}
|
2018-02-07 04:40:40 +08:00
|
|
|
sctx->cur_inode_next_write_offset = offset;
|
2013-10-23 00:18:51 +08:00
|
|
|
tlv_put_failure:
|
|
|
|
fs_path_free(p);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-03-18 01:25:42 +08:00
|
|
|
static int send_encoded_inline_extent(struct send_ctx *sctx,
|
|
|
|
struct btrfs_path *path, u64 offset,
|
|
|
|
u64 len)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = sctx->send_root;
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct inode *inode;
|
|
|
|
struct fs_path *fspath;
|
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_file_extent_item *ei;
|
|
|
|
u64 ram_bytes;
|
|
|
|
size_t inline_size;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
inode = btrfs_iget(fs_info->sb, sctx->cur_ino, root);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return PTR_ERR(inode);
|
|
|
|
|
|
|
|
fspath = fs_path_alloc();
|
|
|
|
if (!fspath) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_ENCODED_WRITE);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, fspath);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
|
|
|
|
ram_bytes = btrfs_file_extent_ram_bytes(leaf, ei);
|
|
|
|
inline_size = btrfs_file_extent_inline_item_len(leaf, path->slots[0]);
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, fspath);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_FILE_LEN,
|
|
|
|
min(key.offset + ram_bytes - offset, len));
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_LEN, ram_bytes);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_OFFSET, offset - key.offset);
|
|
|
|
ret = btrfs_encoded_io_compression_from_extent(fs_info,
|
|
|
|
btrfs_file_extent_compression(leaf, ei));
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
TLV_PUT_U32(sctx, BTRFS_SEND_A_COMPRESSION, ret);
|
|
|
|
|
|
|
|
ret = put_data_header(sctx, inline_size);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
read_extent_buffer(leaf, sctx->send_buf + sctx->send_size,
|
|
|
|
btrfs_file_extent_inline_start(ei), inline_size);
|
|
|
|
sctx->send_size += inline_size;
|
|
|
|
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
fs_path_free(fspath);
|
|
|
|
iput(inode);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int send_encoded_extent(struct send_ctx *sctx, struct btrfs_path *path,
|
|
|
|
u64 offset, u64 len)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = sctx->send_root;
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct inode *inode;
|
|
|
|
struct fs_path *fspath;
|
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_file_extent_item *ei;
|
|
|
|
u64 disk_bytenr, disk_num_bytes;
|
|
|
|
u32 data_offset;
|
|
|
|
struct btrfs_cmd_header *hdr;
|
|
|
|
u32 crc;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
inode = btrfs_iget(fs_info->sb, sctx->cur_ino, root);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return PTR_ERR(inode);
|
|
|
|
|
|
|
|
fspath = fs_path_alloc();
|
|
|
|
if (!fspath) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_ENCODED_WRITE);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, fspath);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
|
|
|
|
disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
|
|
|
|
disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, ei);
|
|
|
|
|
|
|
|
TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, fspath);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_FILE_LEN,
|
|
|
|
min(key.offset + btrfs_file_extent_num_bytes(leaf, ei) - offset,
|
|
|
|
len));
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_LEN,
|
|
|
|
btrfs_file_extent_ram_bytes(leaf, ei));
|
|
|
|
TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_OFFSET,
|
|
|
|
offset - key.offset + btrfs_file_extent_offset(leaf, ei));
|
|
|
|
ret = btrfs_encoded_io_compression_from_extent(fs_info,
|
|
|
|
btrfs_file_extent_compression(leaf, ei));
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
TLV_PUT_U32(sctx, BTRFS_SEND_A_COMPRESSION, ret);
|
|
|
|
TLV_PUT_U32(sctx, BTRFS_SEND_A_ENCRYPTION, 0);
|
|
|
|
|
|
|
|
ret = put_data_header(sctx, disk_num_bytes);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We want to do I/O directly into the send buffer, so get the next page
|
|
|
|
* boundary in the send buffer. This means that there may be a gap
|
|
|
|
* between the beginning of the command and the file data.
|
|
|
|
*/
|
2023-01-03 13:11:37 +08:00
|
|
|
data_offset = PAGE_ALIGN(sctx->send_size);
|
2022-03-18 01:25:42 +08:00
|
|
|
if (data_offset > sctx->send_max_size ||
|
|
|
|
sctx->send_max_size - data_offset < disk_num_bytes) {
|
|
|
|
ret = -EOVERFLOW;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note that send_buf is a mapping of send_buf_pages, so this is really
|
|
|
|
* reading into send_buf.
|
|
|
|
*/
|
|
|
|
ret = btrfs_encoded_read_regular_fill_pages(BTRFS_I(inode), offset,
|
|
|
|
disk_bytenr, disk_num_bytes,
|
|
|
|
sctx->send_buf_pages +
|
|
|
|
(data_offset >> PAGE_SHIFT));
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
hdr = (struct btrfs_cmd_header *)sctx->send_buf;
|
|
|
|
hdr->len = cpu_to_le32(sctx->send_size + disk_num_bytes - sizeof(*hdr));
|
|
|
|
hdr->crc = 0;
|
2023-08-26 04:19:20 +08:00
|
|
|
crc = crc32c(0, sctx->send_buf, sctx->send_size);
|
|
|
|
crc = crc32c(crc, sctx->send_buf + data_offset, disk_num_bytes);
|
2022-03-18 01:25:42 +08:00
|
|
|
hdr->crc = cpu_to_le32(crc);
|
|
|
|
|
|
|
|
ret = write_buf(sctx->send_filp, sctx->send_buf, sctx->send_size,
|
|
|
|
&sctx->send_off);
|
|
|
|
if (!ret) {
|
|
|
|
ret = write_buf(sctx->send_filp, sctx->send_buf + data_offset,
|
|
|
|
disk_num_bytes, &sctx->send_off);
|
|
|
|
}
|
|
|
|
sctx->send_size = 0;
|
|
|
|
sctx->put_data = false;
|
|
|
|
|
|
|
|
tlv_put_failure:
|
|
|
|
out:
|
|
|
|
fs_path_free(fspath);
|
|
|
|
iput(inode);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int send_extent_data(struct send_ctx *sctx, struct btrfs_path *path,
|
|
|
|
const u64 offset, const u64 len)
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
{
|
btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 18:47:30 +08:00
|
|
|
const u64 end = offset + len;
|
2022-03-18 01:25:42 +08:00
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
struct btrfs_file_extent_item *ei;
|
2020-08-21 15:39:52 +08:00
|
|
|
u64 read_size = max_send_read_size(sctx);
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
u64 sent = 0;
|
|
|
|
|
|
|
|
if (sctx->flags & BTRFS_SEND_FLAG_NO_FILE_DATA)
|
|
|
|
return send_update_extent(sctx, offset, len);
|
|
|
|
|
2022-03-18 01:25:42 +08:00
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_file_extent_item);
|
|
|
|
if ((sctx->flags & BTRFS_SEND_FLAG_COMPRESSED) &&
|
|
|
|
btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
|
|
|
|
bool is_inline = (btrfs_file_extent_type(leaf, ei) ==
|
|
|
|
BTRFS_FILE_EXTENT_INLINE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Send the compressed extent unless the compressed data is
|
|
|
|
* larger than the decompressed data. This can happen if we're
|
|
|
|
* not sending the entire extent, either because it has been
|
|
|
|
* partially overwritten/truncated or because this is a part of
|
|
|
|
* the extent that we couldn't clone in clone_range().
|
|
|
|
*/
|
|
|
|
if (is_inline &&
|
|
|
|
btrfs_file_extent_inline_item_len(leaf,
|
|
|
|
path->slots[0]) <= len) {
|
|
|
|
return send_encoded_inline_extent(sctx, path, offset,
|
|
|
|
len);
|
|
|
|
} else if (!is_inline &&
|
|
|
|
btrfs_file_extent_disk_num_bytes(leaf, ei) <= len) {
|
|
|
|
return send_encoded_extent(sctx, path, offset, len);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-05-06 01:16:14 +08:00
|
|
|
if (sctx->cur_inode == NULL) {
|
|
|
|
struct btrfs_root *root = sctx->send_root;
|
|
|
|
|
|
|
|
sctx->cur_inode = btrfs_iget(root->fs_info->sb, sctx->cur_ino, root);
|
|
|
|
if (IS_ERR(sctx->cur_inode)) {
|
|
|
|
int err = PTR_ERR(sctx->cur_inode);
|
|
|
|
|
|
|
|
sctx->cur_inode = NULL;
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
memset(&sctx->ra, 0, sizeof(struct file_ra_state));
|
|
|
|
file_ra_state_init(&sctx->ra, sctx->cur_inode->i_mapping);
|
btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 18:47:30 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* It's very likely there are no pages from this inode in the page
|
|
|
|
* cache, so after reading extents and sending their data, we clean
|
|
|
|
* the page cache to avoid trashing the page cache (adding pressure
|
|
|
|
* to the page cache and forcing eviction of other data more useful
|
|
|
|
* for applications).
|
|
|
|
*
|
|
|
|
* We decide if we should clean the page cache simply by checking
|
|
|
|
* if the inode's mapping nrpages is 0 when we first open it, and
|
|
|
|
* not by using something like filemap_range_has_page() before
|
|
|
|
* reading an extent because when we ask the readahead code to
|
|
|
|
* read a given file range, it may (and almost always does) read
|
|
|
|
* pages from beyond that range (see the documentation for
|
|
|
|
* page_cache_sync_readahead()), so it would not be reliable,
|
|
|
|
* because after reading the first extent future calls to
|
|
|
|
* filemap_range_has_page() would return true because the readahead
|
|
|
|
* on the previous extent resulted in reading pages of the current
|
|
|
|
* extent as well.
|
|
|
|
*/
|
|
|
|
sctx->clean_page_cache = (sctx->cur_inode->i_mapping->nrpages == 0);
|
|
|
|
sctx->page_cache_clear_start = round_down(offset, PAGE_SIZE);
|
2022-05-06 01:16:14 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
while (sent < len) {
|
2020-08-21 15:39:52 +08:00
|
|
|
u64 size = min(len - sent, read_size);
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = send_write(sctx, offset + sent, size);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
2020-08-21 15:39:51 +08:00
|
|
|
sent += size;
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
}
|
btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 18:47:30 +08:00
|
|
|
|
2023-01-03 13:11:37 +08:00
|
|
|
if (sctx->clean_page_cache && PAGE_ALIGNED(end)) {
|
btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 18:47:30 +08:00
|
|
|
/*
|
|
|
|
* Always operate only on ranges that are a multiple of the page
|
|
|
|
* size. This is not only to prevent zeroing parts of a page in
|
|
|
|
* the case of subpage sector size, but also to guarantee we evict
|
|
|
|
* pages, as passing a range that is smaller than page size does
|
|
|
|
* not evict the respective page (only zeroes part of its content).
|
|
|
|
*
|
|
|
|
* Always start from the end offset of the last range cleared.
|
|
|
|
* This is because the readahead code may (and very often does)
|
|
|
|
* reads pages beyond the range we request for readahead. So if
|
|
|
|
* we have an extent layout like this:
|
|
|
|
*
|
|
|
|
* [ extent A ] [ extent B ] [ extent C ]
|
|
|
|
*
|
|
|
|
* When we ask page_cache_sync_readahead() to read extent A, it
|
|
|
|
* may also trigger reads for pages of extent B. If we are doing
|
|
|
|
* an incremental send and extent B has not changed between the
|
|
|
|
* parent and send snapshots, some or all of its pages may end
|
|
|
|
* up being read and placed in the page cache. So when truncating
|
|
|
|
* the page cache we always start from the end offset of the
|
|
|
|
* previously processed extent up to the end of the current
|
|
|
|
* extent.
|
|
|
|
*/
|
|
|
|
truncate_inode_pages_range(&sctx->cur_inode->i_data,
|
|
|
|
sctx->page_cache_clear_start,
|
|
|
|
end - 1);
|
|
|
|
sctx->page_cache_clear_start = end;
|
|
|
|
}
|
|
|
|
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
btrfs: send: emit file capabilities after chown
Whenever a chown is executed, all capabilities of the file being touched
are lost. When doing incremental send with a file with capabilities,
there is a situation where the capability can be lost on the receiving
side. The sequence of actions bellow shows the problem:
$ mount /dev/sda fs1
$ mount /dev/sdb fs2
$ touch fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_init
$ btrfs send fs1/snap_init | btrfs receive fs2
$ chgrp adm fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_complete
$ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
$ btrfs send fs1/snap_complete | btrfs receive fs2
$ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
At this point, only a chown was emitted by "btrfs send" since only the
group was changed. This makes the cap_sys_nice capability to be dropped
from fs2/snap_incremental/foo.bar
To fix that, only emit capabilities after chown is emitted. The current
code first checks for xattrs that are new/changed, emits them, and later
emit the chown. Now, __process_new_xattr skips capabilities, letting
only finish_inode_if_needed to emit them, if they exist, for the inode
being processed.
This behavior was being worked around in "btrfs receive" side by caching
the capability and only applying it after chown. Now, xattrs are only
emmited _after_ chown, making that workaround not needed anymore.
Link: https://github.com/kdave/btrfs-progs/issues/202
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-11 10:15:07 +08:00
|
|
|
/*
|
|
|
|
* Search for a capability xattr related to sctx->cur_ino. If the capability is
|
|
|
|
* found, call send_set_xattr function to emit it.
|
|
|
|
*
|
|
|
|
* Return 0 if there isn't a capability, or when the capability was emitted
|
|
|
|
* successfully, or < 0 if an error occurred.
|
|
|
|
*/
|
|
|
|
static int send_capabilities(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
struct fs_path *fspath = NULL;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_dir_item *di;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
unsigned long data_ptr;
|
|
|
|
char *buf = NULL;
|
|
|
|
int buf_len;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
di = btrfs_lookup_xattr(NULL, sctx->send_root, path, sctx->cur_ino,
|
|
|
|
XATTR_NAME_CAPS, strlen(XATTR_NAME_CAPS), 0);
|
|
|
|
if (!di) {
|
|
|
|
/* There is no xattr for this inode */
|
|
|
|
goto out;
|
|
|
|
} else if (IS_ERR(di)) {
|
|
|
|
ret = PTR_ERR(di);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
buf_len = btrfs_dir_data_len(leaf, di);
|
|
|
|
|
|
|
|
fspath = fs_path_alloc();
|
|
|
|
buf = kmalloc(buf_len, GFP_KERNEL);
|
|
|
|
if (!fspath || !buf) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, fspath);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
data_ptr = (unsigned long)(di + 1) + btrfs_dir_name_len(leaf, di);
|
|
|
|
read_extent_buffer(leaf, buf, data_ptr, buf_len);
|
|
|
|
|
|
|
|
ret = send_set_xattr(sctx, fspath, XATTR_NAME_CAPS,
|
|
|
|
strlen(XATTR_NAME_CAPS), buf, buf_len);
|
|
|
|
out:
|
|
|
|
kfree(buf);
|
|
|
|
fs_path_free(fspath);
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-03-18 01:25:42 +08:00
|
|
|
static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
|
|
|
|
struct clone_root *clone_root, const u64 disk_byte,
|
|
|
|
u64 data_offset, u64 offset, u64 len)
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret;
|
2022-08-12 22:42:32 +08:00
|
|
|
struct btrfs_inode_info info;
|
2019-09-03 11:30:19 +08:00
|
|
|
u64 clone_src_i_size = 0;
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
|
2017-08-11 05:54:51 +08:00
|
|
|
/*
|
|
|
|
* Prevent cloning from a zero offset with a length matching the sector
|
|
|
|
* size because in some scenarios this will make the receiver fail.
|
|
|
|
*
|
|
|
|
* For example, if in the source filesystem the extent at offset 0
|
|
|
|
* has a length of sectorsize and it was written using direct IO, then
|
|
|
|
* it can never be an inline extent (even if compression is enabled).
|
|
|
|
* Then this extent can be cloned in the original filesystem to a non
|
|
|
|
* zero file offset, but it may not be possible to clone in the
|
|
|
|
* destination filesystem because it can be inlined due to compression
|
|
|
|
* on the destination filesystem (as the receiver's write operations are
|
|
|
|
* always done using buffered IO). The same happens when the original
|
|
|
|
* filesystem does not have compression enabled but the destination
|
|
|
|
* filesystem has.
|
|
|
|
*/
|
|
|
|
if (clone_root->offset == 0 &&
|
|
|
|
len == sctx->send_root->fs_info->sectorsize)
|
2022-03-18 01:25:42 +08:00
|
|
|
return send_extent_data(sctx, dst_path, offset, len);
|
2017-08-11 05:54:51 +08:00
|
|
|
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit of inode size when find clone inodes We can do
partial clone, so there is no need to limit the size of the candidate
inode. When clone a range, we clone the legal range only by bytenr,
offset, len, inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.
Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-29 18:03:27 +08:00
|
|
|
/*
|
|
|
|
* There are inodes that have extents that lie behind its i_size. Don't
|
|
|
|
* accept clones from these extents.
|
|
|
|
*/
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_info(clone_root->root, clone_root->ino, &info);
|
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit of inode size when find clone inodes We can do
partial clone, so there is no need to limit the size of the candidate
inode. When clone a range, we clone the legal range only by bytenr,
offset, len, inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.
Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-29 18:03:27 +08:00
|
|
|
btrfs_release_path(path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2022-08-12 22:42:32 +08:00
|
|
|
clone_src_i_size = info.size;
|
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit of inode size when find clone inodes We can do
partial clone, so there is no need to limit the size of the candidate
inode. When clone a range, we clone the legal range only by bytenr,
offset, len, inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.
Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-29 18:03:27 +08:00
|
|
|
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
/*
|
|
|
|
* We can't send a clone operation for the entire range if we find
|
|
|
|
* extent items in the respective range in the source file that
|
|
|
|
* refer to different extents or if we find holes.
|
|
|
|
* So check for that and do a mix of clone and regular write/copy
|
|
|
|
* operations if needed.
|
|
|
|
*
|
|
|
|
* Example:
|
|
|
|
*
|
|
|
|
* mkfs.btrfs -f /dev/sda
|
|
|
|
* mount /dev/sda /mnt
|
|
|
|
* xfs_io -f -c "pwrite -S 0xaa 0K 100K" /mnt/foo
|
|
|
|
* cp --reflink=always /mnt/foo /mnt/bar
|
|
|
|
* xfs_io -c "pwrite -S 0xbb 50K 50K" /mnt/foo
|
|
|
|
* btrfs subvolume snapshot -r /mnt /mnt/snap
|
|
|
|
*
|
|
|
|
* If when we send the snapshot and we are processing file bar (which
|
|
|
|
* has a higher inode number than foo) we blindly send a clone operation
|
|
|
|
* for the [0, 100K[ range from foo to bar, the receiver ends up getting
|
|
|
|
* a file bar that matches the content of file foo - iow, doesn't match
|
|
|
|
* the content from bar in the original filesystem.
|
|
|
|
*/
|
|
|
|
key.objectid = clone_root->ino;
|
|
|
|
key.type = BTRFS_EXTENT_DATA_KEY;
|
|
|
|
key.offset = clone_root->offset;
|
|
|
|
ret = btrfs_search_slot(NULL, clone_root->root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0 && path->slots[0] > 0) {
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
|
|
|
|
if (key.objectid == clone_root->ino &&
|
|
|
|
key.type == BTRFS_EXTENT_DATA_KEY)
|
|
|
|
path->slots[0]--;
|
|
|
|
}
|
|
|
|
|
|
|
|
while (true) {
|
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
int slot = path->slots[0];
|
|
|
|
struct btrfs_file_extent_item *ei;
|
|
|
|
u8 type;
|
|
|
|
u64 ext_len;
|
|
|
|
u64 clone_len;
|
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit of inode size when find clone inodes We can do
partial clone, so there is no need to limit the size of the candidate
inode. When clone a range, we clone the legal range only by bytenr,
offset, len, inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.
Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-29 18:03:27 +08:00
|
|
|
u64 clone_data_offset;
|
btrfs: send: avoid unaligned encoded writes when attempting to clone range
When trying to see if we can clone a file range, there are cases where we
end up sending two write operations in case the inode from the source root
has an i_size that is not sector size aligned and the length from the
current offset to its i_size is less than the remaining length we are
trying to clone.
Issuing two write operations when we could instead issue a single write
operation is not incorrect. However it is not optimal, specially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fallback to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.
The following example, which triggered a bug in the receiver code for the
fallback logic of decompressing + regular buffer IO and is fixed by the
patchset referred in a Link at the bottom of this changelog, is an example
where we have the non-optimal behaviour due to an unaligned encoded write:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null
mount -o compress $DEV $MNT
# File foo has a size of 33K, not aligned to the sector size.
xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo
xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar
# Now clone the first 32K of file bar into foo at offset 0.
xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo
# Snapshot the default subvolume and create a full send stream (v2).
btrfs subvolume snapshot -r $MNT $MNT/snap
btrfs send --compressed-data -f /tmp/test.send $MNT/snap
echo -e "\nFile bar in the original filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
echo -e "\nReceiving stream in a new filesystem..."
btrfs receive -f /tmp/test.send $MNT
echo -e "\nFile bar in the new filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the later being not sector size aligned
and causing the receiver to fallback to decompression + buffered writes.
The output of the btrfs receive command in verbose mode (-vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
write bar - offset=32768 length=1024
encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
(...)
This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
(...)
So we get more optimal behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).
Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-16 00:29:44 +08:00
|
|
|
bool crossed_src_i_size = false;
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
|
|
|
|
if (slot >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(clone_root->root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
else if (ret > 0)
|
|
|
|
break;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We might have an implicit trailing hole (NO_HOLES feature
|
|
|
|
* enabled). We deal with it after leaving this loop.
|
|
|
|
*/
|
|
|
|
if (key.objectid != clone_root->ino ||
|
|
|
|
key.type != BTRFS_EXTENT_DATA_KEY)
|
|
|
|
break;
|
|
|
|
|
|
|
|
ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
|
|
|
|
type = btrfs_file_extent_type(leaf, ei);
|
|
|
|
if (type == BTRFS_FILE_EXTENT_INLINE) {
|
2018-06-06 15:41:49 +08:00
|
|
|
ext_len = btrfs_file_extent_ram_bytes(leaf, ei);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
ext_len = PAGE_ALIGN(ext_len);
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
} else {
|
|
|
|
ext_len = btrfs_file_extent_num_bytes(leaf, ei);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (key.offset + ext_len <= clone_root->offset)
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
if (key.offset > clone_root->offset) {
|
|
|
|
/* Implicit hole, NO_HOLES feature enabled. */
|
|
|
|
u64 hole_len = key.offset - clone_root->offset;
|
|
|
|
|
|
|
|
if (hole_len > len)
|
|
|
|
hole_len = len;
|
2022-03-18 01:25:42 +08:00
|
|
|
ret = send_extent_data(sctx, dst_path, offset,
|
|
|
|
hole_len);
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
len -= hole_len;
|
|
|
|
if (len == 0)
|
|
|
|
break;
|
|
|
|
offset += hole_len;
|
|
|
|
clone_root->offset += hole_len;
|
|
|
|
data_offset += hole_len;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (key.offset >= clone_root->offset + len)
|
|
|
|
break;
|
|
|
|
|
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit of inode size when find clone inodes We can do
partial clone, so there is no need to limit the size of the candidate
inode. When clone a range, we clone the legal range only by bytenr,
offset, len, inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.
Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-29 18:03:27 +08:00
|
|
|
if (key.offset >= clone_src_i_size)
|
|
|
|
break;
|
|
|
|
|
btrfs: send: avoid unaligned encoded writes when attempting to clone range
When trying to see if we can clone a file range, there are cases where we
end up sending two write operations in case the inode from the source root
has an i_size that is not sector size aligned and the length from the
current offset to its i_size is less than the remaining length we are
trying to clone.
Issuing two write operations when we could instead issue a single write
operation is not incorrect. However it is not optimal, specially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fallback to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.
The following example, which triggered a bug in the receiver code for the
fallback logic of decompressing + regular buffer IO and is fixed by the
patchset referred in a Link at the bottom of this changelog, is an example
where we have the non-optimal behaviour due to an unaligned encoded write:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null
mount -o compress $DEV $MNT
# File foo has a size of 33K, not aligned to the sector size.
xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo
xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar
# Now clone the first 32K of file bar into foo at offset 0.
xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo
# Snapshot the default subvolume and create a full send stream (v2).
btrfs subvolume snapshot -r $MNT $MNT/snap
btrfs send --compressed-data -f /tmp/test.send $MNT/snap
echo -e "\nFile bar in the original filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
echo -e "\nReceiving stream in a new filesystem..."
btrfs receive -f /tmp/test.send $MNT
echo -e "\nFile bar in the new filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the later being not sector size aligned
and causing the receiver to fallback to decompression + buffered writes.
The output of the btrfs receive command in verbose mode (-vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
write bar - offset=32768 length=1024
encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
(...)
This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
(...)
So we get more optimal behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).
Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-16 00:29:44 +08:00
|
|
|
if (key.offset + ext_len > clone_src_i_size) {
|
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit of inode size when find clone inodes We can do
partial clone, so there is no need to limit the size of the candidate
inode. When clone a range, we clone the legal range only by bytenr,
offset, len, inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.
Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-29 18:03:27 +08:00
|
|
|
ext_len = clone_src_i_size - key.offset;
|
btrfs: send: avoid unaligned encoded writes when attempting to clone range
When trying to see if we can clone a file range, there are cases where we
end up sending two write operations in case the inode from the source root
has an i_size that is not sector size aligned and the length from the
current offset to its i_size is less than the remaining length we are
trying to clone.
Issuing two write operations when we could instead issue a single write
operation is not incorrect. However it is not optimal, specially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fallback to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.
The following example, which triggered a bug in the receiver code for the
fallback logic of decompressing + regular buffer IO and is fixed by the
patchset referred in a Link at the bottom of this changelog, is an example
where we have the non-optimal behaviour due to an unaligned encoded write:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null
mount -o compress $DEV $MNT
# File foo has a size of 33K, not aligned to the sector size.
xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo
xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar
# Now clone the first 32K of file bar into foo at offset 0.
xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo
# Snapshot the default subvolume and create a full send stream (v2).
btrfs subvolume snapshot -r $MNT $MNT/snap
btrfs send --compressed-data -f /tmp/test.send $MNT/snap
echo -e "\nFile bar in the original filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
echo -e "\nReceiving stream in a new filesystem..."
btrfs receive -f /tmp/test.send $MNT
echo -e "\nFile bar in the new filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the later being not sector size aligned
and causing the receiver to fallback to decompression + buffered writes.
The output of the btrfs receive command in verbose mode (-vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
write bar - offset=32768 length=1024
encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
(...)
This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
(...)
So we get more optimal behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).
Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-16 00:29:44 +08:00
|
|
|
crossed_src_i_size = true;
|
|
|
|
}
|
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit of inode size when find clone inodes We can do
partial clone, so there is no need to limit the size of the candidate
inode. When clone a range, we clone the legal range only by bytenr,
offset, len, inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.
Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-29 18:03:27 +08:00
|
|
|
|
|
|
|
clone_data_offset = btrfs_file_extent_offset(leaf, ei);
|
|
|
|
if (btrfs_file_extent_disk_bytenr(leaf, ei) == disk_byte) {
|
|
|
|
clone_root->offset = key.offset;
|
|
|
|
if (clone_data_offset < data_offset &&
|
|
|
|
clone_data_offset + ext_len > data_offset) {
|
|
|
|
u64 extent_offset;
|
|
|
|
|
|
|
|
extent_offset = data_offset - clone_data_offset;
|
|
|
|
ext_len -= extent_offset;
|
|
|
|
clone_data_offset += extent_offset;
|
|
|
|
clone_root->offset += extent_offset;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
clone_len = min_t(u64, ext_len, len);
|
|
|
|
|
|
|
|
if (btrfs_file_extent_disk_bytenr(leaf, ei) == disk_byte &&
|
Btrfs: incremental send, fix emission of invalid clone operations
When doing an incremental send we can now issue clone operations with a
source range that ends at the source's file eof and with a destination
range that ends at an offset smaller then the destination's file eof.
If the eof of the source file is not aligned to the sector size of the
filesystem, the receiver will get a -EINVAL error when trying to do the
operation or, on older kernels, silently corrupt the destination file.
The corruption happens on kernels without commit ac765f83f1397646
("Btrfs: fix data corruption due to cloning of eof block"), while the
failure to clone happens on kernels with that commit.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
$ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
$ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
$ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.send /mnt/sdb/base
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
$ xfs_io -c "truncate 550K" /mnt/sdb/bar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.send /mnt/sdc
$ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
(...)
truncate bar size=563200
utimes bar
clone zoo - source=bar source offset=512000 offset=0 length=51200
ERROR: failed to clone extents to zoo
Invalid argument
The failure happens because the clone source range ends at the eof of file
bar, 563200, which is not aligned to the filesystems sector size (4Kb in
this case), and the destination range ends at offset 0 + 51200, which is
less then the size of the file zoo (2Mb).
So fix this by detecting such case and instead of issuing a clone
operation for the whole range, do a clone operation for smaller range
that is sector size aligned followed by a write operation for the block
containing the eof. Here we will always be pessimistic and assume the
destination filesystem of the send stream has the largest possible sector
size (64Kb), since we have no way of determining it.
This fixes a recent regression introduced in kernel 5.2-rc1.
Fixes: 040ee6120cb6706 ("Btrfs: send, improve clone range")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-20 16:57:00 +08:00
|
|
|
clone_data_offset == data_offset) {
|
|
|
|
const u64 src_end = clone_root->offset + clone_len;
|
|
|
|
const u64 sectorsize = SZ_64K;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We can't clone the last block, when its size is not
|
|
|
|
* sector size aligned, into the middle of a file. If we
|
|
|
|
* do so, the receiver will get a failure (-EINVAL) when
|
|
|
|
* trying to clone or will silently corrupt the data in
|
|
|
|
* the destination file if it's on a kernel without the
|
|
|
|
* fix introduced by commit ac765f83f1397646
|
|
|
|
* ("Btrfs: fix data corruption due to cloning of eof
|
|
|
|
* block).
|
|
|
|
*
|
|
|
|
* So issue a clone of the aligned down range plus a
|
|
|
|
* regular write for the eof block, if we hit that case.
|
|
|
|
*
|
|
|
|
* Also, we use the maximum possible sector size, 64K,
|
|
|
|
* because we don't know what's the sector size of the
|
|
|
|
* filesystem that receives the stream, so we have to
|
|
|
|
* assume the largest possible sector size.
|
|
|
|
*/
|
|
|
|
if (src_end == clone_src_i_size &&
|
|
|
|
!IS_ALIGNED(src_end, sectorsize) &&
|
|
|
|
offset + clone_len < sctx->cur_inode_size) {
|
|
|
|
u64 slen;
|
|
|
|
|
|
|
|
slen = ALIGN_DOWN(src_end - clone_root->offset,
|
|
|
|
sectorsize);
|
|
|
|
if (slen > 0) {
|
|
|
|
ret = send_clone(sctx, offset, slen,
|
|
|
|
clone_root);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2022-03-18 01:25:42 +08:00
|
|
|
ret = send_extent_data(sctx, dst_path,
|
|
|
|
offset + slen,
|
Btrfs: incremental send, fix emission of invalid clone operations
When doing an incremental send we can now issue clone operations with a
source range that ends at the source's file eof and with a destination
range that ends at an offset smaller then the destination's file eof.
If the eof of the source file is not aligned to the sector size of the
filesystem, the receiver will get a -EINVAL error when trying to do the
operation or, on older kernels, silently corrupt the destination file.
The corruption happens on kernels without commit ac765f83f1397646
("Btrfs: fix data corruption due to cloning of eof block"), while the
failure to clone happens on kernels with that commit.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
$ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
$ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
$ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.send /mnt/sdb/base
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
$ xfs_io -c "truncate 550K" /mnt/sdb/bar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.send /mnt/sdc
$ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
(...)
truncate bar size=563200
utimes bar
clone zoo - source=bar source offset=512000 offset=0 length=51200
ERROR: failed to clone extents to zoo
Invalid argument
The failure happens because the clone source range ends at the eof of file
bar, 563200, which is not aligned to the filesystems sector size (4Kb in
this case), and the destination range ends at offset 0 + 51200, which is
less then the size of the file zoo (2Mb).
So fix this by detecting such case and instead of issuing a clone
operation for the whole range, do a clone operation for smaller range
that is sector size aligned followed by a write operation for the block
containing the eof. Here we will always be pessimistic and assume the
destination filesystem of the send stream has the largest possible sector
size (64Kb), since we have no way of determining it.
This fixes a recent regression introduced in kernel 5.2-rc1.
Fixes: 040ee6120cb6706 ("Btrfs: send, improve clone range")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-20 16:57:00 +08:00
|
|
|
clone_len - slen);
|
|
|
|
} else {
|
|
|
|
ret = send_clone(sctx, offset, clone_len,
|
|
|
|
clone_root);
|
|
|
|
}
|
btrfs: send: avoid unaligned encoded writes when attempting to clone range
When trying to see if we can clone a file range, there are cases where we
end up sending two write operations in case the inode from the source root
has an i_size that is not sector size aligned and the length from the
current offset to its i_size is less than the remaining length we are
trying to clone.
Issuing two write operations when we could instead issue a single write
operation is not incorrect. However it is not optimal, specially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fallback to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.
The following example, which triggered a bug in the receiver code for the
fallback logic of decompressing + regular buffer IO and is fixed by the
patchset referred in a Link at the bottom of this changelog, is an example
where we have the non-optimal behaviour due to an unaligned encoded write:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null
mount -o compress $DEV $MNT
# File foo has a size of 33K, not aligned to the sector size.
xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo
xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar
# Now clone the first 32K of file bar into foo at offset 0.
xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo
# Snapshot the default subvolume and create a full send stream (v2).
btrfs subvolume snapshot -r $MNT $MNT/snap
btrfs send --compressed-data -f /tmp/test.send $MNT/snap
echo -e "\nFile bar in the original filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
echo -e "\nReceiving stream in a new filesystem..."
btrfs receive -f /tmp/test.send $MNT
echo -e "\nFile bar in the new filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the later being not sector size aligned
and causing the receiver to fallback to decompression + buffered writes.
The output of the btrfs receive command in verbose mode (-vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
write bar - offset=32768 length=1024
encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
(...)
This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
(...)
So we get more optimal behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).
Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-16 00:29:44 +08:00
|
|
|
} else if (crossed_src_i_size && clone_len < len) {
|
|
|
|
/*
|
|
|
|
* If we are at i_size of the clone source inode and we
|
|
|
|
* can not clone from it, terminate the loop. This is
|
|
|
|
* to avoid sending two write operations, one with a
|
|
|
|
* length matching clone_len and the final one after
|
|
|
|
* this loop with a length of len - clone_len.
|
|
|
|
*
|
|
|
|
* When using encoded writes (BTRFS_SEND_FLAG_COMPRESSED
|
|
|
|
* was passed to the send ioctl), this helps avoid
|
|
|
|
* sending an encoded write for an offset that is not
|
|
|
|
* sector size aligned, in case the i_size of the source
|
|
|
|
* inode is not sector size aligned. That will make the
|
|
|
|
* receiver fallback to decompression of the data and
|
|
|
|
* writing it using regular buffered IO, therefore while
|
|
|
|
* not incorrect, it's not optimal due decompression and
|
|
|
|
* possible re-compression at the receiver.
|
|
|
|
*/
|
|
|
|
break;
|
Btrfs: incremental send, fix emission of invalid clone operations
When doing an incremental send we can now issue clone operations with a
source range that ends at the source's file eof and with a destination
range that ends at an offset smaller then the destination's file eof.
If the eof of the source file is not aligned to the sector size of the
filesystem, the receiver will get a -EINVAL error when trying to do the
operation or, on older kernels, silently corrupt the destination file.
The corruption happens on kernels without commit ac765f83f1397646
("Btrfs: fix data corruption due to cloning of eof block"), while the
failure to clone happens on kernels with that commit.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
$ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
$ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
$ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.send /mnt/sdb/base
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
$ xfs_io -c "truncate 550K" /mnt/sdb/bar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.send /mnt/sdc
$ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
(...)
truncate bar size=563200
utimes bar
clone zoo - source=bar source offset=512000 offset=0 length=51200
ERROR: failed to clone extents to zoo
Invalid argument
The failure happens because the clone source range ends at the eof of file
bar, 563200, which is not aligned to the filesystems sector size (4Kb in
this case), and the destination range ends at offset 0 + 51200, which is
less then the size of the file zoo (2Mb).
So fix this by detecting such case and instead of issuing a clone
operation for the whole range, do a clone operation for smaller range
that is sector size aligned followed by a write operation for the block
containing the eof. Here we will always be pessimistic and assume the
destination filesystem of the send stream has the largest possible sector
size (64Kb), since we have no way of determining it.
This fixes a recent regression introduced in kernel 5.2-rc1.
Fixes: 040ee6120cb6706 ("Btrfs: send, improve clone range")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-20 16:57:00 +08:00
|
|
|
} else {
|
2022-03-18 01:25:42 +08:00
|
|
|
ret = send_extent_data(sctx, dst_path, offset,
|
|
|
|
clone_len);
|
Btrfs: incremental send, fix emission of invalid clone operations
When doing an incremental send we can now issue clone operations with a
source range that ends at the source's file eof and with a destination
range that ends at an offset smaller then the destination's file eof.
If the eof of the source file is not aligned to the sector size of the
filesystem, the receiver will get a -EINVAL error when trying to do the
operation or, on older kernels, silently corrupt the destination file.
The corruption happens on kernels without commit ac765f83f1397646
("Btrfs: fix data corruption due to cloning of eof block"), while the
failure to clone happens on kernels with that commit.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
$ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
$ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
$ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.send /mnt/sdb/base
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
$ xfs_io -c "truncate 550K" /mnt/sdb/bar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.send /mnt/sdc
$ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
(...)
truncate bar size=563200
utimes bar
clone zoo - source=bar source offset=512000 offset=0 length=51200
ERROR: failed to clone extents to zoo
Invalid argument
The failure happens because the clone source range ends at the eof of file
bar, 563200, which is not aligned to the filesystems sector size (4Kb in
this case), and the destination range ends at offset 0 + 51200, which is
less then the size of the file zoo (2Mb).
So fix this by detecting such case and instead of issuing a clone
operation for the whole range, do a clone operation for smaller range
that is sector size aligned followed by a write operation for the block
containing the eof. Here we will always be pessimistic and assume the
destination filesystem of the send stream has the largest possible sector
size (64Kb), since we have no way of determining it.
This fixes a recent regression introduced in kernel 5.2-rc1.
Fixes: 040ee6120cb6706 ("Btrfs: send, improve clone range")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-20 16:57:00 +08:00
|
|
|
}
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
len -= clone_len;
|
|
|
|
if (len == 0)
|
|
|
|
break;
|
|
|
|
offset += clone_len;
|
|
|
|
clone_root->offset += clone_len;
|
btrfs: send: fix invalid clone operations when cloning from the same file and root
When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5f4 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).
However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with EINVAL, when attempting
to execute the clone operation.
Example reproducer:
$ cat test-send-clone.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
# Create our test file with a single and large extent (1M) and with
# different content for different file ranges that will be reflinked
# later.
xfs_io -f \
-c "pwrite -S 0xab 0 128K" \
-c "pwrite -S 0xcd 128K 128K" \
-c "pwrite -S 0xef 256K 256K" \
-c "pwrite -S 0x1a 512K 512K" \
$MNT/foobar
btrfs subvolume snapshot -r $MNT $MNT/snap1
btrfs send -f /tmp/snap1.send $MNT/snap1
# Now do a series of changes to our file such that we end up with
# different parts of the extent reflinked into different file offsets
# and we overwrite a large part of the extent too, so no file extent
# items refer to that part that was overwritten. This used to confuse
# the algorithm used by the kernel to figure out which file ranges to
# clone, making it attempt to clone from a source range starting at
# the current eof of the file, resulting in the receiver to fail since
# it is an invalid clone operation.
#
xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
-c "reflink $MNT/foobar 0K 512K 256K" \
-c "reflink $MNT/foobar 512K 128K 256K" \
-c "pwrite -S 0x73 384K 640K" \
$MNT/foobar
btrfs subvolume snapshot -r $MNT $MNT/snap2
btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
echo -e "\nFile digest in the original filesystem:"
md5sum $MNT/snap2/foobar
# Now unmount the filesystem, create a new one, mount it and try to
# apply both send streams to recreate both snapshots.
umount $DEV
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
btrfs receive -f /tmp/snap1.send $MNT
btrfs receive -f /tmp/snap2.send $MNT
# Must match what we got in the original filesystem of course.
echo -e "\nFile digest in the new filesystem:"
md5sum $MNT/snap2/foobar
umount $MNT
When running the reproducer, the incremental send operation fails due to
an invalid clone operation:
$ ./test-send-clone.sh
wrote 131072/131072 bytes at offset 0
128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
wrote 131072/131072 bytes at offset 131072
128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
wrote 524288/524288 bytes at offset 524288
512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
linked 983040/983040 bytes at offset 1048576
960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
linked 262144/262144 bytes at offset 524288
256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
linked 262144/262144 bytes at offset 131072
256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
wrote 655360/655360 bytes at offset 393216
640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
File digest in the original filesystem:
9c13c61cb0b9f5abf45344375cb04dfa /mnt/sdi/snap2/foobar
At subvol snap1
At snapshot snap2
ERROR: failed to clone extents to foobar: Invalid argument
File digest in the new filesystem:
132f0396da8f48d2e667196bff882cfc /mnt/sdi/snap2/foobar
The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an EINVAL error from the clone operation when attempting it.
For the example above, what happens is the following:
1) When processing the extent at file offset 1M, the algorithm checks that
the extent is shared and can be (fully or partially) found at file
offset 0.
At this point the file has a size (and eof) of 1M at the receiver;
2) It finds that our extent item at file offset 1M has a data offset of
64K and, since the file extent item at file offset 0 has a data offset
of 0, it issues a clone operation, from the same file and root, that
has a source range offset of 64K, destination offset of 1M and a length
of 64K, since the extent item at file offset 0 refers only to the first
128K of the shared extent.
After this clone operation, the file size (and eof) at the receiver is
increased from 1M to 1088K (1M + 64K);
3) Now there's still 896K (960K - 64K) of data left to clone or write, so
it checks for the next file extent item, which starts at file offset
128K. This file extent item has a data offset of 0 and a length of
256K, so a clone operation with a source range offset of 256K, a
destination offset of 1088K (1M + 64K) and length of 128K is issued.
After this operation the file size (and eof) at the receiver increases
from 1088K to 1216K (1088K + 128K);
4) Now there's still 768K (896K - 128K) of data left to clone or write, so
it checks for the next file extent item, located at file offset 384K.
This file extent item points to a different extent, not the one we want
to clone, with a length of 640K. So we issue a write operation into the
file range 1216K (1088K + 128K, end of the last clone operation), with
a length of 640K and with a data matching the one we can find for that
range in send root.
After this operation, the file size (and eof) at the receiver increases
from 1216K to 1856K (1216K + 640K);
5) Now there's still 128K (768K - 640K) of data left to clone or write, so
we look into the file extent item, which is for file offset 1M and it
points to the extent we want to clone, with a data offset of 64K and a
length of 960K.
However this matches the file offset we started with, the start of the
range to clone into. So we can't for sure find any file extent item
from here onwards with the rest of the data we want to clone, yet we
proceed and since the file extent item points to the shared extent,
with a data offset of 64K, we issue a clone operation with a source
range starting at file offset 1856K, which matches the file extent
item's offset, 1M, plus the amount of data cloned and written so far,
which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
operation is invalid since the source range offset matches the current
eof of the file in the receiver. We should have stopped looking for
extents to clone at this point and instead fallback to write, which
would simply the contain the data in the file range from 1856K to
1856K + 128K.
So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.
A test case for fstests will follow soon.
Reported-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
Fixes: 11f2069c113e ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-11 19:41:42 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are cloning from the file we are currently processing,
|
|
|
|
* and using the send root as the clone root, we must stop once
|
|
|
|
* the current clone offset reaches the current eof of the file
|
|
|
|
* at the receiver, otherwise we would issue an invalid clone
|
|
|
|
* operation (source range going beyond eof) and cause the
|
|
|
|
* receiver to fail. So if we reach the current eof, bail out
|
|
|
|
* and fallback to a regular write.
|
|
|
|
*/
|
|
|
|
if (clone_root->root == sctx->send_root &&
|
|
|
|
clone_root->ino == sctx->cur_ino &&
|
|
|
|
clone_root->offset >= sctx->cur_inode_next_write_offset)
|
|
|
|
break;
|
|
|
|
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
data_offset += clone_len;
|
|
|
|
next:
|
|
|
|
path->slots[0]++;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (len > 0)
|
2022-03-18 01:25:42 +08:00
|
|
|
ret = send_extent_data(sctx, dst_path, offset, len);
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
else
|
|
|
|
ret = 0;
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
static int send_write_or_clone(struct send_ctx *sctx,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
struct clone_root *clone_root)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
u64 offset = key->offset;
|
2020-08-21 15:39:53 +08:00
|
|
|
u64 end;
|
2014-01-12 10:26:28 +08:00
|
|
|
u64 bs = sctx->send_root->fs_info->sb->s_blocksize;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2020-08-21 15:39:53 +08:00
|
|
|
end = min_t(u64, btrfs_file_extent_end(path), sctx->cur_inode_size);
|
|
|
|
if (offset >= end)
|
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2020-08-21 15:39:53 +08:00
|
|
|
if (clone_root && IS_ALIGNED(end, bs)) {
|
|
|
|
struct btrfs_file_extent_item *ei;
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
u64 disk_byte;
|
|
|
|
u64 data_offset;
|
|
|
|
|
2020-08-21 15:39:53 +08:00
|
|
|
ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_file_extent_item);
|
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.
So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).
The following test case for fstests reproduces the problem.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
_require_xfs_io_command "fpunch"
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single 100K extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Clone our file into a new file named bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now overwrite parts of our foo file.
$XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
-c "pwrite -S 0xcc 90K 10K" \
-c "fpunch 70K 10k" \
$SCRATCH_MNT/foo | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
# We expect the destination filesystem to have exactly the same file
# data as the original filesystem.
# The btrfs send implementation had a bug where it sent a clone
# operation from file foo into file bar covering the whole [0, 100K[
# range after creating and writing the file foo. This was incorrect
# because the file bar now included the updates done to file foo after
# we cloned foo to bar, breaking the COW nature of reflink copies
# (cloned extents).
echo "File digests in the new filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Another test case that reproduces the problem when we have compressed
extents:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
_require_cp_reflink
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our file with an extent of 100K starting at file offset 0K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Rewrite part of the previous extent (its first 40K) and write a new
# 100K extent starting at file offset 100K.
$XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \
-c "pwrite -S 0xcc 100K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Our file foo now has 3 file extent items in its metadata:
#
# 1) One covering the file range 0 to 40K;
# 2) One covering the file range 40K to 100K, which points to the first
# extent we wrote to the file and has a data offset field with value
# 40K (our file no longer uses the first 40K of data from that
# extent);
# 3) One covering the file range 100K to 200K.
# Now clone our file foo into file bar.
cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Create our snapshot for the send operation.
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/snap
echo "File digests in the original filesystem:"
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
_run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap
# Now recreate the filesystem by receiving the send stream and verify we
# get the same file contents that the original filesystem had.
# Btrfs send used to issue a clone operation from foo's range
# [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
# to by foo's second file extent item, this was incorrect because of bad
# accounting of the file extent item's data offset field. The correct
# range to clone from should have been [40K, 100K[.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
echo "File digests in the new filesystem:"
# Must match the digests we got in the original filesystem.
md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
md5sum $SCRATCH_MNT/snap/bar | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-02 17:47:34 +08:00
|
|
|
disk_byte = btrfs_file_extent_disk_bytenr(path->nodes[0], ei);
|
|
|
|
data_offset = btrfs_file_extent_offset(path->nodes[0], ei);
|
2022-03-18 01:25:42 +08:00
|
|
|
ret = clone_range(sctx, path, clone_root, disk_byte,
|
|
|
|
data_offset, offset, end - offset);
|
2013-02-05 04:54:57 +08:00
|
|
|
} else {
|
2022-03-18 01:25:42 +08:00
|
|
|
ret = send_extent_data(sctx, path, offset, end - offset);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
2020-08-21 15:39:53 +08:00
|
|
|
sctx->cur_inode_next_write_offset = end;
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int is_extent_unchanged(struct send_ctx *sctx,
|
|
|
|
struct btrfs_path *left_path,
|
|
|
|
struct btrfs_key *ekey)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_path *path = NULL;
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
int slot;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct btrfs_file_extent_item *ei;
|
|
|
|
u64 left_disknr;
|
|
|
|
u64 right_disknr;
|
|
|
|
u64 left_offset;
|
|
|
|
u64 right_offset;
|
|
|
|
u64 left_offset_fixed;
|
|
|
|
u64 left_len;
|
|
|
|
u64 right_len;
|
2012-08-08 04:25:13 +08:00
|
|
|
u64 left_gen;
|
|
|
|
u64 right_gen;
|
2012-07-26 05:19:24 +08:00
|
|
|
u8 left_type;
|
|
|
|
u8 right_type;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
eb = left_path->nodes[0];
|
|
|
|
slot = left_path->slots[0];
|
|
|
|
ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
|
|
|
|
left_type = btrfs_file_extent_type(eb, ei);
|
|
|
|
|
|
|
|
if (left_type != BTRFS_FILE_EXTENT_REG) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-08-08 04:25:13 +08:00
|
|
|
left_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
|
|
|
|
left_len = btrfs_file_extent_num_bytes(eb, ei);
|
|
|
|
left_offset = btrfs_file_extent_offset(eb, ei);
|
|
|
|
left_gen = btrfs_file_extent_generation(eb, ei);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Following comments will refer to these graphics. L is the left
|
|
|
|
* extents which we are checking at the moment. 1-8 are the right
|
|
|
|
* extents that we iterate.
|
|
|
|
*
|
|
|
|
* |-----L-----|
|
|
|
|
* |-1-|-2a-|-3-|-4-|-5-|-6-|
|
|
|
|
*
|
|
|
|
* |-----L-----|
|
|
|
|
* |--1--|-2b-|...(same as above)
|
|
|
|
*
|
|
|
|
* Alternative situation. Happens on files where extents got split.
|
|
|
|
* |-----L-----|
|
|
|
|
* |-----------7-----------|-6-|
|
|
|
|
*
|
|
|
|
* Alternative situation. Happens on files which got larger.
|
|
|
|
* |-----L-----|
|
|
|
|
* |-8-|
|
|
|
|
* Nothing follows after 8.
|
|
|
|
*/
|
|
|
|
|
|
|
|
key.objectid = ekey->objectid;
|
|
|
|
key.type = BTRFS_EXTENT_DATA_KEY;
|
|
|
|
key.offset = ekey->offset;
|
|
|
|
ret = btrfs_search_slot_for_read(sctx->parent_root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Handle special case where the right side has no extents at all.
|
|
|
|
*/
|
|
|
|
eb = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(eb, &found_key, slot);
|
|
|
|
if (found_key.objectid != key.objectid ||
|
|
|
|
found_key.type != key.type) {
|
2013-08-21 03:55:39 +08:00
|
|
|
/* If we're a hole then just pretend nothing changed */
|
|
|
|
ret = (left_disknr) ? 0 : 1;
|
2012-07-26 05:19:24 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We're now on 2a, 2b or 7.
|
|
|
|
*/
|
|
|
|
key = found_key;
|
|
|
|
while (key.offset < ekey->offset + left_len) {
|
|
|
|
ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
|
|
|
|
right_type = btrfs_file_extent_type(eb, ei);
|
2017-04-05 03:31:00 +08:00
|
|
|
if (right_type != BTRFS_FILE_EXTENT_REG &&
|
|
|
|
right_type != BTRFS_FILE_EXTENT_INLINE) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2017-04-05 03:31:00 +08:00
|
|
|
if (right_type == BTRFS_FILE_EXTENT_INLINE) {
|
2018-06-06 15:41:49 +08:00
|
|
|
right_len = btrfs_file_extent_ram_bytes(eb, ei);
|
2017-04-05 03:31:00 +08:00
|
|
|
right_len = PAGE_ALIGN(right_len);
|
|
|
|
} else {
|
|
|
|
right_len = btrfs_file_extent_num_bytes(eb, ei);
|
|
|
|
}
|
2013-11-01 04:49:02 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
|
|
|
* Are we at extent 8? If yes, we know the extent is changed.
|
|
|
|
* This may only happen on the first iteration.
|
|
|
|
*/
|
2012-08-01 18:49:15 +08:00
|
|
|
if (found_key.offset + right_len <= ekey->offset) {
|
2013-08-21 03:55:39 +08:00
|
|
|
/* If we're a hole just pretend nothing changed */
|
|
|
|
ret = (left_disknr) ? 0 : 1;
|
2012-07-26 05:19:24 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2017-04-05 03:31:00 +08:00
|
|
|
/*
|
|
|
|
* We just wanted to see if when we have an inline extent, what
|
|
|
|
* follows it is a regular extent (wanted to check the above
|
|
|
|
* condition for inline extents too). This should normally not
|
|
|
|
* happen but it's possible for example when we have an inline
|
|
|
|
* compressed extent representing data with a size matching
|
|
|
|
* the page size (currently the same as sector size).
|
|
|
|
*/
|
|
|
|
if (right_type == BTRFS_FILE_EXTENT_INLINE) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
Btrfs: incremental send, fix invalid memory access
When doing an incremental send, while processing an extent that changed
between the parent and send snapshots and that extent was an inline extent
in the parent snapshot, it's possible to access a memory region beyond
the end of leaf if the inline extent is very small and it is the first
item in a leaf.
An example scenario is described below.
The send snapshot has the following leaf:
leaf 33865728 items 33 free space 773 generation 46 owner 5
fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
(...)
item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
generation 36 type 1 (regular)
extent data disk byte 12791808 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
generation 36 type 1 (regular)
extent data disk byte 138170368 nr 225280
extent data offset 0 nr 225280 ram 225280
extent compression 0 (none)
(...)
And the parent snapshot has the following leaf:
leaf 31272960 items 17 free space 17 generation 31 owner 5
fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
generation 31 type 0 (inline)
inline extent data size 23 ram_bytes 613 compression 1 (zlib)
(...)
When computing the send stream, it is detected that the extent of inode
335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we
grab the leaf from the parent snapshot and access the inline extent item.
However, before jumping to the 'out' label, we access the 'offset' and
'disk_bytenr' fields of the extent item, which should not be done for
inline extents since the inlined data starts at the offset of the
'disk_bytenr' field and can be very small. For example accessing the
'offset' field of the file extent item results in the following trace:
[ 599.705368] general protection fault: 0000 [#1] PREEMPT SMP
[ 599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
[ 599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
[ 599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
[ 599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
[ 599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
[ 599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
[ 599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
[ 599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
[ 599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
[ 599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
[ 599.709340] FS: 00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
[ 599.709340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
[ 599.709340] Call Trace:
[ 599.709340] btrfs_get_token_64+0x93/0xce [btrfs]
[ 599.709340] ? printk+0x48/0x50
[ 599.709340] btrfs_get_64+0xb/0xd [btrfs]
[ 599.709340] process_extent+0x3a1/0x1106 [btrfs]
[ 599.709340] ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
[ 599.709340] changed_cb+0xb03/0xb3d [btrfs]
[ 599.709340] ? btrfs_get_token_32+0x7a/0xcc [btrfs]
[ 599.709340] btrfs_compare_trees+0x432/0x53d [btrfs]
[ 599.709340] ? process_extent+0x1106/0x1106 [btrfs]
[ 599.709340] btrfs_ioctl_send+0x960/0xe26 [btrfs]
[ 599.709340] btrfs_ioctl+0x181b/0x1fed [btrfs]
[ 599.709340] ? trace_hardirqs_on_caller+0x150/0x1ac
[ 599.709340] vfs_ioctl+0x21/0x38
[ 599.709340] ? vfs_ioctl+0x21/0x38
[ 599.709340] do_vfs_ioctl+0x611/0x645
[ 599.709340] ? rcu_read_unlock+0x5b/0x5d
[ 599.709340] ? __fget+0x6d/0x79
[ 599.709340] SyS_ioctl+0x57/0x7b
[ 599.709340] entry_SYSCALL_64_fastpath+0x18/0xad
[ 599.709340] RIP: 0033:0x7f51945eec47
[ 599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
[ 599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
[ 599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
[ 599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
[ 599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
[ 599.709340] ? trace_hardirqs_off_caller+0x43/0xb1
[ 599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
[ 599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
[ 599.762057] ---[ end trace fe00d7af61b9f49e ]---
This is because the 'offset' field starts at an offset of 37 bytes
(offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
bytes and therefore attemping to read it causes a 1 byte access beyond
the end of the leaf, as the first item's content in a leaf is located
at the tail of the leaf, the item size is 44 bytes and the offset of
that field plus its length (37 + 8 = 45) goes beyond the item's size
by 1 byte.
So fix this by accessing the 'offset' and 'disk_bytenr' fields after
jumping to the 'out' label if we are processing an inline extent. We
move the reading operation of the 'disk_bytenr' field too because we
have the same problem as for the 'offset' field explained above when
the inline data is less then 8 bytes. The access to the 'generation'
field is also moved but just for the sake of grouping access to all
the fields.
Fixes: e1cbfd7bf6da ("Btrfs: send, fix file hole not being preserved due to inline extent")
Cc: <stable@vger.kernel.org> # v4.12+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-07-06 22:31:46 +08:00
|
|
|
right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
|
|
|
|
right_offset = btrfs_file_extent_offset(eb, ei);
|
|
|
|
right_gen = btrfs_file_extent_generation(eb, ei);
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
left_offset_fixed = left_offset;
|
|
|
|
if (key.offset < ekey->offset) {
|
|
|
|
/* Fix the right offset for 2a and 7. */
|
|
|
|
right_offset += ekey->offset - key.offset;
|
|
|
|
} else {
|
|
|
|
/* Fix the left offset for all behind 2a and 2b */
|
|
|
|
left_offset_fixed += key.offset - ekey->offset;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if we have the same extent.
|
|
|
|
*/
|
2012-08-01 18:46:05 +08:00
|
|
|
if (left_disknr != right_disknr ||
|
2012-08-08 04:25:13 +08:00
|
|
|
left_offset_fixed != right_offset ||
|
|
|
|
left_gen != right_gen) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Go to the next extent.
|
|
|
|
*/
|
|
|
|
ret = btrfs_next_item(sctx->parent_root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (!ret) {
|
|
|
|
eb = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(eb, &found_key, slot);
|
|
|
|
}
|
|
|
|
if (ret || found_key.objectid != key.objectid ||
|
|
|
|
found_key.type != key.type) {
|
|
|
|
key.offset += right_len;
|
|
|
|
break;
|
2013-03-21 22:30:23 +08:00
|
|
|
}
|
|
|
|
if (found_key.offset != key.offset + right_len) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
key = found_key;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We're now behind the left extent (treat as unchanged) or at the end
|
|
|
|
* of the right side (treat as changed).
|
|
|
|
*/
|
|
|
|
if (key.offset >= ekey->offset + left_len)
|
|
|
|
ret = 1;
|
|
|
|
else
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-10-23 00:18:51 +08:00
|
|
|
static int get_last_extent(struct send_ctx *sctx, u64 offset)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_root *root = sctx->send_root;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
sctx->cur_inode_last_extent = 0;
|
|
|
|
|
|
|
|
key.objectid = sctx->cur_ino;
|
|
|
|
key.type = BTRFS_EXTENT_DATA_KEY;
|
|
|
|
key.offset = offset;
|
|
|
|
ret = btrfs_search_slot_for_read(root, &key, path, 0, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = 0;
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
|
|
|
if (key.objectid != sctx->cur_ino || key.type != BTRFS_EXTENT_DATA_KEY)
|
|
|
|
goto out;
|
|
|
|
|
2020-03-09 20:41:06 +08:00
|
|
|
sctx->cur_inode_last_extent = btrfs_file_extent_end(path);
|
2013-10-23 00:18:51 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-02-15 01:56:32 +08:00
|
|
|
static int range_is_hole_in_parent(struct send_ctx *sctx,
|
|
|
|
const u64 start,
|
|
|
|
const u64 end)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_root *root = sctx->parent_root;
|
|
|
|
u64 search_start = start;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = sctx->cur_ino;
|
|
|
|
key.type = BTRFS_EXTENT_DATA_KEY;
|
|
|
|
key.offset = search_start;
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0 && path->slots[0] > 0)
|
|
|
|
path->slots[0]--;
|
|
|
|
|
|
|
|
while (search_start < end) {
|
|
|
|
struct extent_buffer *leaf = path->nodes[0];
|
|
|
|
int slot = path->slots[0];
|
|
|
|
struct btrfs_file_extent_item *fi;
|
|
|
|
u64 extent_end;
|
|
|
|
|
|
|
|
if (slot >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
else if (ret > 0)
|
|
|
|
break;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
if (key.objectid < sctx->cur_ino ||
|
|
|
|
key.type < BTRFS_EXTENT_DATA_KEY)
|
|
|
|
goto next;
|
|
|
|
if (key.objectid > sctx->cur_ino ||
|
|
|
|
key.type > BTRFS_EXTENT_DATA_KEY ||
|
|
|
|
key.offset >= end)
|
|
|
|
break;
|
|
|
|
|
|
|
|
fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
|
2020-03-09 20:41:06 +08:00
|
|
|
extent_end = btrfs_file_extent_end(path);
|
2017-02-15 01:56:32 +08:00
|
|
|
if (extent_end <= start)
|
|
|
|
goto next;
|
|
|
|
if (btrfs_file_extent_disk_bytenr(leaf, fi) == 0) {
|
|
|
|
search_start = extent_end;
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
next:
|
|
|
|
path->slots[0]++;
|
|
|
|
}
|
|
|
|
ret = 1;
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-10-23 00:18:51 +08:00
|
|
|
static int maybe_send_hole(struct send_ctx *sctx, struct btrfs_path *path,
|
|
|
|
struct btrfs_key *key)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (sctx->cur_ino != key->objectid || !need_send_hole(sctx))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (sctx->cur_inode_last_extent == (u64)-1) {
|
|
|
|
ret = get_last_extent(sctx, key->offset - 1);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-01-28 09:38:06 +08:00
|
|
|
if (path->slots[0] == 0 &&
|
|
|
|
sctx->cur_inode_last_extent < key->offset) {
|
|
|
|
/*
|
|
|
|
* We might have skipped entire leafs that contained only
|
|
|
|
* file extent items for our current inode. These leafs have
|
|
|
|
* a generation number smaller (older) than the one in the
|
|
|
|
* current leaf and the leaf our last extent came from, and
|
|
|
|
* are located between these 2 leafs.
|
|
|
|
*/
|
|
|
|
ret = get_last_extent(sctx, key->offset - 1);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-02-15 01:56:32 +08:00
|
|
|
if (sctx->cur_inode_last_extent < key->offset) {
|
|
|
|
ret = range_is_hole_in_parent(sctx,
|
|
|
|
sctx->cur_inode_last_extent,
|
|
|
|
key->offset);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
else if (ret == 0)
|
|
|
|
ret = send_hole(sctx, key->offset);
|
|
|
|
else
|
|
|
|
ret = 0;
|
|
|
|
}
|
2020-03-09 20:41:06 +08:00
|
|
|
sctx->cur_inode_last_extent = btrfs_file_extent_end(path);
|
2013-10-23 00:18:51 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
static int process_extent(struct send_ctx *sctx,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *key)
|
|
|
|
{
|
|
|
|
struct clone_root *found_clone = NULL;
|
2013-08-21 03:55:39 +08:00
|
|
|
int ret = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (S_ISLNK(sctx->cur_inode_mode))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (sctx->parent_root && !sctx->cur_inode_new) {
|
|
|
|
ret = is_extent_unchanged(sctx, path, key);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
|
|
|
ret = 0;
|
2013-10-23 00:18:51 +08:00
|
|
|
goto out_hole;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
2013-08-21 03:55:39 +08:00
|
|
|
} else {
|
|
|
|
struct btrfs_file_extent_item *ei;
|
|
|
|
u8 type;
|
|
|
|
|
|
|
|
ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
|
|
|
|
struct btrfs_file_extent_item);
|
|
|
|
type = btrfs_file_extent_type(path->nodes[0], ei);
|
|
|
|
if (type == BTRFS_FILE_EXTENT_PREALLOC ||
|
|
|
|
type == BTRFS_FILE_EXTENT_REG) {
|
|
|
|
/*
|
|
|
|
* The send spec does not have a prealloc command yet,
|
|
|
|
* so just leave a hole for prealloc'ed extents until
|
|
|
|
* we have enough commands queued up to justify rev'ing
|
|
|
|
* the send spec.
|
|
|
|
*/
|
|
|
|
if (type == BTRFS_FILE_EXTENT_PREALLOC) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Have a hole, just skip it. */
|
|
|
|
if (btrfs_file_extent_disk_bytenr(path->nodes[0], ei) == 0) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ret = find_extent_clone(sctx, path, key->objectid, key->offset,
|
|
|
|
sctx->cur_inode_size, &found_clone);
|
|
|
|
if (ret != -ENOENT && ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = send_write_or_clone(sctx, path, key, found_clone);
|
2013-10-23 00:18:51 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
out_hole:
|
|
|
|
ret = maybe_send_hole(sctx, path, key);
|
2012-07-26 05:19:24 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int process_all_extents(struct send_ctx *sctx)
|
|
|
|
{
|
2022-03-09 21:50:48 +08:00
|
|
|
int ret = 0;
|
|
|
|
int iter_ret = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_root *root;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
|
|
|
|
root = sctx->send_root;
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = sctx->cmp_key->objectid;
|
|
|
|
key.type = BTRFS_EXTENT_DATA_KEY;
|
|
|
|
key.offset = 0;
|
2022-03-09 21:50:48 +08:00
|
|
|
btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) {
|
2012-07-26 05:19:24 +08:00
|
|
|
if (found_key.objectid != key.objectid ||
|
|
|
|
found_key.type != key.type) {
|
|
|
|
ret = 0;
|
2022-03-09 21:50:48 +08:00
|
|
|
break;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ret = process_extent(sctx, path, &found_key);
|
|
|
|
if (ret < 0)
|
2022-03-09 21:50:48 +08:00
|
|
|
break;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
2022-03-09 21:50:48 +08:00
|
|
|
/* Catch error found during iteration */
|
|
|
|
if (iter_ret < 0)
|
|
|
|
ret = iter_ret;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
static int process_recorded_refs_if_needed(struct send_ctx *sctx, int at_end,
|
|
|
|
int *pending_move,
|
|
|
|
int *refs_processed)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (sctx->cur_ino == 0)
|
|
|
|
goto out;
|
|
|
|
if (!at_end && sctx->cur_ino == sctx->cmp_key->objectid &&
|
2012-10-15 16:30:45 +08:00
|
|
|
sctx->cmp_key->type <= BTRFS_INODE_EXTREF_KEY)
|
2012-07-26 05:19:24 +08:00
|
|
|
goto out;
|
|
|
|
if (list_empty(&sctx->new_refs) && list_empty(&sctx->deleted_refs))
|
|
|
|
goto out;
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
ret = process_recorded_refs(sctx, pending_move);
|
2012-07-28 22:09:35 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
*refs_processed = 1;
|
2012-07-26 05:19:24 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int finish_inode_if_needed(struct send_ctx *sctx, int at_end)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
2022-08-12 22:42:32 +08:00
|
|
|
struct btrfs_inode_info info;
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 left_mode;
|
|
|
|
u64 left_uid;
|
|
|
|
u64 left_gid;
|
2022-05-19 00:02:55 +08:00
|
|
|
u64 left_fileattr;
|
2012-07-26 05:19:24 +08:00
|
|
|
u64 right_mode;
|
|
|
|
u64 right_uid;
|
|
|
|
u64 right_gid;
|
2022-05-19 00:02:55 +08:00
|
|
|
u64 right_fileattr;
|
2012-07-26 05:19:24 +08:00
|
|
|
int need_chmod = 0;
|
|
|
|
int need_chown = 0;
|
2022-05-19 00:02:55 +08:00
|
|
|
bool need_fileattr = false;
|
2018-02-07 04:40:40 +08:00
|
|
|
int need_truncate = 1;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
int pending_move = 0;
|
|
|
|
int refs_processed = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2018-07-24 18:54:04 +08:00
|
|
|
if (sctx->ignore_cur_inode)
|
|
|
|
return 0;
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
ret = process_recorded_refs_if_needed(sctx, at_end, &pending_move,
|
|
|
|
&refs_processed);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
/*
|
|
|
|
* We have processed the refs and thus need to advance send_progress.
|
|
|
|
* Now, calls to get_cur_xxx will take the updated refs of the current
|
|
|
|
* inode into account.
|
|
|
|
*
|
|
|
|
* On the other hand, if our current inode is a directory and couldn't
|
|
|
|
* be moved/renamed because its parent was renamed/moved too and it has
|
|
|
|
* a higher inode number, we can only move/rename our current inode
|
|
|
|
* after we moved/renamed its parent. Therefore in this case operate on
|
|
|
|
* the old path (pre move/rename) of our current inode, and the
|
|
|
|
* move/rename will be performed later.
|
|
|
|
*/
|
|
|
|
if (refs_processed && !pending_move)
|
|
|
|
sctx->send_progress = sctx->cur_ino + 1;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
if (sctx->cur_ino == 0 || sctx->cur_inode_deleted)
|
|
|
|
goto out;
|
|
|
|
if (!at_end && sctx->cmp_key->objectid == sctx->cur_ino)
|
|
|
|
goto out;
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_info(sctx->send_root, sctx->cur_ino, &info);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2022-08-12 22:42:32 +08:00
|
|
|
left_mode = info.mode;
|
|
|
|
left_uid = info.uid;
|
|
|
|
left_gid = info.gid;
|
|
|
|
left_fileattr = info.fileattr;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-10-17 21:52:47 +08:00
|
|
|
if (!sctx->parent_root || sctx->cur_inode_new) {
|
|
|
|
need_chown = 1;
|
|
|
|
if (!S_ISLNK(sctx->cur_inode_mode))
|
2012-07-26 05:19:24 +08:00
|
|
|
need_chmod = 1;
|
2018-02-07 04:40:40 +08:00
|
|
|
if (sctx->cur_inode_next_write_offset == sctx->cur_inode_size)
|
|
|
|
need_truncate = 0;
|
2012-10-17 21:52:47 +08:00
|
|
|
} else {
|
2018-02-07 04:40:40 +08:00
|
|
|
u64 old_size;
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_info(sctx->parent_root, sctx->cur_ino, &info);
|
2012-10-17 21:52:47 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2022-08-12 22:42:32 +08:00
|
|
|
old_size = info.size;
|
|
|
|
right_mode = info.mode;
|
|
|
|
right_uid = info.uid;
|
|
|
|
right_gid = info.gid;
|
|
|
|
right_fileattr = info.fileattr;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-10-17 21:52:47 +08:00
|
|
|
if (left_uid != right_uid || left_gid != right_gid)
|
|
|
|
need_chown = 1;
|
|
|
|
if (!S_ISLNK(sctx->cur_inode_mode) && left_mode != right_mode)
|
|
|
|
need_chmod = 1;
|
2022-05-19 00:02:55 +08:00
|
|
|
if (!S_ISLNK(sctx->cur_inode_mode) && left_fileattr != right_fileattr)
|
|
|
|
need_fileattr = true;
|
2018-02-07 04:40:40 +08:00
|
|
|
if ((old_size == sctx->cur_inode_size) ||
|
|
|
|
(sctx->cur_inode_size > old_size &&
|
|
|
|
sctx->cur_inode_next_write_offset == sctx->cur_inode_size))
|
|
|
|
need_truncate = 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (S_ISREG(sctx->cur_inode_mode)) {
|
2013-10-23 00:18:51 +08:00
|
|
|
if (need_send_hole(sctx)) {
|
2014-03-31 06:02:53 +08:00
|
|
|
if (sctx->cur_inode_last_extent == (u64)-1 ||
|
|
|
|
sctx->cur_inode_last_extent <
|
|
|
|
sctx->cur_inode_size) {
|
2013-10-23 00:18:51 +08:00
|
|
|
ret = get_last_extent(sctx, (u64)-1);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
btrfs: send: don't issue unnecessary zero writes for trailing hole
If we have a sparse file with a trailing hole (from the last extent's end
to i_size) and then create an extent in the file that ends before the
file's i_size, then when doing an incremental send we will issue a write
full of zeroes for the range that starts immediately after the new extent
ends up to i_size. While this isn't incorrect because the file ends up
with exactly the same data, it unnecessarily results in using extra space
at the destination with one or more extents full of zeroes instead of
having a hole. In same cases this results in using megabytes or even
gigabytes of unnecessary space.
Example, reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdh
MNT=/mnt/sdh
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create 1G sparse file.
xfs_io -f -c "truncate 1G" $MNT/foobar
# Create base snapshot.
btrfs subvolume snapshot -r $MNT $MNT/mysnap1
# Create send stream (full send) for the base snapshot.
btrfs send -f /tmp/1.snap $MNT/mysnap1
# Now write one extent at the beginning of the file and one somewhere
# in the middle, leaving a gap between the end of this second extent
# and the file's size.
xfs_io -c "pwrite -S 0xab 0 128K" \
-c "pwrite -S 0xcd 512M 128K" \
$MNT/foobar
# Now create a second snapshot which is going to be used for an
# incremental send operation.
btrfs subvolume snapshot -r $MNT $MNT/mysnap2
# Create send stream (incremental send) for the second snapshot.
btrfs send -p $MNT/mysnap1 -f /tmp/2.snap $MNT/mysnap2
# Now recreate the filesystem by receiving both send streams and
# verify we get the same content that the original filesystem had
# and file foobar has only two extents with a size of 128K each.
umount $MNT
mkfs.btrfs -f $DEV
mount $DEV $MNT
btrfs receive -f /tmp/1.snap $MNT
btrfs receive -f /tmp/2.snap $MNT
echo -e "\nFile fiemap in the second snapshot:"
# Should have:
#
# 128K extent at file range [0, 128K[
# hole at file range [128K, 512M[
# 128K extent file range [512M, 512M + 128K[
# hole at file range [512M + 128K, 1G[
xfs_io -r -c "fiemap -v" $MNT/mysnap2/foobar
# File should be using 256K of data (two 128K extents).
echo -e "\nSpace used by the file: $(du -h $MNT/mysnap2/foobar | cut -f 1)"
umount $MNT
Running the test, we can see with fiemap that we get an extent for the
range [512M, 1G[, while in the source filesystem we have an extent for
the range [512M, 512M + 128K[ and a hole for the rest of the file (the
range [512M + 128K, 1G[):
$ ./test.sh
(...)
File fiemap in the second snapshot:
/mnt/sdh/mysnap2/foobar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x0
1: [256..1048575]: hole 1048320
2: [1048576..2097151]: 2156544..3205119 1048576 0x1
Space used by the file: 513M
This happens because once we finish processing an inode, at
finish_inode_if_needed(), we always issue a hole (write operations full
of zeros) if there's a gap between the end of the last processed extent
and the file's size, even if that range is already a hole in the parent
snapshot. Fix this by issuing the hole only if the range is not already
a hole.
After this change, running the test above, we get the expected layout:
$ ./test.sh
(...)
File fiemap in the second snapshot:
/mnt/sdh/mysnap2/foobar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x0
1: [256..1048575]: hole 1048320
2: [1048576..1048831]: 26880..27135 256 0x1
3: [1048832..2097151]: hole 1048320
Space used by the file: 256K
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 6.1+
Reported-by: Dorai Ashok S A <dash.btrfs@inix.me>
Link: https://lore.kernel.org/linux-btrfs/c0bf7818-9c45-46a8-b3d3-513230d0c86e@inix.me/
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-17 06:17:10 +08:00
|
|
|
if (sctx->cur_inode_last_extent < sctx->cur_inode_size) {
|
|
|
|
ret = range_is_hole_in_parent(sctx,
|
|
|
|
sctx->cur_inode_last_extent,
|
|
|
|
sctx->cur_inode_size);
|
|
|
|
if (ret < 0) {
|
2013-10-23 00:18:51 +08:00
|
|
|
goto out;
|
btrfs: send: don't issue unnecessary zero writes for trailing hole
If we have a sparse file with a trailing hole (from the last extent's end
to i_size) and then create an extent in the file that ends before the
file's i_size, then when doing an incremental send we will issue a write
full of zeroes for the range that starts immediately after the new extent
ends up to i_size. While this isn't incorrect because the file ends up
with exactly the same data, it unnecessarily results in using extra space
at the destination with one or more extents full of zeroes instead of
having a hole. In same cases this results in using megabytes or even
gigabytes of unnecessary space.
Example, reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdh
MNT=/mnt/sdh
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create 1G sparse file.
xfs_io -f -c "truncate 1G" $MNT/foobar
# Create base snapshot.
btrfs subvolume snapshot -r $MNT $MNT/mysnap1
# Create send stream (full send) for the base snapshot.
btrfs send -f /tmp/1.snap $MNT/mysnap1
# Now write one extent at the beginning of the file and one somewhere
# in the middle, leaving a gap between the end of this second extent
# and the file's size.
xfs_io -c "pwrite -S 0xab 0 128K" \
-c "pwrite -S 0xcd 512M 128K" \
$MNT/foobar
# Now create a second snapshot which is going to be used for an
# incremental send operation.
btrfs subvolume snapshot -r $MNT $MNT/mysnap2
# Create send stream (incremental send) for the second snapshot.
btrfs send -p $MNT/mysnap1 -f /tmp/2.snap $MNT/mysnap2
# Now recreate the filesystem by receiving both send streams and
# verify we get the same content that the original filesystem had
# and file foobar has only two extents with a size of 128K each.
umount $MNT
mkfs.btrfs -f $DEV
mount $DEV $MNT
btrfs receive -f /tmp/1.snap $MNT
btrfs receive -f /tmp/2.snap $MNT
echo -e "\nFile fiemap in the second snapshot:"
# Should have:
#
# 128K extent at file range [0, 128K[
# hole at file range [128K, 512M[
# 128K extent file range [512M, 512M + 128K[
# hole at file range [512M + 128K, 1G[
xfs_io -r -c "fiemap -v" $MNT/mysnap2/foobar
# File should be using 256K of data (two 128K extents).
echo -e "\nSpace used by the file: $(du -h $MNT/mysnap2/foobar | cut -f 1)"
umount $MNT
Running the test, we can see with fiemap that we get an extent for the
range [512M, 1G[, while in the source filesystem we have an extent for
the range [512M, 512M + 128K[ and a hole for the rest of the file (the
range [512M + 128K, 1G[):
$ ./test.sh
(...)
File fiemap in the second snapshot:
/mnt/sdh/mysnap2/foobar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x0
1: [256..1048575]: hole 1048320
2: [1048576..2097151]: 2156544..3205119 1048576 0x1
Space used by the file: 513M
This happens because once we finish processing an inode, at
finish_inode_if_needed(), we always issue a hole (write operations full
of zeros) if there's a gap between the end of the last processed extent
and the file's size, even if that range is already a hole in the parent
snapshot. Fix this by issuing the hole only if the range is not already
a hole.
After this change, running the test above, we get the expected layout:
$ ./test.sh
(...)
File fiemap in the second snapshot:
/mnt/sdh/mysnap2/foobar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x0
1: [256..1048575]: hole 1048320
2: [1048576..1048831]: 26880..27135 256 0x1
3: [1048832..2097151]: hole 1048320
Space used by the file: 256K
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 6.1+
Reported-by: Dorai Ashok S A <dash.btrfs@inix.me>
Link: https://lore.kernel.org/linux-btrfs/c0bf7818-9c45-46a8-b3d3-513230d0c86e@inix.me/
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-17 06:17:10 +08:00
|
|
|
} else if (ret == 0) {
|
|
|
|
ret = send_hole(sctx, sctx->cur_inode_size);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
} else {
|
|
|
|
/* Range is already a hole, skip. */
|
|
|
|
ret = 0;
|
|
|
|
}
|
2013-10-23 00:18:51 +08:00
|
|
|
}
|
|
|
|
}
|
2018-02-07 04:40:40 +08:00
|
|
|
if (need_truncate) {
|
|
|
|
ret = send_truncate(sctx, sctx->cur_ino,
|
|
|
|
sctx->cur_inode_gen,
|
|
|
|
sctx->cur_inode_size);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (need_chown) {
|
|
|
|
ret = send_chown(sctx, sctx->cur_ino, sctx->cur_inode_gen,
|
|
|
|
left_uid, left_gid);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (need_chmod) {
|
|
|
|
ret = send_chmod(sctx, sctx->cur_ino, sctx->cur_inode_gen,
|
|
|
|
left_mode);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2022-05-19 00:02:55 +08:00
|
|
|
if (need_fileattr) {
|
|
|
|
ret = send_fileattr(sctx, sctx->cur_ino, sctx->cur_inode_gen,
|
|
|
|
left_fileattr);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2022-10-07 23:10:02 +08:00
|
|
|
|
|
|
|
if (proto_cmd_ok(sctx, BTRFS_SEND_C_ENABLE_VERITY)
|
|
|
|
&& sctx->cur_inode_needs_verity) {
|
2022-08-16 04:54:28 +08:00
|
|
|
ret = process_verity(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: send: emit file capabilities after chown
Whenever a chown is executed, all capabilities of the file being touched
are lost. When doing incremental send with a file with capabilities,
there is a situation where the capability can be lost on the receiving
side. The sequence of actions bellow shows the problem:
$ mount /dev/sda fs1
$ mount /dev/sdb fs2
$ touch fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_init
$ btrfs send fs1/snap_init | btrfs receive fs2
$ chgrp adm fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_complete
$ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
$ btrfs send fs1/snap_complete | btrfs receive fs2
$ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
At this point, only a chown was emitted by "btrfs send" since only the
group was changed. This makes the cap_sys_nice capability to be dropped
from fs2/snap_incremental/foo.bar
To fix that, only emit capabilities after chown is emitted. The current
code first checks for xattrs that are new/changed, emits them, and later
emit the chown. Now, __process_new_xattr skips capabilities, letting
only finish_inode_if_needed to emit them, if they exist, for the inode
being processed.
This behavior was being worked around in "btrfs receive" side by caching
the capability and only applying it after chown. Now, xattrs are only
emmited _after_ chown, making that workaround not needed anymore.
Link: https://github.com/kdave/btrfs-progs/issues/202
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-11 10:15:07 +08:00
|
|
|
ret = send_capabilities(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
/*
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
* If other directory inodes depended on our current directory
|
|
|
|
* inode's move/rename, now do their move/rename operations.
|
2012-07-26 05:19:24 +08:00
|
|
|
*/
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
if (!is_waiting_for_move(sctx, sctx->cur_ino)) {
|
|
|
|
ret = apply_children_dir_moves(sctx);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
2014-03-03 20:28:40 +08:00
|
|
|
/*
|
|
|
|
* Need to send that every time, no matter if it actually
|
|
|
|
* changed between the two trees as we have done changes to
|
|
|
|
* the inode before. If our inode is a directory and it's
|
|
|
|
* waiting to be moved/renamed, we will send its utimes when
|
|
|
|
* it's moved/renamed, therefore we don't need to do it here.
|
|
|
|
*/
|
|
|
|
sctx->send_progress = sctx->cur_ino + 1;
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the current inode is a non-empty directory, delay issuing
|
|
|
|
* the utimes command for it, as it's very likely we have inodes
|
|
|
|
* with an higher number inside it. We want to issue the utimes
|
|
|
|
* command only after adding all dentries to it.
|
|
|
|
*/
|
|
|
|
if (S_ISDIR(sctx->cur_inode_mode) && sctx->cur_inode_size > 0)
|
|
|
|
ret = cache_dir_utimes(sctx, sctx->cur_ino, sctx->cur_inode_gen);
|
|
|
|
else
|
|
|
|
ret = send_utimes(sctx, sctx->cur_ino, sctx->cur_inode_gen);
|
|
|
|
|
2014-03-03 20:28:40 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
out:
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
if (!ret)
|
|
|
|
ret = trim_dir_utimes_cache(sctx);
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 18:47:30 +08:00
|
|
|
static void close_current_inode(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
u64 i_size;
|
|
|
|
|
|
|
|
if (sctx->cur_inode == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
i_size = i_size_read(sctx->cur_inode);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are doing an incremental send, we may have extents between the
|
|
|
|
* last processed extent and the i_size that have not been processed
|
|
|
|
* because they haven't changed but we may have read some of their pages
|
|
|
|
* through readahead, see the comments at send_extent_data().
|
|
|
|
*/
|
|
|
|
if (sctx->clean_page_cache && sctx->page_cache_clear_start < i_size)
|
|
|
|
truncate_inode_pages_range(&sctx->cur_inode->i_data,
|
|
|
|
sctx->page_cache_clear_start,
|
|
|
|
round_up(i_size, PAGE_SIZE) - 1);
|
|
|
|
|
|
|
|
iput(sctx->cur_inode);
|
|
|
|
sctx->cur_inode = NULL;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
static int changed_inode(struct send_ctx *sctx,
|
|
|
|
enum btrfs_compare_tree_result result)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_key *key = sctx->cmp_key;
|
|
|
|
struct btrfs_inode_item *left_ii = NULL;
|
|
|
|
struct btrfs_inode_item *right_ii = NULL;
|
|
|
|
u64 left_gen = 0;
|
|
|
|
u64 right_gen = 0;
|
|
|
|
|
btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 18:47:30 +08:00
|
|
|
close_current_inode(sctx);
|
2022-05-06 01:16:14 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->cur_ino = key->objectid;
|
2022-06-03 00:03:08 +08:00
|
|
|
sctx->cur_inode_new_gen = false;
|
2013-10-23 00:18:51 +08:00
|
|
|
sctx->cur_inode_last_extent = (u64)-1;
|
2018-02-07 04:40:40 +08:00
|
|
|
sctx->cur_inode_next_write_offset = 0;
|
2018-07-24 18:54:04 +08:00
|
|
|
sctx->ignore_cur_inode = false;
|
2012-07-28 22:09:35 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Set send_progress to current inode. This will tell all get_cur_xxx
|
|
|
|
* functions that the current inode's refs are not updated yet. Later,
|
|
|
|
* when process_recorded_refs is finished, it is set to cur_ino + 1.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->send_progress = sctx->cur_ino;
|
|
|
|
|
|
|
|
if (result == BTRFS_COMPARE_TREE_NEW ||
|
|
|
|
result == BTRFS_COMPARE_TREE_CHANGED) {
|
|
|
|
left_ii = btrfs_item_ptr(sctx->left_path->nodes[0],
|
|
|
|
sctx->left_path->slots[0],
|
|
|
|
struct btrfs_inode_item);
|
|
|
|
left_gen = btrfs_inode_generation(sctx->left_path->nodes[0],
|
|
|
|
left_ii);
|
|
|
|
} else {
|
|
|
|
right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
|
|
|
|
sctx->right_path->slots[0],
|
|
|
|
struct btrfs_inode_item);
|
|
|
|
right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
|
|
|
|
right_ii);
|
|
|
|
}
|
|
|
|
if (result == BTRFS_COMPARE_TREE_CHANGED) {
|
|
|
|
right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
|
|
|
|
sctx->right_path->slots[0],
|
|
|
|
struct btrfs_inode_item);
|
|
|
|
|
|
|
|
right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
|
|
|
|
right_ii);
|
2012-08-01 20:48:59 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The cur_ino = root dir case is special here. We can't treat
|
|
|
|
* the inode as deleted+reused because it would generate a
|
|
|
|
* stream that tries to delete/mkdir the root dir.
|
|
|
|
*/
|
|
|
|
if (left_gen != right_gen &&
|
|
|
|
sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
|
2022-06-03 00:03:08 +08:00
|
|
|
sctx->cur_inode_new_gen = true;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2018-07-24 18:54:04 +08:00
|
|
|
/*
|
|
|
|
* Normally we do not find inodes with a link count of zero (orphans)
|
|
|
|
* because the most common case is to create a snapshot and use it
|
|
|
|
* for a send operation. However other less common use cases involve
|
|
|
|
* using a subvolume and send it after turning it to RO mode just
|
|
|
|
* after deleting all hard links of a file while holding an open
|
|
|
|
* file descriptor against it or turning a RO snapshot into RW mode,
|
|
|
|
* keep an open file descriptor against a file, delete it and then
|
|
|
|
* turn the snapshot back to RO mode before using it for a send
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
* operation. The former is what the receiver operation does.
|
|
|
|
* Therefore, if we want to send these snapshots soon after they're
|
|
|
|
* received, we need to handle orphan inodes as well. Moreover, orphans
|
|
|
|
* can appear not only in the send snapshot but also in the parent
|
|
|
|
* snapshot. Here are several cases:
|
|
|
|
*
|
|
|
|
* Case 1: BTRFS_COMPARE_TREE_NEW
|
|
|
|
* | send snapshot | action
|
|
|
|
* --------------------------------
|
|
|
|
* nlink | 0 | ignore
|
|
|
|
*
|
|
|
|
* Case 2: BTRFS_COMPARE_TREE_DELETED
|
|
|
|
* | parent snapshot | action
|
|
|
|
* ----------------------------------
|
|
|
|
* nlink | 0 | as usual
|
|
|
|
* Note: No unlinks will be sent because there're no paths for it.
|
|
|
|
*
|
|
|
|
* Case 3: BTRFS_COMPARE_TREE_CHANGED
|
|
|
|
* | | parent snapshot | send snapshot | action
|
|
|
|
* -----------------------------------------------------------------------
|
|
|
|
* subcase 1 | nlink | 0 | 0 | ignore
|
|
|
|
* subcase 2 | nlink | >0 | 0 | new_gen(deletion)
|
|
|
|
* subcase 3 | nlink | 0 | >0 | new_gen(creation)
|
|
|
|
*
|
2018-07-24 18:54:04 +08:00
|
|
|
*/
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
if (result == BTRFS_COMPARE_TREE_NEW) {
|
|
|
|
if (btrfs_inode_nlink(sctx->left_path->nodes[0], left_ii) == 0) {
|
2018-07-24 18:54:04 +08:00
|
|
|
sctx->ignore_cur_inode = true;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->cur_inode_gen = left_gen;
|
2022-06-03 00:03:08 +08:00
|
|
|
sctx->cur_inode_new = true;
|
|
|
|
sctx->cur_inode_deleted = false;
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->cur_inode_size = btrfs_inode_size(
|
|
|
|
sctx->left_path->nodes[0], left_ii);
|
|
|
|
sctx->cur_inode_mode = btrfs_inode_mode(
|
|
|
|
sctx->left_path->nodes[0], left_ii);
|
2014-02-27 17:29:01 +08:00
|
|
|
sctx->cur_inode_rdev = btrfs_inode_rdev(
|
|
|
|
sctx->left_path->nodes[0], left_ii);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
|
2012-07-28 16:42:24 +08:00
|
|
|
ret = send_create_inode_if_needed(sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
} else if (result == BTRFS_COMPARE_TREE_DELETED) {
|
|
|
|
sctx->cur_inode_gen = right_gen;
|
2022-06-03 00:03:08 +08:00
|
|
|
sctx->cur_inode_new = false;
|
|
|
|
sctx->cur_inode_deleted = true;
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->cur_inode_size = btrfs_inode_size(
|
|
|
|
sctx->right_path->nodes[0], right_ii);
|
|
|
|
sctx->cur_inode_mode = btrfs_inode_mode(
|
|
|
|
sctx->right_path->nodes[0], right_ii);
|
|
|
|
} else if (result == BTRFS_COMPARE_TREE_CHANGED) {
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
u32 new_nlinks, old_nlinks;
|
|
|
|
|
|
|
|
new_nlinks = btrfs_inode_nlink(sctx->left_path->nodes[0], left_ii);
|
|
|
|
old_nlinks = btrfs_inode_nlink(sctx->right_path->nodes[0], right_ii);
|
|
|
|
if (new_nlinks == 0 && old_nlinks == 0) {
|
|
|
|
sctx->ignore_cur_inode = true;
|
|
|
|
goto out;
|
|
|
|
} else if (new_nlinks == 0 || old_nlinks == 0) {
|
|
|
|
sctx->cur_inode_new_gen = 1;
|
|
|
|
}
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* We need to do some special handling in case the inode was
|
|
|
|
* reported as changed with a changed generation number. This
|
|
|
|
* means that the original inode was deleted and new inode
|
|
|
|
* reused the same inum. So we have to treat the old inode as
|
|
|
|
* deleted and the new one as new.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
if (sctx->cur_inode_new_gen) {
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* First, process the inode as if it was deleted.
|
|
|
|
*/
|
btrfs: send: fix send failure of a subcase of orphan inodes
Commit 9ed0a72e5b35 ("btrfs: send: fix failures when processing inodes with
no links") tries to fix all incremental send cases of orphan inodes the
send operation will meet. However, there's still a bug causing the corner
subcase fails with a ENOENT error.
Here's shortened steps of that subcase:
$ btrfs subvolume create vol
$ touch vol/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the file while
# holding an open file descriptor on it
$ btrfs property set snap2 ro false
$ exec 73<snap2/foo
$ rm snap2/foo
# Set the second snapshot back to RO mode and do an incremental send
# with an unusal reverse order
$ btrfs property set snap2 ro true
$ btrfs send -p snap2 snap1 > /dev/null
At subvol snap1
ERROR: send ioctl failed with -2: No such file or directory
It's subcase 3 of BTRFS_COMPARE_TREE_CHANGED in the commit 9ed0a72e5b35
("btrfs: send: fix failures when processing inodes with no links"). And
it's not a common case. We still have not met it in the real world.
Theoretically, this case can happen in a batch cascading snapshot backup.
In cascading backups, the receive operation in the middle may cause orphan
inodes to appear because of the open file descriptors on the snapshot files
during receiving. And if we don't do the batch snapshot backups in their
creation order, then we can have an inode, which is an orphan in the parent
snapshot but refers to a file in the send snapshot. Since an orphan inode
has no paths, the send operation will fail with a ENOENT error if it
tries to generate a path for it.
In that patch, this subcase will be treated as an inode with a new
generation. However, when the routine tries to delete the old paths in
the parent snapshot, the function process_all_refs() doesn't check whether
there are paths recorded or not before it calls the function
process_recorded_refs(). And the function process_recorded_refs() try
to get the first path in the parent snapshot in the beginning. Since it has
no paths in the parent snapshot, the send operation fails.
To fix this, we can easily put a link count check to avoid entering the
deletion routine like what we do a link count check to avoid creating a
new one. Moreover, we can assume that the function process_all_refs()
can always collect references to process because we know it has a
positive link count.
Fixes: 9ed0a72e5b35 ("btrfs: send: fix failures when processing inodes with no links")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-16 23:33:46 +08:00
|
|
|
if (old_nlinks > 0) {
|
|
|
|
sctx->cur_inode_gen = right_gen;
|
|
|
|
sctx->cur_inode_new = false;
|
|
|
|
sctx->cur_inode_deleted = true;
|
|
|
|
sctx->cur_inode_size = btrfs_inode_size(
|
|
|
|
sctx->right_path->nodes[0], right_ii);
|
|
|
|
sctx->cur_inode_mode = btrfs_inode_mode(
|
|
|
|
sctx->right_path->nodes[0], right_ii);
|
|
|
|
ret = process_all_refs(sctx,
|
|
|
|
BTRFS_COMPARE_TREE_DELETED);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Now process the inode as if it was new.
|
|
|
|
*/
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
if (new_nlinks > 0) {
|
|
|
|
sctx->cur_inode_gen = left_gen;
|
|
|
|
sctx->cur_inode_new = true;
|
|
|
|
sctx->cur_inode_deleted = false;
|
|
|
|
sctx->cur_inode_size = btrfs_inode_size(
|
|
|
|
sctx->left_path->nodes[0],
|
|
|
|
left_ii);
|
|
|
|
sctx->cur_inode_mode = btrfs_inode_mode(
|
|
|
|
sctx->left_path->nodes[0],
|
|
|
|
left_ii);
|
|
|
|
sctx->cur_inode_rdev = btrfs_inode_rdev(
|
|
|
|
sctx->left_path->nodes[0],
|
|
|
|
left_ii);
|
|
|
|
ret = send_create_inode_if_needed(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
ret = process_all_refs(sctx, BTRFS_COMPARE_TREE_NEW);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
/*
|
|
|
|
* Advance send_progress now as we did not get
|
|
|
|
* into process_recorded_refs_if_needed in the
|
|
|
|
* new_gen case.
|
|
|
|
*/
|
|
|
|
sctx->send_progress = sctx->cur_ino + 1;
|
2012-07-28 20:11:31 +08:00
|
|
|
|
btrfs: send: fix failures when processing inodes with no links
There is a bug causing send failures when processing an orphan directory
with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when
root has deleted files still open")', the orphan inode issue was
addressed. The send operation fails with a ENOENT error because of any
attempts to generate a path for the inode with a link count of zero.
Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
set if the current inode has a link count of zero for bypassing some
unnecessary steps. And a helper function btrfs_unlink_all_paths() was
introduced and called to clean up old paths found in the parent
snapshot. However, not only regular files but also directories can be
orphan inodes. So if the send operation meets an orphan directory, it
will issue a wrong unlink command for that directory now. Soon the
receive operation fails with a EISDIR error. Besides, the send operation
also fails with a ENOENT error later when it tries to generate a path of
it.
Similar example but making an orphan dir for an incremental send:
$ btrfs subvolume create vol
$ mkdir vol/dir
$ touch vol/dir/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the whole dir while
# holding an open file descriptor on it.
$ btrfs property set snap2 ro false
$ exec 73<snap2/dir
$ rm -rf snap2/dir
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set snap2 ro true
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
At snapshot snap2
ERROR: send ioctl failed with -2: No such file or directory
ERROR: unlink dir failed. Is a directory
Actually, orphan inodes are more common use cases in cascading backups.
(Please see the illustration below.) In a cascading backup, a user wants
to replicate a couple of snapshots from Machine A to Machine B and from
Machine B to Machine C. Machine B doesn't take any RO snapshots for
sending. All a receiver does is create an RW snapshot of its parent
snapshot, apply the send stream and turn it into RO mode at the end.
Even if all paths of some inodes are deleted in applying the send
stream, these inodes would not be deleted and become orphans after
changing the subvolume from RW to RO. Moreover, orphan inodes can occur
not only in send snapshots but also in parent snapshots because Machine
B may do a batch replication of a couple of snapshots.
An illustration for cascading backups:
Machine A (snapshot {1..n}) --> Machine B --> Machine C
The idea to solve the problem is to delete all the items of orphan
inodes before using these snapshots for sending. I used to think that
the reasonable timing for doing that is during the ioctl of changing the
subvolume from RW to RO because it sounds good that we will not modify
the fs tree of a RO snapshot anymore. However, attempting to do the
orphan cleanup in the ioctl would be pointless. Because if someone is
holding an open file descriptor on the inode, the reference count of the
inode will never drop to 0. Then iput() cannot trigger eviction, which
finally deletes all the items of it. So we try to extend the original
patch to handle orphans in send/parent snapshots. Here are several cases
that need to be considered:
Case 1: BTRFS_COMPARE_TREE_NEW
| send snapshot | action
--------------------------------
nlink | 0 | ignore
In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
it means that a new inode is found in the send snapshot and it doesn't
appear in the parent snapshot. Since this inode has a link count of zero
(It's an orphan and there're no paths for it.), we can leverage
sctx->ignore_cur_inode in the original patch to prevent it from being
created.
Case 2: BTRFS_COMPARE_TREE_DELETED
| parent snapshot | action
----------------------------------
nlink | 0 | as usual
In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
result, it means that the inode only appears in the parent snapshot.
As usual, the send operation will try to delete all its paths. However,
this inode has a link count of zero, so no paths of it will be found. No
deletion operations will be issued. We don't need to change any logic.
Case 3: BTRFS_COMPARE_TREE_CHANGED
| | parent snapshot | send snapshot | action
-----------------------------------------------------------------------
subcase 1 | nlink | 0 | 0 | ignore
subcase 2 | nlink | >0 | 0 | new_gen(deletion)
subcase 3 | nlink | 0 | >0 | new_gen(creation)
In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
it means that the inode appears in both snapshots. Here are 3 subcases.
First, when the inode has link counts of zero in both snapshots. Since
there are no paths for this inode in (source/destination) parent
snapshots and we don't care about whether there is also an orphan inode
in destination or not, we can set sctx->ignore_cur_inode on to prevent
it from being created.
For the second and the third subcases, if there are paths in one
snapshot and there're no paths in the other snapshot for this inode. We
can treat this inode as a new generation. We can also leverage the logic
handling a new generation of an inode with small adjustments. Then it
will delete all old paths and create a new inode with new attributes and
paths only when there's a positive link count in the send snapshot.
In subcase 2, the send operation only needs to delete all old paths as
in the parent snapshot. But it may require more operations for a
directory to remove its old paths. If a not-empty directory is going to
be deleted (because it has a link count of zero in the send snapshot)
but there are files/directories with bigger inode numbers under it, the
send operation will need to rename it to its orphan name first. After
processing and deleting the last item under this directory, the send
operation will check this directory, aka the parent directory of the
last item, again and issue a rmdir operation to remove it finally.
Therefore, we also need to treat inodes with a link count of zero as if
they didn't exist in get_cur_inode_state(), which is used in
process_recorded_refs(). By doing this, when checking a directory with
orphan names after the last item under it has been deleted, the send
operation now can properly issue a rmdir operation. Otherwise, without
doing this, the orphan directory with an orphan name would be kept here
at the end due to the existing inode with a link count of zero being
found.
In subcase 3, as in case 2, no old paths would be found, so no deletion
operations will be issued. The send operation will only create a new one
for that inode.
Note that subcase 3 is not common. That's because it's easy to reduce
the hard links of an inode, but once all valid paths are removed,
there are no valid paths for creating other hard links. The only way to
do that is trying to send an older snapshot after a newer snapshot has
been sent.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-12 22:42:33 +08:00
|
|
|
/*
|
|
|
|
* Now process all extents and xattrs of the
|
|
|
|
* inode as if they were all new.
|
|
|
|
*/
|
|
|
|
ret = process_all_extents(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = process_all_new_xattrs(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
} else {
|
|
|
|
sctx->cur_inode_gen = left_gen;
|
2022-06-03 00:03:08 +08:00
|
|
|
sctx->cur_inode_new = false;
|
|
|
|
sctx->cur_inode_new_gen = false;
|
|
|
|
sctx->cur_inode_deleted = false;
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->cur_inode_size = btrfs_inode_size(
|
|
|
|
sctx->left_path->nodes[0], left_ii);
|
|
|
|
sctx->cur_inode_mode = btrfs_inode_mode(
|
|
|
|
sctx->left_path->nodes[0], left_ii);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* We have to process new refs before deleted refs, but compare_trees gives us
|
|
|
|
* the new and deleted refs mixed. To fix this, we record the new/deleted refs
|
|
|
|
* first and later process them in process_recorded_refs.
|
|
|
|
* For the cur_inode_new_gen case, we skip recording completely because
|
|
|
|
* changed_inode did already initiate processing of refs. The reason for this is
|
|
|
|
* that in this case, compare_tree actually compares the refs of 2 different
|
|
|
|
* inodes. To fix this, process_all_refs is used in changed_inode to handle all
|
|
|
|
* refs of the right tree as deleted and all refs of the left tree as new.
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
static int changed_ref(struct send_ctx *sctx,
|
|
|
|
enum btrfs_compare_tree_result result)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
Btrfs: send, don't bug on inconsistent snapshots
When doing an incremental send, if we find a new/modified/deleted extent,
reference or xattr without having previously processed the corresponding
inode item we end up exexuting a BUG_ON(). This is because whenever an
extent, xattr or reference is added, modified or deleted, we always expect
to have the corresponding inode item updated. However there are situations
where this will not happen due to transient -ENOMEM or -ENOSPC errors when
doing delayed inode updates.
For example, when punching holes we can succeed in deleting and modifying
(shrinking) extents but later fail to do the delayed inode update. So after
such failure we close our transaction handle and right after a snapshot of
the fs/subvol tree can be made and used later for a send operation. The
same thing can happen during truncate, link, unlink, and xattr related
operations.
So instead of executing a BUG_ON, make send return an -EIO error and print
an informative error message do dmesg/syslog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 08:50:37 +08:00
|
|
|
if (sctx->cur_ino != sctx->cmp_key->objectid) {
|
|
|
|
inconsistent_snapshot_error(sctx, result, "reference");
|
|
|
|
return -EIO;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (!sctx->cur_inode_new_gen &&
|
|
|
|
sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
if (result == BTRFS_COMPARE_TREE_NEW)
|
|
|
|
ret = record_new_ref(sctx);
|
|
|
|
else if (result == BTRFS_COMPARE_TREE_DELETED)
|
|
|
|
ret = record_deleted_ref(sctx);
|
|
|
|
else if (result == BTRFS_COMPARE_TREE_CHANGED)
|
|
|
|
ret = record_changed_ref(sctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Process new/deleted/changed xattrs. We skip processing in the
|
|
|
|
* cur_inode_new_gen case because changed_inode did already initiate processing
|
|
|
|
* of xattrs. The reason is the same as in changed_ref
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
static int changed_xattr(struct send_ctx *sctx,
|
|
|
|
enum btrfs_compare_tree_result result)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
Btrfs: send, don't bug on inconsistent snapshots
When doing an incremental send, if we find a new/modified/deleted extent,
reference or xattr without having previously processed the corresponding
inode item we end up exexuting a BUG_ON(). This is because whenever an
extent, xattr or reference is added, modified or deleted, we always expect
to have the corresponding inode item updated. However there are situations
where this will not happen due to transient -ENOMEM or -ENOSPC errors when
doing delayed inode updates.
For example, when punching holes we can succeed in deleting and modifying
(shrinking) extents but later fail to do the delayed inode update. So after
such failure we close our transaction handle and right after a snapshot of
the fs/subvol tree can be made and used later for a send operation. The
same thing can happen during truncate, link, unlink, and xattr related
operations.
So instead of executing a BUG_ON, make send return an -EIO error and print
an informative error message do dmesg/syslog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 08:50:37 +08:00
|
|
|
if (sctx->cur_ino != sctx->cmp_key->objectid) {
|
|
|
|
inconsistent_snapshot_error(sctx, result, "xattr");
|
|
|
|
return -EIO;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
|
|
|
|
if (result == BTRFS_COMPARE_TREE_NEW)
|
|
|
|
ret = process_new_xattr(sctx);
|
|
|
|
else if (result == BTRFS_COMPARE_TREE_DELETED)
|
|
|
|
ret = process_deleted_xattr(sctx);
|
|
|
|
else if (result == BTRFS_COMPARE_TREE_CHANGED)
|
|
|
|
ret = process_changed_xattr(sctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Process new/deleted/changed extents. We skip processing in the
|
|
|
|
* cur_inode_new_gen case because changed_inode did already initiate processing
|
|
|
|
* of extents. The reason is the same as in changed_ref
|
|
|
|
*/
|
2012-07-26 05:19:24 +08:00
|
|
|
static int changed_extent(struct send_ctx *sctx,
|
|
|
|
enum btrfs_compare_tree_result result)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
Btrfs: fix incremental send failure after deduplication
When doing an incremental send operation we can fail if we previously did
deduplication operations against a file that exists in both snapshots. In
that case we will fail the send operation with -EIO and print a message
to dmesg/syslog like the following:
BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
extent for inode 257 without updated inode item, send root is 258, \
parent root is 257
This requires that we deduplicate to the same file in both snapshots for
the same amount of times on each snapshot. The issue happens because a
deduplication only updates the iversion of an inode and does not update
any other field of the inode, therefore if we deduplicate the file on
each snapshot for the same amount of time, the inode will have the same
iversion value (stored as the "sequence" field on the inode item) on both
snapshots, therefore it will be seen as unchanged between in the send
snapshot while there are new/updated/deleted extent items when comparing
to the parent snapshot. This makes the send operation return -EIO and
print an error message.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our first file. The first half of the file has several 64Kb
# extents while the second half as a single 512Kb extent.
$ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
$ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
# Create the base snapshot and the parent send stream from it.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
$ btrfs send -f /tmp/1.snap /mnt/mysnap1
# Create our second file, that has exactly the same data as the first
# file.
$ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
# Create the second snapshot, used for the incremental send, before
# doing the file deduplication.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
# Now before creating the incremental send stream:
#
# 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
# will drop several extent items and add a new one, also updating
# the inode's iversion (sequence field in inode item) by 1, but not
# any other field of the inode;
#
# 2) Deduplicate into a different subrange of file foo in snapshot
# mysnap2. This will replace an extent item with a new one, also
# updating the inode's iversion by 1 but not any other field of the
# inode.
#
# After these two deduplication operations, the inode items, for file
# foo, are identical in both snapshots, but we have different extent
# items for this inode in both snapshots. We want to check this doesn't
# cause send to fail with an error or produce an incorrect stream.
$ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
$ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
# Create the incremental send stream.
$ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
ERROR: send ioctl failed with -5: Input/output error
This issue started happening back in 2015 when deduplication was updated
to not update the inode's ctime and mtime and update only the iversion.
Back then we would hit a BUG_ON() in send, but later in 2016 send was
updated to return -EIO and print the error message instead of doing the
BUG_ON().
A test case for fstests follows soon.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
Fixes: 1c919a5e13702c ("btrfs: don't update mtime/ctime on deduped inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-17 20:23:39 +08:00
|
|
|
/*
|
|
|
|
* We have found an extent item that changed without the inode item
|
|
|
|
* having changed. This can happen either after relocation (where the
|
|
|
|
* disk_bytenr of an extent item is replaced at
|
|
|
|
* relocation.c:replace_file_extents()) or after deduplication into a
|
|
|
|
* file in both the parent and send snapshots (where an extent item can
|
|
|
|
* get modified or replaced with a new one). Note that deduplication
|
|
|
|
* updates the inode item, but it only changes the iversion (sequence
|
|
|
|
* field in the inode item) of the inode, so if a file is deduplicated
|
|
|
|
* the same amount of times in both the parent and send snapshots, its
|
2021-05-21 23:42:23 +08:00
|
|
|
* iversion becomes the same in both snapshots, whence the inode item is
|
Btrfs: fix incremental send failure after deduplication
When doing an incremental send operation we can fail if we previously did
deduplication operations against a file that exists in both snapshots. In
that case we will fail the send operation with -EIO and print a message
to dmesg/syslog like the following:
BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
extent for inode 257 without updated inode item, send root is 258, \
parent root is 257
This requires that we deduplicate to the same file in both snapshots for
the same amount of times on each snapshot. The issue happens because a
deduplication only updates the iversion of an inode and does not update
any other field of the inode, therefore if we deduplicate the file on
each snapshot for the same amount of time, the inode will have the same
iversion value (stored as the "sequence" field on the inode item) on both
snapshots, therefore it will be seen as unchanged between in the send
snapshot while there are new/updated/deleted extent items when comparing
to the parent snapshot. This makes the send operation return -EIO and
print an error message.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our first file. The first half of the file has several 64Kb
# extents while the second half as a single 512Kb extent.
$ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
$ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
# Create the base snapshot and the parent send stream from it.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
$ btrfs send -f /tmp/1.snap /mnt/mysnap1
# Create our second file, that has exactly the same data as the first
# file.
$ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
# Create the second snapshot, used for the incremental send, before
# doing the file deduplication.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
# Now before creating the incremental send stream:
#
# 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
# will drop several extent items and add a new one, also updating
# the inode's iversion (sequence field in inode item) by 1, but not
# any other field of the inode;
#
# 2) Deduplicate into a different subrange of file foo in snapshot
# mysnap2. This will replace an extent item with a new one, also
# updating the inode's iversion by 1 but not any other field of the
# inode.
#
# After these two deduplication operations, the inode items, for file
# foo, are identical in both snapshots, but we have different extent
# items for this inode in both snapshots. We want to check this doesn't
# cause send to fail with an error or produce an incorrect stream.
$ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
$ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
# Create the incremental send stream.
$ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
ERROR: send ioctl failed with -5: Input/output error
This issue started happening back in 2015 when deduplication was updated
to not update the inode's ctime and mtime and update only the iversion.
Back then we would hit a BUG_ON() in send, but later in 2016 send was
updated to return -EIO and print the error message instead of doing the
BUG_ON().
A test case for fstests follows soon.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
Fixes: 1c919a5e13702c ("btrfs: don't update mtime/ctime on deduped inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-17 20:23:39 +08:00
|
|
|
* the same on both snapshots.
|
|
|
|
*/
|
|
|
|
if (sctx->cur_ino != sctx->cmp_key->objectid)
|
|
|
|
return 0;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
|
|
|
|
if (result != BTRFS_COMPARE_TREE_DELETED)
|
|
|
|
ret = process_extent(sctx, sctx->left_path,
|
|
|
|
sctx->cmp_key);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-08-16 04:54:28 +08:00
|
|
|
static int changed_verity(struct send_ctx *sctx, enum btrfs_compare_tree_result result)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
|
|
|
|
if (result == BTRFS_COMPARE_TREE_NEW)
|
|
|
|
sctx->cur_inode_needs_verity = true;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-08-17 04:52:55 +08:00
|
|
|
static int dir_changed(struct send_ctx *sctx, u64 dir)
|
|
|
|
{
|
|
|
|
u64 orig_gen, new_gen;
|
|
|
|
int ret;
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(sctx->send_root, dir, &new_gen);
|
2013-08-17 04:52:55 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2022-08-12 22:42:32 +08:00
|
|
|
ret = get_inode_gen(sctx->parent_root, dir, &orig_gen);
|
2013-08-17 04:52:55 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
return (orig_gen != new_gen) ? 1 : 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int compare_refs(struct send_ctx *sctx, struct btrfs_path *path,
|
|
|
|
struct btrfs_key *key)
|
|
|
|
{
|
|
|
|
struct btrfs_inode_extref *extref;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
u64 dirid = 0, last_dirid = 0;
|
|
|
|
unsigned long ptr;
|
|
|
|
u32 item_size;
|
|
|
|
u32 cur_offset = 0;
|
|
|
|
int ref_name_len;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/* Easy case, just check this one dirid */
|
|
|
|
if (key->type == BTRFS_INODE_REF_KEY) {
|
|
|
|
dirid = key->offset;
|
|
|
|
|
|
|
|
ret = dir_changed(sctx, dirid);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
2021-10-22 02:58:35 +08:00
|
|
|
item_size = btrfs_item_size(leaf, path->slots[0]);
|
2013-08-17 04:52:55 +08:00
|
|
|
ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
|
|
|
|
while (cur_offset < item_size) {
|
|
|
|
extref = (struct btrfs_inode_extref *)(ptr +
|
|
|
|
cur_offset);
|
|
|
|
dirid = btrfs_inode_extref_parent(leaf, extref);
|
|
|
|
ref_name_len = btrfs_inode_extref_name_len(leaf, extref);
|
|
|
|
cur_offset += ref_name_len + sizeof(*extref);
|
|
|
|
if (dirid == last_dirid)
|
|
|
|
continue;
|
|
|
|
ret = dir_changed(sctx, dirid);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
last_dirid = dirid;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-28 20:11:31 +08:00
|
|
|
/*
|
|
|
|
* Updates compare related fields in sctx and simply forwards to the actual
|
|
|
|
* changed_xxx functions.
|
|
|
|
*/
|
2017-08-21 17:43:45 +08:00
|
|
|
static int changed_cb(struct btrfs_path *left_path,
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_path *right_path,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
enum btrfs_compare_tree_result result,
|
2021-01-26 03:43:25 +08:00
|
|
|
struct send_ctx *sctx)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
/*
|
|
|
|
* We can not hold the commit root semaphore here. This is because in
|
|
|
|
* the case of sending and receiving to the same filesystem, using a
|
|
|
|
* pipe, could result in a deadlock:
|
|
|
|
*
|
|
|
|
* 1) The task running send blocks on the pipe because it's full;
|
|
|
|
*
|
|
|
|
* 2) The task running receive, which is the only consumer of the pipe,
|
|
|
|
* is waiting for a transaction commit (for example due to a space
|
|
|
|
* reservation when doing a write or triggering a transaction commit
|
|
|
|
* when creating a subvolume);
|
|
|
|
*
|
|
|
|
* 3) The transaction is waiting to write lock the commit root semaphore,
|
|
|
|
* but can not acquire it since it's being held at 1).
|
|
|
|
*
|
|
|
|
* Down this call chain we write to the pipe through kernel_write().
|
|
|
|
* The same type of problem can also happen when sending to a file that
|
|
|
|
* is stored in the same filesystem - when reserving space for a write
|
|
|
|
* into the file, we can trigger a transaction commit.
|
|
|
|
*
|
|
|
|
* Our caller has supplied us with clones of leaves from the send and
|
|
|
|
* parent roots, so we're safe here from a concurrent relocation and
|
|
|
|
* further reallocation of metadata extents while we are here. Below we
|
|
|
|
* also assert that the leaves are clones.
|
|
|
|
*/
|
|
|
|
lockdep_assert_not_held(&sctx->send_root->fs_info->commit_root_sem);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We always have a send root, so left_path is never NULL. We will not
|
|
|
|
* have a leaf when we have reached the end of the send root but have
|
|
|
|
* not yet reached the end of the parent root.
|
|
|
|
*/
|
|
|
|
if (left_path->nodes[0])
|
|
|
|
ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED,
|
|
|
|
&left_path->nodes[0]->bflags));
|
|
|
|
/*
|
|
|
|
* When doing a full send we don't have a parent root, so right_path is
|
|
|
|
* NULL. When doing an incremental send, we may have reached the end of
|
|
|
|
* the parent root already, so we don't have a leaf at right_path.
|
|
|
|
*/
|
|
|
|
if (right_path && right_path->nodes[0])
|
|
|
|
ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED,
|
|
|
|
&right_path->nodes[0]->bflags));
|
|
|
|
|
2013-08-17 04:52:55 +08:00
|
|
|
if (result == BTRFS_COMPARE_TREE_SAME) {
|
2013-10-23 00:18:51 +08:00
|
|
|
if (key->type == BTRFS_INODE_REF_KEY ||
|
|
|
|
key->type == BTRFS_INODE_EXTREF_KEY) {
|
|
|
|
ret = compare_refs(sctx, left_path, key);
|
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
} else if (key->type == BTRFS_EXTENT_DATA_KEY) {
|
|
|
|
return maybe_send_hole(sctx, left_path, key);
|
|
|
|
} else {
|
2013-08-17 04:52:55 +08:00
|
|
|
return 0;
|
2013-10-23 00:18:51 +08:00
|
|
|
}
|
2013-08-17 04:52:55 +08:00
|
|
|
result = BTRFS_COMPARE_TREE_CHANGED;
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->left_path = left_path;
|
|
|
|
sctx->right_path = right_path;
|
|
|
|
sctx->cmp_key = key;
|
|
|
|
|
|
|
|
ret = finish_inode_if_needed(sctx, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2012-08-01 20:47:03 +08:00
|
|
|
/* Ignore non-FS objects */
|
|
|
|
if (key->objectid == BTRFS_FREE_INO_OBJECTID ||
|
|
|
|
key->objectid == BTRFS_FREE_SPACE_OBJECTID)
|
|
|
|
goto out;
|
|
|
|
|
2018-07-24 18:54:04 +08:00
|
|
|
if (key->type == BTRFS_INODE_ITEM_KEY) {
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = changed_inode(sctx, result);
|
2018-07-24 18:54:04 +08:00
|
|
|
} else if (!sctx->ignore_cur_inode) {
|
|
|
|
if (key->type == BTRFS_INODE_REF_KEY ||
|
|
|
|
key->type == BTRFS_INODE_EXTREF_KEY)
|
|
|
|
ret = changed_ref(sctx, result);
|
|
|
|
else if (key->type == BTRFS_XATTR_ITEM_KEY)
|
|
|
|
ret = changed_xattr(sctx, result);
|
|
|
|
else if (key->type == BTRFS_EXTENT_DATA_KEY)
|
|
|
|
ret = changed_extent(sctx, result);
|
2022-08-16 04:54:28 +08:00
|
|
|
else if (key->type == BTRFS_VERITY_DESC_ITEM_KEY &&
|
|
|
|
key->offset == 0)
|
|
|
|
ret = changed_verity(sctx, result);
|
2018-07-24 18:54:04 +08:00
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
static int search_key_again(const struct send_ctx *sctx,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
const struct btrfs_key *key)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!path->need_commit_sem)
|
|
|
|
lockdep_assert_held_read(&root->fs_info->commit_root_sem);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Roots used for send operations are readonly and no one can add,
|
|
|
|
* update or remove keys from them, so we should be able to find our
|
|
|
|
* key again. The only exception is deduplication, which can operate on
|
|
|
|
* readonly roots and add, update or remove keys to/from them - but at
|
|
|
|
* the moment we don't allow it to run in parallel with send.
|
|
|
|
*/
|
|
|
|
ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
|
|
|
|
ASSERT(ret <= 0);
|
|
|
|
if (ret > 0) {
|
|
|
|
btrfs_print_tree(path->nodes[path->lowest_level], false);
|
|
|
|
btrfs_err(root->fs_info,
|
|
|
|
"send: key (%llu %u %llu) not found in %s root %llu, lowest_level %d, slot %d",
|
|
|
|
key->objectid, key->type, key->offset,
|
|
|
|
(root == sctx->parent_root ? "parent" : "send"),
|
|
|
|
root->root_key.objectid, path->lowest_level,
|
|
|
|
path->slots[path->lowest_level]);
|
|
|
|
return -EUCLEAN;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
static int full_send_tree(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_root *send_root = sctx->send_root;
|
|
|
|
struct btrfs_key key;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
struct btrfs_fs_info *fs_info = send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
|
|
|
|
path = alloc_path_for_send();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2021-03-31 18:56:21 +08:00
|
|
|
path->reada = READA_FORWARD_ALWAYS;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
key.objectid = BTRFS_FIRST_FREE_OBJECTID;
|
|
|
|
key.type = BTRFS_INODE_ITEM_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
down_read(&fs_info->commit_root_sem);
|
|
|
|
sctx->last_reloc_trans = fs_info->last_reloc_trans;
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = btrfs_search_slot_for_read(send_root, &key, path, 1, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret)
|
|
|
|
goto out_finish;
|
|
|
|
|
|
|
|
while (1) {
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
2018-07-23 16:10:09 +08:00
|
|
|
ret = changed_cb(path, NULL, &key,
|
2017-08-21 17:43:45 +08:00
|
|
|
BTRFS_COMPARE_TREE_NEW, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
down_read(&fs_info->commit_root_sem);
|
|
|
|
if (fs_info->last_reloc_trans > sctx->last_reloc_trans) {
|
|
|
|
sctx->last_reloc_trans = fs_info->last_reloc_trans;
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
|
|
|
/*
|
|
|
|
* A transaction used for relocating a block group was
|
|
|
|
* committed or is about to finish its commit. Release
|
|
|
|
* our path (leaf) and restart the search, so that we
|
|
|
|
* avoid operating on any file extent items that are
|
|
|
|
* stale, with a disk_bytenr that reflects a pre
|
|
|
|
* relocation value. This way we avoid as much as
|
|
|
|
* possible to fallback to regular writes when checking
|
|
|
|
* if we can clone file ranges.
|
|
|
|
*/
|
|
|
|
btrfs_release_path(path);
|
|
|
|
ret = search_key_again(sctx, send_root, path, &key);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
} else {
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = btrfs_next_item(send_root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret) {
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
out_finish:
|
|
|
|
ret = finish_inode_if_needed(sctx, 1);
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
static int replace_node_with_clone(struct btrfs_path *path, int level)
|
|
|
|
{
|
|
|
|
struct extent_buffer *clone;
|
|
|
|
|
|
|
|
clone = btrfs_clone_extent_buffer(path->nodes[level]);
|
|
|
|
if (!clone)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
free_extent_buffer(path->nodes[level]);
|
|
|
|
path->nodes[level] = clone;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
static int tree_move_down(struct btrfs_path *path, int *level, u64 reada_min_gen)
|
2019-08-22 01:12:59 +08:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb;
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
struct extent_buffer *parent = path->nodes[*level];
|
|
|
|
int slot = path->slots[*level];
|
|
|
|
const int nritems = btrfs_header_nritems(parent);
|
|
|
|
u64 reada_max;
|
|
|
|
u64 reada_done = 0;
|
2019-08-22 01:12:59 +08:00
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
lockdep_assert_held_read(&parent->fs_info->commit_root_sem);
|
|
|
|
|
2019-08-22 01:12:59 +08:00
|
|
|
BUG_ON(*level == 0);
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
eb = btrfs_read_node_slot(parent, slot);
|
2019-08-22 01:12:59 +08:00
|
|
|
if (IS_ERR(eb))
|
|
|
|
return PTR_ERR(eb);
|
|
|
|
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
/*
|
|
|
|
* Trigger readahead for the next leaves we will process, so that it is
|
|
|
|
* very likely that when we need them they are already in memory and we
|
|
|
|
* will not block on disk IO. For nodes we only do readahead for one,
|
|
|
|
* since the time window between processing nodes is typically larger.
|
|
|
|
*/
|
|
|
|
reada_max = (*level == 1 ? SZ_128K : eb->fs_info->nodesize);
|
|
|
|
|
|
|
|
for (slot++; slot < nritems && reada_done < reada_max; slot++) {
|
|
|
|
if (btrfs_node_ptr_generation(parent, slot) > reada_min_gen) {
|
|
|
|
btrfs_readahead_node_child(parent, slot);
|
|
|
|
reada_done += eb->fs_info->nodesize;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-08-22 01:12:59 +08:00
|
|
|
path->nodes[*level - 1] = eb;
|
|
|
|
path->slots[*level - 1] = 0;
|
|
|
|
(*level)--;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
|
|
|
|
if (*level == 0)
|
|
|
|
return replace_node_with_clone(path, 0);
|
|
|
|
|
2019-08-22 01:12:59 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tree_move_next_or_upnext(struct btrfs_path *path,
|
|
|
|
int *level, int root_level)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
int nritems;
|
|
|
|
nritems = btrfs_header_nritems(path->nodes[*level]);
|
|
|
|
|
|
|
|
path->slots[*level]++;
|
|
|
|
|
|
|
|
while (path->slots[*level] >= nritems) {
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
if (*level == root_level) {
|
|
|
|
path->slots[*level] = nritems - 1;
|
2019-08-22 01:12:59 +08:00
|
|
|
return -1;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
}
|
2019-08-22 01:12:59 +08:00
|
|
|
|
|
|
|
/* move upnext */
|
|
|
|
path->slots[*level] = 0;
|
|
|
|
free_extent_buffer(path->nodes[*level]);
|
|
|
|
path->nodes[*level] = NULL;
|
|
|
|
(*level)++;
|
|
|
|
path->slots[*level]++;
|
|
|
|
|
|
|
|
nritems = btrfs_header_nritems(path->nodes[*level]);
|
|
|
|
ret = 1;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns 1 if it had to move up and next. 0 is returned if it moved only next
|
|
|
|
* or down.
|
|
|
|
*/
|
|
|
|
static int tree_advance(struct btrfs_path *path,
|
|
|
|
int *level, int root_level,
|
|
|
|
int allow_down,
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
struct btrfs_key *key,
|
|
|
|
u64 reada_min_gen)
|
2019-08-22 01:12:59 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (*level == 0 || !allow_down) {
|
|
|
|
ret = tree_move_next_or_upnext(path, level, root_level);
|
|
|
|
} else {
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
ret = tree_move_down(path, level, reada_min_gen);
|
2019-08-22 01:12:59 +08:00
|
|
|
}
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Even if we have reached the end of a tree, ret is -1, update the key
|
|
|
|
* anyway, so that in case we need to restart due to a block group
|
|
|
|
* relocation, we can assert that the last key of the root node still
|
|
|
|
* exists in the tree.
|
|
|
|
*/
|
|
|
|
if (*level == 0)
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[*level], key,
|
|
|
|
path->slots[*level]);
|
|
|
|
else
|
|
|
|
btrfs_node_key_to_cpu(path->nodes[*level], key,
|
|
|
|
path->slots[*level]);
|
|
|
|
|
2019-08-22 01:12:59 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tree_compare_item(struct btrfs_path *left_path,
|
|
|
|
struct btrfs_path *right_path,
|
|
|
|
char *tmp_buf)
|
|
|
|
{
|
|
|
|
int cmp;
|
|
|
|
int len1, len2;
|
|
|
|
unsigned long off1, off2;
|
|
|
|
|
2021-10-22 02:58:35 +08:00
|
|
|
len1 = btrfs_item_size(left_path->nodes[0], left_path->slots[0]);
|
|
|
|
len2 = btrfs_item_size(right_path->nodes[0], right_path->slots[0]);
|
2019-08-22 01:12:59 +08:00
|
|
|
if (len1 != len2)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
off1 = btrfs_item_ptr_offset(left_path->nodes[0], left_path->slots[0]);
|
|
|
|
off2 = btrfs_item_ptr_offset(right_path->nodes[0],
|
|
|
|
right_path->slots[0]);
|
|
|
|
|
|
|
|
read_extent_buffer(left_path->nodes[0], tmp_buf, off1, len1);
|
|
|
|
|
|
|
|
cmp = memcmp_extent_buffer(right_path->nodes[0], tmp_buf, off2, len1);
|
|
|
|
if (cmp)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
/*
|
|
|
|
* A transaction used for relocating a block group was committed or is about to
|
|
|
|
* finish its commit. Release our paths and restart the search, so that we are
|
|
|
|
* not using stale extent buffers:
|
|
|
|
*
|
|
|
|
* 1) For levels > 0, we are only holding references of extent buffers, without
|
|
|
|
* any locks on them, which does not prevent them from having been relocated
|
|
|
|
* and reallocated after the last time we released the commit root semaphore.
|
|
|
|
* The exception are the root nodes, for which we always have a clone, see
|
|
|
|
* the comment at btrfs_compare_trees();
|
|
|
|
*
|
|
|
|
* 2) For leaves, level 0, we are holding copies (clones) of extent buffers, so
|
|
|
|
* we are safe from the concurrent relocation and reallocation. However they
|
|
|
|
* can have file extent items with a pre relocation disk_bytenr value, so we
|
|
|
|
* restart the start from the current commit roots and clone the new leaves so
|
|
|
|
* that we get the post relocation disk_bytenr values. Not doing so, could
|
|
|
|
* make us clone the wrong data in case there are new extents using the old
|
|
|
|
* disk_bytenr that happen to be shared.
|
|
|
|
*/
|
|
|
|
static int restart_after_relocation(struct btrfs_path *left_path,
|
|
|
|
struct btrfs_path *right_path,
|
|
|
|
const struct btrfs_key *left_key,
|
|
|
|
const struct btrfs_key *right_key,
|
|
|
|
int left_level,
|
|
|
|
int right_level,
|
|
|
|
const struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int root_level;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
lockdep_assert_held_read(&sctx->send_root->fs_info->commit_root_sem);
|
|
|
|
|
|
|
|
btrfs_release_path(left_path);
|
|
|
|
btrfs_release_path(right_path);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since keys can not be added or removed to/from our roots because they
|
|
|
|
* are readonly and we do not allow deduplication to run in parallel
|
|
|
|
* (which can add, remove or change keys), the layout of the trees should
|
|
|
|
* not change.
|
|
|
|
*/
|
|
|
|
left_path->lowest_level = left_level;
|
|
|
|
ret = search_key_again(sctx, sctx->send_root, left_path, left_key);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
right_path->lowest_level = right_level;
|
|
|
|
ret = search_key_again(sctx, sctx->parent_root, right_path, right_key);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the lowest level nodes are leaves, clone them so that they can be
|
|
|
|
* safely used by changed_cb() while not under the protection of the
|
|
|
|
* commit root semaphore, even if relocation and reallocation happens in
|
|
|
|
* parallel.
|
|
|
|
*/
|
|
|
|
if (left_level == 0) {
|
|
|
|
ret = replace_node_with_clone(left_path, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (right_level == 0) {
|
|
|
|
ret = replace_node_with_clone(right_path, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now clone the root nodes (unless they happen to be the leaves we have
|
|
|
|
* already cloned). This is to protect against concurrent snapshotting of
|
|
|
|
* the send and parent roots (see the comment at btrfs_compare_trees()).
|
|
|
|
*/
|
|
|
|
root_level = btrfs_header_level(sctx->send_root->commit_root);
|
|
|
|
if (root_level > 0) {
|
|
|
|
ret = replace_node_with_clone(left_path, root_level);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
root_level = btrfs_header_level(sctx->parent_root->commit_root);
|
|
|
|
if (root_level > 0) {
|
|
|
|
ret = replace_node_with_clone(right_path, root_level);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-08-22 01:12:59 +08:00
|
|
|
/*
|
|
|
|
* This function compares two trees and calls the provided callback for
|
|
|
|
* every changed/new/deleted item it finds.
|
|
|
|
* If shared tree blocks are encountered, whole subtrees are skipped, making
|
|
|
|
* the compare pretty fast on snapshotted subvolumes.
|
|
|
|
*
|
|
|
|
* This currently works on commit roots only. As commit roots are read only,
|
|
|
|
* we don't do any locking. The commit roots are protected with transactions.
|
|
|
|
* Transactions are ended and rejoined when a commit is tried in between.
|
|
|
|
*
|
|
|
|
* This function checks for modifications done to the trees while comparing.
|
|
|
|
* If it detects a change, it aborts immediately.
|
|
|
|
*/
|
|
|
|
static int btrfs_compare_trees(struct btrfs_root *left_root,
|
2021-01-26 03:43:25 +08:00
|
|
|
struct btrfs_root *right_root, struct send_ctx *sctx)
|
2019-08-22 01:12:59 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = left_root->fs_info;
|
|
|
|
int ret;
|
|
|
|
int cmp;
|
|
|
|
struct btrfs_path *left_path = NULL;
|
|
|
|
struct btrfs_path *right_path = NULL;
|
|
|
|
struct btrfs_key left_key;
|
|
|
|
struct btrfs_key right_key;
|
|
|
|
char *tmp_buf = NULL;
|
|
|
|
int left_root_level;
|
|
|
|
int right_root_level;
|
|
|
|
int left_level;
|
|
|
|
int right_level;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
int left_end_reached = 0;
|
|
|
|
int right_end_reached = 0;
|
|
|
|
int advance_left = 0;
|
|
|
|
int advance_right = 0;
|
2019-08-22 01:12:59 +08:00
|
|
|
u64 left_blockptr;
|
|
|
|
u64 right_blockptr;
|
|
|
|
u64 left_gen;
|
|
|
|
u64 right_gen;
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
u64 reada_min_gen;
|
2019-08-22 01:12:59 +08:00
|
|
|
|
|
|
|
left_path = btrfs_alloc_path();
|
|
|
|
if (!left_path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
right_path = btrfs_alloc_path();
|
|
|
|
if (!right_path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
tmp_buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
|
|
|
|
if (!tmp_buf) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
left_path->search_commit_root = 1;
|
|
|
|
left_path->skip_locking = 1;
|
|
|
|
right_path->search_commit_root = 1;
|
|
|
|
right_path->skip_locking = 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Strategy: Go to the first items of both trees. Then do
|
|
|
|
*
|
|
|
|
* If both trees are at level 0
|
|
|
|
* Compare keys of current items
|
|
|
|
* If left < right treat left item as new, advance left tree
|
|
|
|
* and repeat
|
|
|
|
* If left > right treat right item as deleted, advance right tree
|
|
|
|
* and repeat
|
|
|
|
* If left == right do deep compare of items, treat as changed if
|
|
|
|
* needed, advance both trees and repeat
|
|
|
|
* If both trees are at the same level but not at level 0
|
|
|
|
* Compare keys of current nodes/leafs
|
|
|
|
* If left < right advance left tree and repeat
|
|
|
|
* If left > right advance right tree and repeat
|
|
|
|
* If left == right compare blockptrs of the next nodes/leafs
|
|
|
|
* If they match advance both trees but stay at the same level
|
|
|
|
* and repeat
|
|
|
|
* If they don't match advance both trees while allowing to go
|
|
|
|
* deeper and repeat
|
|
|
|
* If tree levels are different
|
|
|
|
* Advance the tree that needs it and repeat
|
|
|
|
*
|
|
|
|
* Advancing a tree means:
|
|
|
|
* If we are at level 0, try to go to the next slot. If that's not
|
|
|
|
* possible, go one level up and repeat. Stop when we found a level
|
|
|
|
* where we could go to the next slot. We may at this point be on a
|
|
|
|
* node or a leaf.
|
|
|
|
*
|
|
|
|
* If we are not at level 0 and not on shared tree blocks, go one
|
|
|
|
* level deeper.
|
|
|
|
*
|
|
|
|
* If we are not at level 0 and on shared tree blocks, go one slot to
|
|
|
|
* the right if possible or go up and right.
|
|
|
|
*/
|
|
|
|
|
|
|
|
down_read(&fs_info->commit_root_sem);
|
|
|
|
left_level = btrfs_header_level(left_root->commit_root);
|
|
|
|
left_root_level = left_level;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
/*
|
|
|
|
* We clone the root node of the send and parent roots to prevent races
|
|
|
|
* with snapshot creation of these roots. Snapshot creation COWs the
|
|
|
|
* root node of a tree, so after the transaction is committed the old
|
|
|
|
* extent can be reallocated while this send operation is still ongoing.
|
|
|
|
* So we clone them, under the commit root semaphore, to be race free.
|
|
|
|
*/
|
2019-08-22 01:12:59 +08:00
|
|
|
left_path->nodes[left_level] =
|
|
|
|
btrfs_clone_extent_buffer(left_root->commit_root);
|
|
|
|
if (!left_path->nodes[left_level]) {
|
|
|
|
ret = -ENOMEM;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
goto out_unlock;
|
2019-08-22 01:12:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
right_level = btrfs_header_level(right_root->commit_root);
|
|
|
|
right_root_level = right_level;
|
|
|
|
right_path->nodes[right_level] =
|
|
|
|
btrfs_clone_extent_buffer(right_root->commit_root);
|
|
|
|
if (!right_path->nodes[right_level]) {
|
|
|
|
ret = -ENOMEM;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
goto out_unlock;
|
2019-08-22 01:12:59 +08:00
|
|
|
}
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
/*
|
|
|
|
* Our right root is the parent root, while the left root is the "send"
|
|
|
|
* root. We know that all new nodes/leaves in the left root must have
|
|
|
|
* a generation greater than the right root's generation, so we trigger
|
|
|
|
* readahead for those nodes and leaves of the left root, as we know we
|
|
|
|
* will need to read them at some point.
|
|
|
|
*/
|
|
|
|
reada_min_gen = btrfs_header_generation(right_root->commit_root);
|
2019-08-22 01:12:59 +08:00
|
|
|
|
|
|
|
if (left_level == 0)
|
|
|
|
btrfs_item_key_to_cpu(left_path->nodes[left_level],
|
|
|
|
&left_key, left_path->slots[left_level]);
|
|
|
|
else
|
|
|
|
btrfs_node_key_to_cpu(left_path->nodes[left_level],
|
|
|
|
&left_key, left_path->slots[left_level]);
|
|
|
|
if (right_level == 0)
|
|
|
|
btrfs_item_key_to_cpu(right_path->nodes[right_level],
|
|
|
|
&right_key, right_path->slots[right_level]);
|
|
|
|
else
|
|
|
|
btrfs_node_key_to_cpu(right_path->nodes[right_level],
|
|
|
|
&right_key, right_path->slots[right_level]);
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
sctx->last_reloc_trans = fs_info->last_reloc_trans;
|
2019-08-22 01:12:59 +08:00
|
|
|
|
|
|
|
while (1) {
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
if (need_resched() ||
|
|
|
|
rwsem_is_contended(&fs_info->commit_root_sem)) {
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
|
|
|
cond_resched();
|
|
|
|
down_read(&fs_info->commit_root_sem);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fs_info->last_reloc_trans > sctx->last_reloc_trans) {
|
|
|
|
ret = restart_after_relocation(left_path, right_path,
|
|
|
|
&left_key, &right_key,
|
|
|
|
left_level, right_level,
|
|
|
|
sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_unlock;
|
|
|
|
sctx->last_reloc_trans = fs_info->last_reloc_trans;
|
|
|
|
}
|
|
|
|
|
2019-08-22 01:12:59 +08:00
|
|
|
if (advance_left && !left_end_reached) {
|
|
|
|
ret = tree_advance(left_path, &left_level,
|
|
|
|
left_root_level,
|
|
|
|
advance_left != ADVANCE_ONLY_NEXT,
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
&left_key, reada_min_gen);
|
2019-08-22 01:12:59 +08:00
|
|
|
if (ret == -1)
|
|
|
|
left_end_reached = ADVANCE;
|
|
|
|
else if (ret < 0)
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
goto out_unlock;
|
2019-08-22 01:12:59 +08:00
|
|
|
advance_left = 0;
|
|
|
|
}
|
|
|
|
if (advance_right && !right_end_reached) {
|
|
|
|
ret = tree_advance(right_path, &right_level,
|
|
|
|
right_root_level,
|
|
|
|
advance_right != ADVANCE_ONLY_NEXT,
|
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 17:26:43 +08:00
|
|
|
&right_key, reada_min_gen);
|
2019-08-22 01:12:59 +08:00
|
|
|
if (ret == -1)
|
|
|
|
right_end_reached = ADVANCE;
|
|
|
|
else if (ret < 0)
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
goto out_unlock;
|
2019-08-22 01:12:59 +08:00
|
|
|
advance_right = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (left_end_reached && right_end_reached) {
|
|
|
|
ret = 0;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
goto out_unlock;
|
2019-08-22 01:12:59 +08:00
|
|
|
} else if (left_end_reached) {
|
|
|
|
if (right_level == 0) {
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
up_read(&fs_info->commit_root_sem);
|
2019-08-22 01:12:59 +08:00
|
|
|
ret = changed_cb(left_path, right_path,
|
|
|
|
&right_key,
|
|
|
|
BTRFS_COMPARE_TREE_DELETED,
|
2021-01-26 03:43:25 +08:00
|
|
|
sctx);
|
2019-08-22 01:12:59 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
down_read(&fs_info->commit_root_sem);
|
2019-08-22 01:12:59 +08:00
|
|
|
}
|
|
|
|
advance_right = ADVANCE;
|
|
|
|
continue;
|
|
|
|
} else if (right_end_reached) {
|
|
|
|
if (left_level == 0) {
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
up_read(&fs_info->commit_root_sem);
|
2019-08-22 01:12:59 +08:00
|
|
|
ret = changed_cb(left_path, right_path,
|
|
|
|
&left_key,
|
|
|
|
BTRFS_COMPARE_TREE_NEW,
|
2021-01-26 03:43:25 +08:00
|
|
|
sctx);
|
2019-08-22 01:12:59 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
down_read(&fs_info->commit_root_sem);
|
2019-08-22 01:12:59 +08:00
|
|
|
}
|
|
|
|
advance_left = ADVANCE;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (left_level == 0 && right_level == 0) {
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
up_read(&fs_info->commit_root_sem);
|
2019-08-22 01:12:59 +08:00
|
|
|
cmp = btrfs_comp_cpu_keys(&left_key, &right_key);
|
|
|
|
if (cmp < 0) {
|
|
|
|
ret = changed_cb(left_path, right_path,
|
|
|
|
&left_key,
|
|
|
|
BTRFS_COMPARE_TREE_NEW,
|
2021-01-26 03:43:25 +08:00
|
|
|
sctx);
|
2019-08-22 01:12:59 +08:00
|
|
|
advance_left = ADVANCE;
|
|
|
|
} else if (cmp > 0) {
|
|
|
|
ret = changed_cb(left_path, right_path,
|
|
|
|
&right_key,
|
|
|
|
BTRFS_COMPARE_TREE_DELETED,
|
2021-01-26 03:43:25 +08:00
|
|
|
sctx);
|
2019-08-22 01:12:59 +08:00
|
|
|
advance_right = ADVANCE;
|
|
|
|
} else {
|
|
|
|
enum btrfs_compare_tree_result result;
|
|
|
|
|
|
|
|
WARN_ON(!extent_buffer_uptodate(left_path->nodes[0]));
|
|
|
|
ret = tree_compare_item(left_path, right_path,
|
|
|
|
tmp_buf);
|
|
|
|
if (ret)
|
|
|
|
result = BTRFS_COMPARE_TREE_CHANGED;
|
|
|
|
else
|
|
|
|
result = BTRFS_COMPARE_TREE_SAME;
|
|
|
|
ret = changed_cb(left_path, right_path,
|
2021-01-26 03:43:25 +08:00
|
|
|
&left_key, result, sctx);
|
2019-08-22 01:12:59 +08:00
|
|
|
advance_left = ADVANCE;
|
|
|
|
advance_right = ADVANCE;
|
|
|
|
}
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
down_read(&fs_info->commit_root_sem);
|
2019-08-22 01:12:59 +08:00
|
|
|
} else if (left_level == right_level) {
|
|
|
|
cmp = btrfs_comp_cpu_keys(&left_key, &right_key);
|
|
|
|
if (cmp < 0) {
|
|
|
|
advance_left = ADVANCE;
|
|
|
|
} else if (cmp > 0) {
|
|
|
|
advance_right = ADVANCE;
|
|
|
|
} else {
|
|
|
|
left_blockptr = btrfs_node_blockptr(
|
|
|
|
left_path->nodes[left_level],
|
|
|
|
left_path->slots[left_level]);
|
|
|
|
right_blockptr = btrfs_node_blockptr(
|
|
|
|
right_path->nodes[right_level],
|
|
|
|
right_path->slots[right_level]);
|
|
|
|
left_gen = btrfs_node_ptr_generation(
|
|
|
|
left_path->nodes[left_level],
|
|
|
|
left_path->slots[left_level]);
|
|
|
|
right_gen = btrfs_node_ptr_generation(
|
|
|
|
right_path->nodes[right_level],
|
|
|
|
right_path->slots[right_level]);
|
|
|
|
if (left_blockptr == right_blockptr &&
|
|
|
|
left_gen == right_gen) {
|
|
|
|
/*
|
|
|
|
* As we're on a shared block, don't
|
|
|
|
* allow to go deeper.
|
|
|
|
*/
|
|
|
|
advance_left = ADVANCE_ONLY_NEXT;
|
|
|
|
advance_right = ADVANCE_ONLY_NEXT;
|
|
|
|
} else {
|
|
|
|
advance_left = ADVANCE;
|
|
|
|
advance_right = ADVANCE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else if (left_level < right_level) {
|
|
|
|
advance_right = ADVANCE;
|
|
|
|
} else {
|
|
|
|
advance_left = ADVANCE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
out_unlock:
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
2019-08-22 01:12:59 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(left_path);
|
|
|
|
btrfs_free_path(right_path);
|
|
|
|
kvfree(tmp_buf);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
static int send_subvol(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2013-04-11 01:10:52 +08:00
|
|
|
if (!(sctx->flags & BTRFS_SEND_FLAG_OMIT_STREAM_HEADER)) {
|
|
|
|
ret = send_header(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
ret = send_subvol_begin(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (sctx->parent_root) {
|
2020-08-17 18:16:57 +08:00
|
|
|
ret = btrfs_compare_trees(sctx->send_root, sctx->parent_root, sctx);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = finish_inode_if_needed(sctx, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
} else {
|
|
|
|
ret = full_send_tree(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
free_recorded_refs(sctx);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-10-21 18:11:41 +08:00
|
|
|
/*
|
|
|
|
* If orphan cleanup did remove any orphans from a root, it means the tree
|
|
|
|
* was modified and therefore the commit root is not the same as the current
|
|
|
|
* root anymore. This is a problem, because send uses the commit root and
|
|
|
|
* therefore can see inode items that don't exist in the current root anymore,
|
|
|
|
* and for example make calls to btrfs_iget, which will do tree lookups based
|
|
|
|
* on the current root and not on the commit root. Those lookups will fail,
|
|
|
|
* returning a -ESTALE error, and making send fail with that error. So make
|
|
|
|
* sure a send does not see any orphans we have just removed, and that it will
|
|
|
|
* see the same inodes regardless of whether a transaction commit happened
|
|
|
|
* before it started (meaning that the commit root will be the same as the
|
|
|
|
* current root) or not.
|
|
|
|
*/
|
|
|
|
static int ensure_commit_roots_uptodate(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct btrfs_trans_handle *trans = NULL;
|
|
|
|
|
|
|
|
again:
|
|
|
|
if (sctx->parent_root &&
|
|
|
|
sctx->parent_root->node != sctx->parent_root->commit_root)
|
|
|
|
goto commit_trans;
|
|
|
|
|
|
|
|
for (i = 0; i < sctx->clone_roots_cnt; i++)
|
|
|
|
if (sctx->clone_roots[i].root->node !=
|
|
|
|
sctx->clone_roots[i].root->commit_root)
|
|
|
|
goto commit_trans;
|
|
|
|
|
|
|
|
if (trans)
|
2016-09-10 09:39:03 +08:00
|
|
|
return btrfs_end_transaction(trans);
|
2014-10-21 18:11:41 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
commit_trans:
|
|
|
|
/* Use any root, all fs roots will get their commit roots updated. */
|
|
|
|
if (!trans) {
|
|
|
|
trans = btrfs_join_transaction(sctx->send_root);
|
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
2016-09-10 09:39:03 +08:00
|
|
|
return btrfs_commit_transaction(trans);
|
2014-10-21 18:11:41 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: send, flush dellaloc in order to avoid data loss
When we set a subvolume to read-only mode we do not flush dellaloc for any
of its inodes (except if the filesystem is mounted with -o flushoncommit),
since it does not affect correctness for any subsequent operations - except
for a future send operation. The send operation will not be able to see the
delalloc data since the respective file extent items, inode item updates,
backreferences, etc, have not hit yet the subvolume and extent trees.
Effectively this means data loss, since the send stream will not contain
any data from existing delalloc. Another problem from this is that if the
writeback starts and finishes while the send operation is in progress, we
have the subvolume tree being being modified concurrently which can result
in send failing unexpectedly with EIO or hitting runtime errors, assertion
failures or hitting BUG_ONs, etc.
Simple reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv
$ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
$ btrfs property set /mnt/sv ro true
$ btrfs send -f /tmp/send.stream /mnt/sv
$ od -t x1 -A d /mnt/sv/foo
0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
*
0110592
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /tmp/send.stream /mnt
$ echo $?
0
$ od -t x1 -A d /mnt/sv/foo
0000000
# ---> empty file
Since this a problem that affects send only, fix it in send by flushing
dellaloc for all the roots used by the send operation before send starts
to process the commit roots.
This is a problem that affects send since it was introduced (commit
31db9f7c23fbf7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
but backporting it to older kernels has some dependencies:
- For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2
("btrfs: use tagged writepage to mitigate livelock of snapshot") because
the function btrfs_start_delalloc_snapshot() does not exist before that
commit. So one has to either pick that commit or replace the calls to
btrfs_start_delalloc_snapshot() in this patch with calls to
btrfs_start_delalloc_inodes().
- For kernels older than 3.19 it also requires commit e5fa8f865b3324
("Btrfs: ensure send always works on roots without orphans") because
it depends on the function ensure_commit_roots_uptodate() which that
commits introduced.
- No dependencies for 5.0+ kernels.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-15 16:29:36 +08:00
|
|
|
/*
|
|
|
|
* Make sure any existing dellaloc is flushed for any root used by a send
|
|
|
|
* operation so that we do not miss any data and we do not race with writeback
|
|
|
|
* finishing and changing a tree while send is using the tree. This could
|
|
|
|
* happen if a subvolume is in RW mode, has delalloc, is turned to RO mode and
|
|
|
|
* a send operation then uses the subvolume.
|
|
|
|
* After flushing delalloc ensure_commit_roots_uptodate() must be called.
|
|
|
|
*/
|
|
|
|
static int flush_delalloc_roots(struct send_ctx *sctx)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = sctx->parent_root;
|
|
|
|
int ret;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (root) {
|
btrfs: fix deadlock when cloning inline extents and using qgroups
There are a few exceptional cases where cloning an inline extent needs to
copy the inline extent data into a page of the destination inode.
When this happens, we end up starting a transaction while having a dirty
page for the destination inode and while having the range locked in the
destination's inode iotree too. Because when reserving metadata space
for a transaction we may need to flush existing delalloc in case there is
not enough free space, we have a mechanism in place to prevent a deadlock,
which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when
cloning inline extent and low on free metadata space").
However when using qgroups, a transaction also reserves metadata qgroup
space, which can also result in flushing delalloc in case there is not
enough available space at the moment. When this happens we deadlock, since
flushing delalloc requires locking the file range in the inode's iotree
and the range was already locked at the very beginning of the clone
operation, before attempting to start the transaction.
When this issue happens, stack traces like the following are reported:
[72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000
[72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
[72747.556271] Call Trace:
[72747.556273] __schedule+0x296/0x760
[72747.556277] schedule+0x3c/0xa0
[72747.556279] io_schedule+0x12/0x40
[72747.556284] __lock_page+0x13c/0x280
[72747.556287] ? generic_file_readonly_mmap+0x70/0x70
[72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs]
[72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160
[72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
[72747.556362] ? update_group_capacity+0x25/0x210
[72747.556366] ? cpumask_next_and+0x1a/0x20
[72747.556391] extent_writepages+0x44/0xa0 [btrfs]
[72747.556394] do_writepages+0x41/0xd0
[72747.556398] __writeback_single_inode+0x39/0x2a0
[72747.556403] writeback_sb_inodes+0x1ea/0x440
[72747.556407] __writeback_inodes_wb+0x5f/0xc0
[72747.556410] wb_writeback+0x235/0x2b0
[72747.556414] ? get_nr_inodes+0x35/0x50
[72747.556417] wb_workfn+0x354/0x490
[72747.556420] ? newidle_balance+0x2c5/0x3e0
[72747.556424] process_one_work+0x1aa/0x340
[72747.556426] worker_thread+0x30/0x390
[72747.556429] ? create_worker+0x1a0/0x1a0
[72747.556432] kthread+0x116/0x130
[72747.556435] ? kthread_park+0x80/0x80
[72747.556438] ret_from_fork+0x1f/0x30
[72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[72747.566961] Call Trace:
[72747.566964] __schedule+0x296/0x760
[72747.566968] ? finish_wait+0x80/0x80
[72747.566970] schedule+0x3c/0xa0
[72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
[72747.566999] ? finish_wait+0x80/0x80
[72747.567024] lock_extent_bits+0x37/0x90 [btrfs]
[72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs]
[72747.567051] ? find_get_pages_range_tag+0x2cd/0x380
[72747.567076] __extent_writepage+0x203/0x320 [btrfs]
[72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[72747.567106] ? update_load_avg+0x7e/0x5f0
[72747.567109] ? enqueue_entity+0xf4/0x6f0
[72747.567134] extent_writepages+0x44/0xa0 [btrfs]
[72747.567137] ? enqueue_task_fair+0x93/0x6f0
[72747.567140] do_writepages+0x41/0xd0
[72747.567144] __filemap_fdatawrite_range+0xc7/0x100
[72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs]
[72747.567200] process_one_work+0x1aa/0x340
[72747.567202] worker_thread+0x30/0x390
[72747.567205] ? create_worker+0x1a0/0x1a0
[72747.567208] kthread+0x116/0x130
[72747.567211] ? kthread_park+0x80/0x80
[72747.567214] ret_from_fork+0x1f/0x30
[72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000
[72747.569689] Call Trace:
[72747.569691] __schedule+0x296/0x760
[72747.569694] schedule+0x3c/0xa0
[72747.569721] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569725] ? finish_wait+0x80/0x80
[72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
[72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
[72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
[72747.569810] ? path_lookupat.isra.48+0x97/0x140
[72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs]
[72747.569836] ? __kmalloc+0x16a/0x2c0
[72747.569839] do_iter_readv_writev+0x160/0x1c0
[72747.569843] do_iter_write+0x80/0x1b0
[72747.569847] vfs_writev+0x84/0x140
[72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs]
[72747.569873] do_writev+0x65/0x100
[72747.569876] do_syscall_64+0x33/0x40
[72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000
[72747.569903] Call Trace:
[72747.569906] __schedule+0x296/0x760
[72747.569909] schedule+0x3c/0xa0
[72747.569936] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569940] ? finish_wait+0x80/0x80
[72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
[72747.569989] start_transaction+0x279/0x580 [btrfs]
[72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs]
[72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs]
[72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs]
[72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs]
[72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
[72747.570126] do_clone_file_range+0xed/0x200
[72747.570131] vfs_clone_file_range+0x37/0x110
[72747.570134] ioctl_file_clone+0x7d/0xb0
[72747.570137] do_vfs_ioctl+0x138/0x630
[72747.570140] __x64_sys_ioctl+0x62/0xc0
[72747.570143] do_syscall_64+0x33/0x40
[72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9
So fix this by skipping the flush of delalloc for an inode that is
flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
such a special case of cloning an inline extent, when flushing delalloc
during qgroup metadata reservation.
The special cases for cloning inline extents were added in kernel 5.7 by
by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for
inline extents"), while having qgroup metadata space reservation flushing
delalloc when low on space was added in kernel 5.9 by commit
c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get
-EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
kernel backports.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-22 19:08:05 +08:00
|
|
|
ret = btrfs_start_delalloc_snapshot(root, false);
|
Btrfs: send, flush dellaloc in order to avoid data loss
When we set a subvolume to read-only mode we do not flush dellaloc for any
of its inodes (except if the filesystem is mounted with -o flushoncommit),
since it does not affect correctness for any subsequent operations - except
for a future send operation. The send operation will not be able to see the
delalloc data since the respective file extent items, inode item updates,
backreferences, etc, have not hit yet the subvolume and extent trees.
Effectively this means data loss, since the send stream will not contain
any data from existing delalloc. Another problem from this is that if the
writeback starts and finishes while the send operation is in progress, we
have the subvolume tree being being modified concurrently which can result
in send failing unexpectedly with EIO or hitting runtime errors, assertion
failures or hitting BUG_ONs, etc.
Simple reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv
$ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
$ btrfs property set /mnt/sv ro true
$ btrfs send -f /tmp/send.stream /mnt/sv
$ od -t x1 -A d /mnt/sv/foo
0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
*
0110592
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /tmp/send.stream /mnt
$ echo $?
0
$ od -t x1 -A d /mnt/sv/foo
0000000
# ---> empty file
Since this a problem that affects send only, fix it in send by flushing
dellaloc for all the roots used by the send operation before send starts
to process the commit roots.
This is a problem that affects send since it was introduced (commit
31db9f7c23fbf7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
but backporting it to older kernels has some dependencies:
- For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2
("btrfs: use tagged writepage to mitigate livelock of snapshot") because
the function btrfs_start_delalloc_snapshot() does not exist before that
commit. So one has to either pick that commit or replace the calls to
btrfs_start_delalloc_snapshot() in this patch with calls to
btrfs_start_delalloc_inodes().
- For kernels older than 3.19 it also requires commit e5fa8f865b3324
("Btrfs: ensure send always works on roots without orphans") because
it depends on the function ensure_commit_roots_uptodate() which that
commits introduced.
- No dependencies for 5.0+ kernels.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-15 16:29:36 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < sctx->clone_roots_cnt; i++) {
|
|
|
|
root = sctx->clone_roots[i].root;
|
btrfs: fix deadlock when cloning inline extents and using qgroups
There are a few exceptional cases where cloning an inline extent needs to
copy the inline extent data into a page of the destination inode.
When this happens, we end up starting a transaction while having a dirty
page for the destination inode and while having the range locked in the
destination's inode iotree too. Because when reserving metadata space
for a transaction we may need to flush existing delalloc in case there is
not enough free space, we have a mechanism in place to prevent a deadlock,
which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when
cloning inline extent and low on free metadata space").
However when using qgroups, a transaction also reserves metadata qgroup
space, which can also result in flushing delalloc in case there is not
enough available space at the moment. When this happens we deadlock, since
flushing delalloc requires locking the file range in the inode's iotree
and the range was already locked at the very beginning of the clone
operation, before attempting to start the transaction.
When this issue happens, stack traces like the following are reported:
[72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000
[72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
[72747.556271] Call Trace:
[72747.556273] __schedule+0x296/0x760
[72747.556277] schedule+0x3c/0xa0
[72747.556279] io_schedule+0x12/0x40
[72747.556284] __lock_page+0x13c/0x280
[72747.556287] ? generic_file_readonly_mmap+0x70/0x70
[72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs]
[72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160
[72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
[72747.556362] ? update_group_capacity+0x25/0x210
[72747.556366] ? cpumask_next_and+0x1a/0x20
[72747.556391] extent_writepages+0x44/0xa0 [btrfs]
[72747.556394] do_writepages+0x41/0xd0
[72747.556398] __writeback_single_inode+0x39/0x2a0
[72747.556403] writeback_sb_inodes+0x1ea/0x440
[72747.556407] __writeback_inodes_wb+0x5f/0xc0
[72747.556410] wb_writeback+0x235/0x2b0
[72747.556414] ? get_nr_inodes+0x35/0x50
[72747.556417] wb_workfn+0x354/0x490
[72747.556420] ? newidle_balance+0x2c5/0x3e0
[72747.556424] process_one_work+0x1aa/0x340
[72747.556426] worker_thread+0x30/0x390
[72747.556429] ? create_worker+0x1a0/0x1a0
[72747.556432] kthread+0x116/0x130
[72747.556435] ? kthread_park+0x80/0x80
[72747.556438] ret_from_fork+0x1f/0x30
[72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[72747.566961] Call Trace:
[72747.566964] __schedule+0x296/0x760
[72747.566968] ? finish_wait+0x80/0x80
[72747.566970] schedule+0x3c/0xa0
[72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
[72747.566999] ? finish_wait+0x80/0x80
[72747.567024] lock_extent_bits+0x37/0x90 [btrfs]
[72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs]
[72747.567051] ? find_get_pages_range_tag+0x2cd/0x380
[72747.567076] __extent_writepage+0x203/0x320 [btrfs]
[72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[72747.567106] ? update_load_avg+0x7e/0x5f0
[72747.567109] ? enqueue_entity+0xf4/0x6f0
[72747.567134] extent_writepages+0x44/0xa0 [btrfs]
[72747.567137] ? enqueue_task_fair+0x93/0x6f0
[72747.567140] do_writepages+0x41/0xd0
[72747.567144] __filemap_fdatawrite_range+0xc7/0x100
[72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs]
[72747.567200] process_one_work+0x1aa/0x340
[72747.567202] worker_thread+0x30/0x390
[72747.567205] ? create_worker+0x1a0/0x1a0
[72747.567208] kthread+0x116/0x130
[72747.567211] ? kthread_park+0x80/0x80
[72747.567214] ret_from_fork+0x1f/0x30
[72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000
[72747.569689] Call Trace:
[72747.569691] __schedule+0x296/0x760
[72747.569694] schedule+0x3c/0xa0
[72747.569721] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569725] ? finish_wait+0x80/0x80
[72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
[72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
[72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
[72747.569810] ? path_lookupat.isra.48+0x97/0x140
[72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs]
[72747.569836] ? __kmalloc+0x16a/0x2c0
[72747.569839] do_iter_readv_writev+0x160/0x1c0
[72747.569843] do_iter_write+0x80/0x1b0
[72747.569847] vfs_writev+0x84/0x140
[72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs]
[72747.569873] do_writev+0x65/0x100
[72747.569876] do_syscall_64+0x33/0x40
[72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000
[72747.569903] Call Trace:
[72747.569906] __schedule+0x296/0x760
[72747.569909] schedule+0x3c/0xa0
[72747.569936] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569940] ? finish_wait+0x80/0x80
[72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
[72747.569989] start_transaction+0x279/0x580 [btrfs]
[72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs]
[72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs]
[72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs]
[72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs]
[72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
[72747.570126] do_clone_file_range+0xed/0x200
[72747.570131] vfs_clone_file_range+0x37/0x110
[72747.570134] ioctl_file_clone+0x7d/0xb0
[72747.570137] do_vfs_ioctl+0x138/0x630
[72747.570140] __x64_sys_ioctl+0x62/0xc0
[72747.570143] do_syscall_64+0x33/0x40
[72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9
So fix this by skipping the flush of delalloc for an inode that is
flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
such a special case of cloning an inline extent, when flushing delalloc
during qgroup metadata reservation.
The special cases for cloning inline extents were added in kernel 5.7 by
by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for
inline extents"), while having qgroup metadata space reservation flushing
delalloc when low on space was added in kernel 5.9 by commit
c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get
-EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
kernel backports.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-22 19:08:05 +08:00
|
|
|
ret = btrfs_start_delalloc_snapshot(root, false);
|
Btrfs: send, flush dellaloc in order to avoid data loss
When we set a subvolume to read-only mode we do not flush dellaloc for any
of its inodes (except if the filesystem is mounted with -o flushoncommit),
since it does not affect correctness for any subsequent operations - except
for a future send operation. The send operation will not be able to see the
delalloc data since the respective file extent items, inode item updates,
backreferences, etc, have not hit yet the subvolume and extent trees.
Effectively this means data loss, since the send stream will not contain
any data from existing delalloc. Another problem from this is that if the
writeback starts and finishes while the send operation is in progress, we
have the subvolume tree being being modified concurrently which can result
in send failing unexpectedly with EIO or hitting runtime errors, assertion
failures or hitting BUG_ONs, etc.
Simple reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv
$ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
$ btrfs property set /mnt/sv ro true
$ btrfs send -f /tmp/send.stream /mnt/sv
$ od -t x1 -A d /mnt/sv/foo
0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
*
0110592
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /tmp/send.stream /mnt
$ echo $?
0
$ od -t x1 -A d /mnt/sv/foo
0000000
# ---> empty file
Since this a problem that affects send only, fix it in send by flushing
dellaloc for all the roots used by the send operation before send starts
to process the commit roots.
This is a problem that affects send since it was introduced (commit
31db9f7c23fbf7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
but backporting it to older kernels has some dependencies:
- For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2
("btrfs: use tagged writepage to mitigate livelock of snapshot") because
the function btrfs_start_delalloc_snapshot() does not exist before that
commit. So one has to either pick that commit or replace the calls to
btrfs_start_delalloc_snapshot() in this patch with calls to
btrfs_start_delalloc_inodes().
- For kernels older than 3.19 it also requires commit e5fa8f865b3324
("Btrfs: ensure send always works on roots without orphans") because
it depends on the function ensure_commit_roots_uptodate() which that
commits introduced.
- No dependencies for 5.0+ kernels.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-15 16:29:36 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-12-17 22:07:20 +08:00
|
|
|
static void btrfs_root_dec_send_in_progress(struct btrfs_root* root)
|
|
|
|
{
|
|
|
|
spin_lock(&root->root_item_lock);
|
|
|
|
root->send_in_progress--;
|
|
|
|
/*
|
|
|
|
* Not much left to do, we don't know why it's unbalanced and
|
|
|
|
* can't blindly reset it to 0.
|
|
|
|
*/
|
|
|
|
if (root->send_in_progress < 0)
|
|
|
|
btrfs_err(root->fs_info,
|
2018-05-04 19:11:12 +08:00
|
|
|
"send_in_progress unbalanced %d root %llu",
|
2016-06-23 06:54:23 +08:00
|
|
|
root->send_in_progress, root->root_key.objectid);
|
2013-12-17 22:07:20 +08:00
|
|
|
spin_unlock(&root->root_item_lock);
|
|
|
|
}
|
|
|
|
|
Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 23:43:42 +08:00
|
|
|
static void dedupe_in_progress_warn(const struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
btrfs_warn_rl(root->fs_info,
|
|
|
|
"cannot use root %llu for send while deduplications on it are in progress (%d in progress)",
|
|
|
|
root->root_key.objectid, root->dedupe_in_progress);
|
|
|
|
}
|
|
|
|
|
2022-01-16 10:48:47 +08:00
|
|
|
long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
|
2012-07-26 05:19:24 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
2022-01-16 10:48:47 +08:00
|
|
|
struct btrfs_root *send_root = BTRFS_I(inode)->root;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = send_root->fs_info;
|
2012-07-26 05:19:24 +08:00
|
|
|
struct btrfs_root *clone_root;
|
|
|
|
struct send_ctx *sctx = NULL;
|
|
|
|
u32 i;
|
|
|
|
u64 *clone_sources_tmp = NULL;
|
2013-12-17 00:34:17 +08:00
|
|
|
int clone_sources_to_rollback = 0;
|
2020-09-22 01:03:36 +08:00
|
|
|
size_t alloc_size;
|
2014-01-07 17:25:18 +08:00
|
|
|
int sort_clone_roots = 0;
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
struct btrfs_lru_cache_entry *entry;
|
|
|
|
struct btrfs_lru_cache_entry *tmp;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2013-12-17 00:34:17 +08:00
|
|
|
/*
|
|
|
|
* The subvolume must remain read-only during send, protect against
|
2014-04-15 22:41:44 +08:00
|
|
|
* making it RW. This also protects against deletion.
|
2013-12-17 00:34:17 +08:00
|
|
|
*/
|
|
|
|
spin_lock(&send_root->root_item_lock);
|
Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 23:43:42 +08:00
|
|
|
if (btrfs_root_readonly(send_root) && send_root->dedupe_in_progress) {
|
|
|
|
dedupe_in_progress_warn(send_root);
|
|
|
|
spin_unlock(&send_root->root_item_lock);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
2013-12-17 00:34:17 +08:00
|
|
|
send_root->send_in_progress++;
|
|
|
|
spin_unlock(&send_root->root_item_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Userspace tools do the checks and warn the user if it's
|
|
|
|
* not RO.
|
|
|
|
*/
|
|
|
|
if (!btrfs_root_readonly(send_root)) {
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2017-03-18 04:51:20 +08:00
|
|
|
/*
|
|
|
|
* Check that we don't overflow at later allocations, we request
|
|
|
|
* clone_sources_count + 1 items, and compare to unsigned long inside
|
2023-01-25 03:32:10 +08:00
|
|
|
* access_ok. Also set an upper limit for allocation size so this can't
|
|
|
|
* easily exhaust memory. Max number of clone sources is about 200K.
|
2017-03-18 04:51:20 +08:00
|
|
|
*/
|
2023-01-25 03:32:10 +08:00
|
|
|
if (arg->clone_sources_count > SZ_8M / sizeof(struct clone_root)) {
|
2016-04-13 14:40:59 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2013-04-11 01:10:52 +08:00
|
|
|
if (arg->flags & ~BTRFS_SEND_FLAG_MASK) {
|
2024-01-11 00:48:44 +08:00
|
|
|
ret = -EOPNOTSUPP;
|
2013-02-05 04:54:57 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2016-01-19 01:42:13 +08:00
|
|
|
sctx = kzalloc(sizeof(struct send_ctx), GFP_KERNEL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!sctx) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&sctx->new_refs);
|
|
|
|
INIT_LIST_HEAD(&sctx->deleted_refs);
|
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
btrfs_lru_cache_init(&sctx->name_cache, SEND_MAX_NAME_CACHE_SIZE);
|
2023-01-11 19:36:13 +08:00
|
|
|
btrfs_lru_cache_init(&sctx->backref_cache, SEND_MAX_BACKREF_CACHE_SIZE);
|
2023-01-11 19:36:15 +08:00
|
|
|
btrfs_lru_cache_init(&sctx->dir_created_cache,
|
|
|
|
SEND_MAX_DIR_CREATED_CACHE_SIZE);
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
/*
|
|
|
|
* This cache is periodically trimmed to a fixed size elsewhere, see
|
|
|
|
* cache_dir_utimes() and trim_dir_utimes_cache().
|
|
|
|
*/
|
|
|
|
btrfs_lru_cache_init(&sctx->dir_utimes_cache, 0);
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
|
2023-01-11 19:36:12 +08:00
|
|
|
sctx->pending_dir_moves = RB_ROOT;
|
|
|
|
sctx->waiting_dir_moves = RB_ROOT;
|
|
|
|
sctx->orphan_dirs = RB_ROOT;
|
|
|
|
sctx->rbtree_new_refs = RB_ROOT;
|
|
|
|
sctx->rbtree_deleted_refs = RB_ROOT;
|
|
|
|
|
2013-02-05 04:54:57 +08:00
|
|
|
sctx->flags = arg->flags;
|
|
|
|
|
2021-10-22 22:53:36 +08:00
|
|
|
if (arg->flags & BTRFS_SEND_FLAG_VERSION) {
|
|
|
|
if (arg->version > BTRFS_SEND_STREAM_VERSION) {
|
|
|
|
ret = -EPROTO;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/* Zero means "use the highest version" */
|
|
|
|
sctx->proto = arg->version ?: BTRFS_SEND_STREAM_VERSION;
|
|
|
|
} else {
|
|
|
|
sctx->proto = 1;
|
|
|
|
}
|
2022-03-18 01:25:43 +08:00
|
|
|
if ((arg->flags & BTRFS_SEND_FLAG_COMPRESSED) && sctx->proto < 2) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
2021-10-22 22:53:36 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->send_filp = fget(arg->send_fd);
|
2023-11-25 00:48:31 +08:00
|
|
|
if (!sctx->send_filp || !(sctx->send_filp->f_mode & FMODE_WRITE)) {
|
2013-04-19 09:04:46 +08:00
|
|
|
ret = -EBADF;
|
2012-07-26 05:19:24 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
sctx->send_root = send_root;
|
2014-04-15 22:41:44 +08:00
|
|
|
/*
|
|
|
|
* Unlikely but possible, if the subvolume is marked for deletion but
|
|
|
|
* is slow to remove the directory entry, send can still be started
|
|
|
|
*/
|
|
|
|
if (btrfs_root_dead(sctx->send_root)) {
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->clone_roots_cnt = arg->clone_sources_count;
|
|
|
|
|
btrfs: send: get send buffer pages for protocol v2
For encoded writes in send v2, we will get the encoded data with
btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
pages. To avoid extra buffers and copies, we should read directly into
the send buffer. Therefore, we need the raw pages for the send buffer.
We currently allocate the send buffer with kvmalloc(), which may return
a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
However, the buffer size we use (144K) is not a power of two, which in
theory is not guaranteed to return a page-aligned buffer, and in
practice would waste a lot of memory due to rounding up to the next
power of two. 144K is large enough that it usually gets allocated with
vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
and save the pages in an array.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-05 01:29:07 +08:00
|
|
|
if (sctx->proto >= 2) {
|
|
|
|
u32 send_buf_num_pages;
|
|
|
|
|
2022-10-19 16:10:01 +08:00
|
|
|
sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V2;
|
btrfs: send: get send buffer pages for protocol v2
For encoded writes in send v2, we will get the encoded data with
btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
pages. To avoid extra buffers and copies, we should read directly into
the send buffer. Therefore, we need the raw pages for the send buffer.
We currently allocate the send buffer with kvmalloc(), which may return
a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
However, the buffer size we use (144K) is not a power of two, which in
theory is not guaranteed to return a page-aligned buffer, and in
practice would waste a lot of memory due to rounding up to the next
power of two. 144K is large enough that it usually gets allocated with
vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
and save the pages in an array.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-05 01:29:07 +08:00
|
|
|
sctx->send_buf = vmalloc(sctx->send_max_size);
|
|
|
|
if (!sctx->send_buf) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
send_buf_num_pages = sctx->send_max_size >> PAGE_SHIFT;
|
|
|
|
sctx->send_buf_pages = kcalloc(send_buf_num_pages,
|
|
|
|
sizeof(*sctx->send_buf_pages),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!sctx->send_buf_pages) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
for (i = 0; i < send_buf_num_pages; i++) {
|
|
|
|
sctx->send_buf_pages[i] =
|
|
|
|
vmalloc_to_page(sctx->send_buf + (i << PAGE_SHIFT));
|
|
|
|
}
|
|
|
|
} else {
|
2022-03-18 01:25:40 +08:00
|
|
|
sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
|
btrfs: send: get send buffer pages for protocol v2
For encoded writes in send v2, we will get the encoded data with
btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
pages. To avoid extra buffers and copies, we should read directly into
the send buffer. Therefore, we need the raw pages for the send buffer.
We currently allocate the send buffer with kvmalloc(), which may return
a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
However, the buffer size we use (144K) is not a power of two, which in
theory is not guaranteed to return a page-aligned buffer, and in
practice would waste a lot of memory due to rounding up to the next
power of two. 144K is large enough that it usually gets allocated with
vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
and save the pages in an array.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-05 01:29:07 +08:00
|
|
|
sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!sctx->send_buf) {
|
2017-05-09 06:57:27 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2023-12-21 16:47:45 +08:00
|
|
|
sctx->clone_roots = kvcalloc(arg->clone_sources_count + 1,
|
|
|
|
sizeof(*sctx->clone_roots),
|
2020-09-22 01:03:36 +08:00
|
|
|
GFP_KERNEL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!sctx->clone_roots) {
|
2017-06-01 00:40:02 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
2020-09-22 01:03:36 +08:00
|
|
|
alloc_size = array_size(sizeof(*arg->clone_sources),
|
|
|
|
arg->clone_sources_count);
|
2016-04-12 00:52:02 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
if (arg->clone_sources_count) {
|
2017-05-09 06:57:27 +08:00
|
|
|
clone_sources_tmp = kvmalloc(alloc_size, GFP_KERNEL);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (!clone_sources_tmp) {
|
2017-05-09 06:57:27 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ret = copy_from_user(clone_sources_tmp, arg->clone_sources,
|
2016-04-12 00:52:02 +08:00
|
|
|
alloc_size);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (ret) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < arg->clone_sources_count; i++) {
|
2020-05-16 01:35:55 +08:00
|
|
|
clone_root = btrfs_get_fs_root(fs_info,
|
|
|
|
clone_sources_tmp[i], true);
|
2012-07-26 05:19:24 +08:00
|
|
|
if (IS_ERR(clone_root)) {
|
|
|
|
ret = PTR_ERR(clone_root);
|
|
|
|
goto out;
|
|
|
|
}
|
2013-12-17 00:34:17 +08:00
|
|
|
spin_lock(&clone_root->root_item_lock);
|
2015-03-03 04:53:52 +08:00
|
|
|
if (!btrfs_root_readonly(clone_root) ||
|
|
|
|
btrfs_root_dead(clone_root)) {
|
2013-12-17 00:34:17 +08:00
|
|
|
spin_unlock(&clone_root->root_item_lock);
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(clone_root);
|
2013-12-17 00:34:17 +08:00
|
|
|
ret = -EPERM;
|
|
|
|
goto out;
|
|
|
|
}
|
Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 23:43:42 +08:00
|
|
|
if (clone_root->dedupe_in_progress) {
|
|
|
|
dedupe_in_progress_warn(clone_root);
|
|
|
|
spin_unlock(&clone_root->root_item_lock);
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(clone_root);
|
Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 23:43:42 +08:00
|
|
|
ret = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
2015-03-03 04:53:53 +08:00
|
|
|
clone_root->send_in_progress++;
|
2013-12-17 00:34:17 +08:00
|
|
|
spin_unlock(&clone_root->root_item_lock);
|
2014-01-07 17:25:19 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
sctx->clone_roots[i].root = clone_root;
|
2015-03-03 04:53:53 +08:00
|
|
|
clone_sources_to_rollback = i + 1;
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
2016-04-12 00:40:08 +08:00
|
|
|
kvfree(clone_sources_tmp);
|
2012-07-26 05:19:24 +08:00
|
|
|
clone_sources_tmp = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (arg->parent_root) {
|
2020-05-16 01:35:55 +08:00
|
|
|
sctx->parent_root = btrfs_get_fs_root(fs_info, arg->parent_root,
|
|
|
|
true);
|
2013-05-13 22:42:57 +08:00
|
|
|
if (IS_ERR(sctx->parent_root)) {
|
|
|
|
ret = PTR_ERR(sctx->parent_root);
|
2012-07-26 05:19:24 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2014-01-07 17:25:19 +08:00
|
|
|
|
2013-12-17 00:34:17 +08:00
|
|
|
spin_lock(&sctx->parent_root->root_item_lock);
|
|
|
|
sctx->parent_root->send_in_progress++;
|
2014-04-15 22:41:44 +08:00
|
|
|
if (!btrfs_root_readonly(sctx->parent_root) ||
|
|
|
|
btrfs_root_dead(sctx->parent_root)) {
|
2013-12-17 00:34:17 +08:00
|
|
|
spin_unlock(&sctx->parent_root->root_item_lock);
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out;
|
|
|
|
}
|
Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 23:43:42 +08:00
|
|
|
if (sctx->parent_root->dedupe_in_progress) {
|
|
|
|
dedupe_in_progress_warn(sctx->parent_root);
|
|
|
|
spin_unlock(&sctx->parent_root->root_item_lock);
|
|
|
|
ret = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
2013-12-17 00:34:17 +08:00
|
|
|
spin_unlock(&sctx->parent_root->root_item_lock);
|
2012-07-26 05:19:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clones from send_root are allowed, but only if the clone source
|
|
|
|
* is behind the current send position. This is checked while searching
|
|
|
|
* for possible clone sources.
|
|
|
|
*/
|
2020-01-24 22:32:48 +08:00
|
|
|
sctx->clone_roots[sctx->clone_roots_cnt++].root =
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_grab_root(sctx->send_root);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
/* We do a bsearch later */
|
|
|
|
sort(sctx->clone_roots, sctx->clone_roots_cnt,
|
|
|
|
sizeof(*sctx->clone_roots), __clone_root_cmp_sort,
|
|
|
|
NULL);
|
2014-01-07 17:25:18 +08:00
|
|
|
sort_clone_roots = 1;
|
2012-07-26 05:19:24 +08:00
|
|
|
|
Btrfs: send, flush dellaloc in order to avoid data loss
When we set a subvolume to read-only mode we do not flush dellaloc for any
of its inodes (except if the filesystem is mounted with -o flushoncommit),
since it does not affect correctness for any subsequent operations - except
for a future send operation. The send operation will not be able to see the
delalloc data since the respective file extent items, inode item updates,
backreferences, etc, have not hit yet the subvolume and extent trees.
Effectively this means data loss, since the send stream will not contain
any data from existing delalloc. Another problem from this is that if the
writeback starts and finishes while the send operation is in progress, we
have the subvolume tree being being modified concurrently which can result
in send failing unexpectedly with EIO or hitting runtime errors, assertion
failures or hitting BUG_ONs, etc.
Simple reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv
$ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
$ btrfs property set /mnt/sv ro true
$ btrfs send -f /tmp/send.stream /mnt/sv
$ od -t x1 -A d /mnt/sv/foo
0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
*
0110592
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /tmp/send.stream /mnt
$ echo $?
0
$ od -t x1 -A d /mnt/sv/foo
0000000
# ---> empty file
Since this a problem that affects send only, fix it in send by flushing
dellaloc for all the roots used by the send operation before send starts
to process the commit roots.
This is a problem that affects send since it was introduced (commit
31db9f7c23fbf7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
but backporting it to older kernels has some dependencies:
- For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2
("btrfs: use tagged writepage to mitigate livelock of snapshot") because
the function btrfs_start_delalloc_snapshot() does not exist before that
commit. So one has to either pick that commit or replace the calls to
btrfs_start_delalloc_snapshot() in this patch with calls to
btrfs_start_delalloc_inodes().
- For kernels older than 3.19 it also requires commit e5fa8f865b3324
("Btrfs: ensure send always works on roots without orphans") because
it depends on the function ensure_commit_roots_uptodate() which that
commits introduced.
- No dependencies for 5.0+ kernels.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-15 16:29:36 +08:00
|
|
|
ret = flush_delalloc_roots(sctx);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
2014-10-21 18:11:41 +08:00
|
|
|
ret = ensure_commit_roots_uptodate(sctx);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
ret = send_subvol(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
btrfs_lru_cache_for_each_entry_safe(&sctx->dir_utimes_cache, entry, tmp) {
|
|
|
|
ret = send_utimes(sctx, entry->key, entry->gen);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
btrfs_lru_cache_remove(&sctx->dir_utimes_cache, entry);
|
|
|
|
}
|
|
|
|
|
2013-04-11 01:10:52 +08:00
|
|
|
if (!(sctx->flags & BTRFS_SEND_FLAG_OMIT_END_CMD)) {
|
|
|
|
ret = begin_cmd(sctx, BTRFS_SEND_C_END);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
ret = send_cmd(sctx);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
out:
|
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-22 18:00:53 +08:00
|
|
|
WARN_ON(sctx && !ret && !RB_EMPTY_ROOT(&sctx->pending_dir_moves));
|
|
|
|
while (sctx && !RB_EMPTY_ROOT(&sctx->pending_dir_moves)) {
|
|
|
|
struct rb_node *n;
|
|
|
|
struct pending_dir_move *pm;
|
|
|
|
|
|
|
|
n = rb_first(&sctx->pending_dir_moves);
|
|
|
|
pm = rb_entry(n, struct pending_dir_move, node);
|
|
|
|
while (!list_empty(&pm->list)) {
|
|
|
|
struct pending_dir_move *pm2;
|
|
|
|
|
|
|
|
pm2 = list_first_entry(&pm->list,
|
|
|
|
struct pending_dir_move, list);
|
|
|
|
free_pending_move(sctx, pm2);
|
|
|
|
}
|
|
|
|
free_pending_move(sctx, pm);
|
|
|
|
}
|
|
|
|
|
|
|
|
WARN_ON(sctx && !ret && !RB_EMPTY_ROOT(&sctx->waiting_dir_moves));
|
|
|
|
while (sctx && !RB_EMPTY_ROOT(&sctx->waiting_dir_moves)) {
|
|
|
|
struct rb_node *n;
|
|
|
|
struct waiting_dir_move *dm;
|
|
|
|
|
|
|
|
n = rb_first(&sctx->waiting_dir_moves);
|
|
|
|
dm = rb_entry(n, struct waiting_dir_move, node);
|
|
|
|
rb_erase(&dm->node, &sctx->waiting_dir_moves);
|
|
|
|
kfree(dm);
|
|
|
|
}
|
|
|
|
|
2014-02-19 22:31:44 +08:00
|
|
|
WARN_ON(sctx && !ret && !RB_EMPTY_ROOT(&sctx->orphan_dirs));
|
|
|
|
while (sctx && !RB_EMPTY_ROOT(&sctx->orphan_dirs)) {
|
|
|
|
struct rb_node *n;
|
|
|
|
struct orphan_dir_info *odi;
|
|
|
|
|
|
|
|
n = rb_first(&sctx->orphan_dirs);
|
|
|
|
odi = rb_entry(n, struct orphan_dir_info, node);
|
|
|
|
free_orphan_dir_info(sctx, odi);
|
|
|
|
}
|
|
|
|
|
2014-01-07 17:25:18 +08:00
|
|
|
if (sort_clone_roots) {
|
2020-01-24 22:32:48 +08:00
|
|
|
for (i = 0; i < sctx->clone_roots_cnt; i++) {
|
2014-01-07 17:25:18 +08:00
|
|
|
btrfs_root_dec_send_in_progress(
|
|
|
|
sctx->clone_roots[i].root);
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(sctx->clone_roots[i].root);
|
2020-01-24 22:32:48 +08:00
|
|
|
}
|
2014-01-07 17:25:18 +08:00
|
|
|
} else {
|
2020-01-24 22:32:48 +08:00
|
|
|
for (i = 0; sctx && i < clone_sources_to_rollback; i++) {
|
2014-01-07 17:25:18 +08:00
|
|
|
btrfs_root_dec_send_in_progress(
|
|
|
|
sctx->clone_roots[i].root);
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(sctx->clone_roots[i].root);
|
2020-01-24 22:32:48 +08:00
|
|
|
}
|
2014-01-07 17:25:18 +08:00
|
|
|
|
|
|
|
btrfs_root_dec_send_in_progress(send_root);
|
|
|
|
}
|
2020-01-24 22:32:48 +08:00
|
|
|
if (sctx && !IS_ERR_OR_NULL(sctx->parent_root)) {
|
2013-12-17 22:07:20 +08:00
|
|
|
btrfs_root_dec_send_in_progress(sctx->parent_root);
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(sctx->parent_root);
|
2020-01-24 22:32:48 +08:00
|
|
|
}
|
2013-12-17 00:34:17 +08:00
|
|
|
|
2016-04-12 00:40:08 +08:00
|
|
|
kvfree(clone_sources_tmp);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
|
|
|
if (sctx) {
|
|
|
|
if (sctx->send_filp)
|
|
|
|
fput(sctx->send_filp);
|
|
|
|
|
2016-04-12 00:40:08 +08:00
|
|
|
kvfree(sctx->clone_roots);
|
btrfs: send: get send buffer pages for protocol v2
For encoded writes in send v2, we will get the encoded data with
btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
pages. To avoid extra buffers and copies, we should read directly into
the send buffer. Therefore, we need the raw pages for the send buffer.
We currently allocate the send buffer with kvmalloc(), which may return
a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
However, the buffer size we use (144K) is not a power of two, which in
theory is not guaranteed to return a page-aligned buffer, and in
practice would waste a lot of memory due to rounding up to the next
power of two. 144K is large enough that it usually gets allocated with
vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
and save the pages in an array.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-05 01:29:07 +08:00
|
|
|
kfree(sctx->send_buf_pages);
|
2016-04-12 00:40:08 +08:00
|
|
|
kvfree(sctx->send_buf);
|
2022-08-16 04:54:28 +08:00
|
|
|
kvfree(sctx->verity_descriptor);
|
2012-07-26 05:19:24 +08:00
|
|
|
|
btrfs: send: avoid trashing the page cache
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17 18:47:30 +08:00
|
|
|
close_current_inode(sctx);
|
2022-05-06 01:16:14 +08:00
|
|
|
|
2023-01-11 19:36:18 +08:00
|
|
|
btrfs_lru_cache_clear(&sctx->name_cache);
|
2023-01-11 19:36:13 +08:00
|
|
|
btrfs_lru_cache_clear(&sctx->backref_cache);
|
2023-01-11 19:36:15 +08:00
|
|
|
btrfs_lru_cache_clear(&sctx->dir_created_cache);
|
btrfs: send: cache utimes operations for directories if possible
Whenever we add or remove an entry to a directory, we issue an utimes
command for the directory. If we add 1000 entries to a directory (create
1000 files under it or move 1000 files to it), then we issue the same
utimes command 1000 times, which increases the send stream size, results
in more pipe IO, one search in the send b+tree, allocating one path for
the search, etc, as well as making the receiver do a system call for each
duplicated utimes command.
We also issue an utimes command when we create a new directory, but later
we might add entries to it corresponding to inodes with an higher inode
number, so it's pointless to issue the utimes command before we create
the last inode under the directory.
So use a lru cache to track directories for which we must send a utimes
command. When we need to remove an entry from the cache, we issue the
utimes command for the respective directory. When finishing the send
operation, we go over each cache element and issue the respective utimes
command. Finally the caching is entirely optional, just a performance
optimization, meaning that if we fail to cache (due to memory allocation
failure), we issue the utimes command right away, that is, we fallback
to the previous, unoptimized, behaviour.
This patch belongs to a patchset comprised of the following patches:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
The following test was run before and after applying the whole patchset,
and on a non-debug kernel (Debian's default kernel config):
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
mkdir $MNT/A
for ((i = 1; i <= 20000; i++)); do
echo -n > $MNT/A/file_$i
done
btrfs subvolume snapshot -r $MNT $MNT/snap1
mkdir $MNT/B
for ((i = 20000; i <= 40000; i++)); do
echo -n > $MNT/B/file_$i
done
mv $MNT/A/file_* $MNT/B/
btrfs subvolume snapshot -r $MNT $MNT/snap2
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Incremental send took $dur milliseconds"
umount $MNT
Before the whole patchset: 18408 milliseconds
After the whole patchset: 1942 milliseconds (9.5x speedup)
Using 60000 files instead of 40000:
Before the whole patchset: 39764 milliseconds
After the whole patchset: 3076 milliseconds (12.9x speedup)
Using 20000 files instead of 40000:
Before the whole patchset: 5072 milliseconds
After the whole patchset: 916 milliseconds (5.5x speedup)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 19:36:20 +08:00
|
|
|
btrfs_lru_cache_clear(&sctx->dir_utimes_cache);
|
btrfs: send: cache leaf to roots mapping during backref walking
During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.
This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.
This cache also allows for another important optimization that is
introduced in the next patch in the series.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02 00:15:50 +08:00
|
|
|
|
2012-07-26 05:19:24 +08:00
|
|
|
kfree(sctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|