2018-04-04 01:23:33 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2011-03-08 21:14:00 +08:00
|
|
|
/*
|
2012-11-02 23:44:58 +08:00
|
|
|
* Copyright (C) 2011, 2012 STRATO. All rights reserved.
|
2011-03-08 21:14:00 +08:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/blkdev.h>
|
2011-06-14 01:59:12 +08:00
|
|
|
#include <linux/ratelimit.h>
|
2017-06-01 01:21:38 +08:00
|
|
|
#include <linux/sched/mm.h>
|
2019-06-03 22:58:57 +08:00
|
|
|
#include <crypto/hash.h>
|
2011-03-08 21:14:00 +08:00
|
|
|
#include "ctree.h"
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-14 08:22:15 +08:00
|
|
|
#include "discard.h"
|
2011-03-08 21:14:00 +08:00
|
|
|
#include "volumes.h"
|
|
|
|
#include "disk-io.h"
|
|
|
|
#include "ordered-data.h"
|
2011-06-14 02:04:15 +08:00
|
|
|
#include "transaction.h"
|
2011-06-14 01:59:12 +08:00
|
|
|
#include "backref.h"
|
2011-08-05 00:11:04 +08:00
|
|
|
#include "extent_io.h"
|
2012-11-06 18:43:11 +08:00
|
|
|
#include "dev-replace.h"
|
2013-01-30 07:40:14 +08:00
|
|
|
#include "raid56.h"
|
2019-06-21 03:37:44 +08:00
|
|
|
#include "block-group.h"
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 19:26:14 +08:00
|
|
|
#include "zoned.h"
|
2022-10-19 22:50:47 +08:00
|
|
|
#include "fs.h"
|
2022-10-19 22:51:00 +08:00
|
|
|
#include "accessors.h"
|
2022-10-27 03:08:27 +08:00
|
|
|
#include "file-item.h"
|
2022-10-27 03:08:35 +08:00
|
|
|
#include "scrub.h"
|
2023-09-15 00:07:01 +08:00
|
|
|
#include "raid-stripe-tree.h"
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This is only the first step towards a full-features scrub. It reads all
|
|
|
|
* extent and super block and verifies the checksums. In case a bad checksum
|
|
|
|
* is found or the extent cannot be read, good data will be written back if
|
|
|
|
* any can be found.
|
|
|
|
*
|
|
|
|
* Future enhancements:
|
|
|
|
* - In case an unrepairable extent is encountered, track which files are
|
|
|
|
* affected and report them
|
|
|
|
* - track and record media errors, throw out bad devices
|
|
|
|
* - add a mode to also read unallocated space
|
|
|
|
*/
|
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
struct scrub_ctx;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2012-11-06 18:43:11 +08:00
|
|
|
/*
|
2023-03-29 14:54:57 +08:00
|
|
|
* The following value only influences the performance.
|
2021-12-06 13:52:58 +08:00
|
|
|
*
|
2023-12-06 02:26:39 +08:00
|
|
|
* This determines how many stripes would be submitted in one go,
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
* which is 512KiB (BTRFS_STRIPE_LEN * SCRUB_STRIPES_PER_GROUP).
|
2012-11-06 18:43:11 +08:00
|
|
|
*/
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
#define SCRUB_STRIPES_PER_GROUP 8
|
|
|
|
|
|
|
|
/*
|
|
|
|
* How many groups we have for each sctx.
|
|
|
|
*
|
|
|
|
* This would be 8M per device, the same value as the old scrub in-flight bios
|
|
|
|
* size limit.
|
|
|
|
*/
|
|
|
|
#define SCRUB_GROUPS_PER_SCTX 16
|
|
|
|
|
|
|
|
#define SCRUB_TOTAL_STRIPES (SCRUB_GROUPS_PER_SCTX * SCRUB_STRIPES_PER_GROUP)
|
2012-11-02 21:58:04 +08:00
|
|
|
|
|
|
|
/*
|
2021-12-06 13:52:57 +08:00
|
|
|
* The following value times PAGE_SIZE needs to be large enough to match the
|
2012-11-02 21:58:04 +08:00
|
|
|
* largest node/leaf/sector size that shall be supported.
|
|
|
|
*/
|
2022-03-13 18:40:00 +08:00
|
|
|
#define SCRUB_MAX_SECTORS_PER_BLOCK (BTRFS_MAX_METADATA_BLOCKSIZE / SZ_4K)
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2023-03-20 10:12:50 +08:00
|
|
|
/* Represent one sector and its needed info to verify the content. */
|
|
|
|
struct scrub_sector_verification {
|
|
|
|
bool is_metadata;
|
|
|
|
|
|
|
|
union {
|
|
|
|
/*
|
|
|
|
* Csum pointer for data csum verification. Should point to a
|
|
|
|
* sector csum inside scrub_stripe::csums.
|
|
|
|
*
|
|
|
|
* NULL if this data sector has no csum.
|
|
|
|
*/
|
|
|
|
u8 *csum;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Extra info for metadata verification. All sectors inside a
|
|
|
|
* tree block share the same generation.
|
|
|
|
*/
|
|
|
|
u64 generation;
|
|
|
|
};
|
|
|
|
};
|
|
|
|
|
|
|
|
enum scrub_stripe_flags {
|
|
|
|
/* Set when @mirror_num, @dev, @physical and @logical are set. */
|
|
|
|
SCRUB_STRIPE_FLAG_INITIALIZED,
|
|
|
|
|
|
|
|
/* Set when the read-repair is finished. */
|
|
|
|
SCRUB_STRIPE_FLAG_REPAIR_DONE,
|
2023-03-28 16:57:35 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Set for data stripes if it's triggered from P/Q stripe.
|
|
|
|
* During such scrub, we should not report errors in data stripes, nor
|
|
|
|
* update the accounting.
|
|
|
|
*/
|
|
|
|
SCRUB_STRIPE_FLAG_NO_REPORT,
|
2023-03-20 10:12:50 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
#define SCRUB_STRIPE_PAGES (BTRFS_STRIPE_LEN / PAGE_SIZE)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Represent one contiguous range with a length of BTRFS_STRIPE_LEN.
|
|
|
|
*/
|
|
|
|
struct scrub_stripe {
|
2023-03-20 10:12:56 +08:00
|
|
|
struct scrub_ctx *sctx;
|
2023-03-20 10:12:50 +08:00
|
|
|
struct btrfs_block_group *bg;
|
|
|
|
|
|
|
|
struct page *pages[SCRUB_STRIPE_PAGES];
|
|
|
|
struct scrub_sector_verification *sectors;
|
|
|
|
|
|
|
|
struct btrfs_device *dev;
|
|
|
|
u64 logical;
|
|
|
|
u64 physical;
|
|
|
|
|
|
|
|
u16 mirror_num;
|
|
|
|
|
|
|
|
/* Should be BTRFS_STRIPE_LEN / sectorsize. */
|
|
|
|
u16 nr_sectors;
|
|
|
|
|
2023-03-20 10:12:56 +08:00
|
|
|
/*
|
|
|
|
* How many data/meta extents are in this stripe. Only for scrub status
|
|
|
|
* reporting purposes.
|
|
|
|
*/
|
|
|
|
u16 nr_data_extents;
|
|
|
|
u16 nr_meta_extents;
|
|
|
|
|
2023-03-20 10:12:50 +08:00
|
|
|
atomic_t pending_io;
|
|
|
|
wait_queue_head_t io_wait;
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
wait_queue_head_t repair_wait;
|
2023-03-20 10:12:50 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Indicate the states of the stripe. Bits are defined in
|
|
|
|
* scrub_stripe_flags enum.
|
|
|
|
*/
|
|
|
|
unsigned long state;
|
|
|
|
|
|
|
|
/* Indicate which sectors are covered by extent items. */
|
|
|
|
unsigned long extent_sector_bitmap;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The errors hit during the initial read of the stripe.
|
|
|
|
*
|
|
|
|
* Would be utilized for error reporting and repair.
|
btrfs: scrub: also report errors hit during the initial read
[BUG]
After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:
# mkfs.btrfs -f $dev -d DUP
# mount $dev $mnt
# xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
# umount $dev
# xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
# mount $dev $mnt
# btrfs scrub start -BR $mnt
scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
Scrub started: Tue Jun 6 14:56:50 2023
Status: finished
Duration: 0:00:00
data_extents_scrubbed: 2
tree_extents_scrubbed: 18
data_bytes_scrubbed: 131072
tree_bytes_scrubbed: 294912
read_errors: 0
csum_errors: 0 <<< No errors here
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<< Only corrected errors
last_physical: 2723151872
This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.
While on v6.3.x kernels, the report is different:
csum_errors: 16 <<<
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<<
[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.
For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.
Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.
To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.
Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06 15:08:28 +08:00
|
|
|
*
|
|
|
|
* The remaining init_nr_* records the number of errors hit, only used
|
|
|
|
* by error reporting.
|
2023-03-20 10:12:50 +08:00
|
|
|
*/
|
|
|
|
unsigned long init_error_bitmap;
|
btrfs: scrub: also report errors hit during the initial read
[BUG]
After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:
# mkfs.btrfs -f $dev -d DUP
# mount $dev $mnt
# xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
# umount $dev
# xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
# mount $dev $mnt
# btrfs scrub start -BR $mnt
scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
Scrub started: Tue Jun 6 14:56:50 2023
Status: finished
Duration: 0:00:00
data_extents_scrubbed: 2
tree_extents_scrubbed: 18
data_bytes_scrubbed: 131072
tree_bytes_scrubbed: 294912
read_errors: 0
csum_errors: 0 <<< No errors here
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<< Only corrected errors
last_physical: 2723151872
This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.
While on v6.3.x kernels, the report is different:
csum_errors: 16 <<<
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<<
[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.
For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.
Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.
To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.
Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06 15:08:28 +08:00
|
|
|
unsigned int init_nr_io_errors;
|
|
|
|
unsigned int init_nr_csum_errors;
|
|
|
|
unsigned int init_nr_meta_errors;
|
2023-03-20 10:12:50 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The following error bitmaps are all for the current status.
|
|
|
|
* Every time we submit a new read, these bitmaps may be updated.
|
|
|
|
*
|
|
|
|
* error_bitmap = io_error_bitmap | csum_error_bitmap | meta_error_bitmap;
|
|
|
|
*
|
|
|
|
* IO and csum errors can happen for both metadata and data.
|
|
|
|
*/
|
|
|
|
unsigned long error_bitmap;
|
|
|
|
unsigned long io_error_bitmap;
|
|
|
|
unsigned long csum_error_bitmap;
|
|
|
|
unsigned long meta_error_bitmap;
|
|
|
|
|
2023-03-20 10:12:55 +08:00
|
|
|
/* For writeback (repair or replace) error reporting. */
|
|
|
|
unsigned long write_error_bitmap;
|
|
|
|
|
|
|
|
/* Writeback can be concurrent, thus we need to protect the bitmap. */
|
|
|
|
spinlock_t write_error_lock;
|
|
|
|
|
2023-03-20 10:12:50 +08:00
|
|
|
/*
|
|
|
|
* Checksum for the whole stripe if this stripe is inside a data block
|
|
|
|
* group.
|
|
|
|
*/
|
|
|
|
u8 *csums;
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
|
|
|
|
struct work_struct work;
|
2023-03-20 10:12:50 +08:00
|
|
|
};
|
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
struct scrub_ctx {
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
struct scrub_stripe stripes[SCRUB_TOTAL_STRIPES];
|
2023-03-28 16:57:35 +08:00
|
|
|
struct scrub_stripe *raid56_data_stripes;
|
2016-06-23 06:54:56 +08:00
|
|
|
struct btrfs_fs_info *fs_info;
|
2023-08-03 14:33:29 +08:00
|
|
|
struct btrfs_path extent_path;
|
2023-08-03 14:33:30 +08:00
|
|
|
struct btrfs_path csum_path;
|
2011-03-08 21:14:00 +08:00
|
|
|
int first_free;
|
2023-03-20 10:12:57 +08:00
|
|
|
int cur_stripe;
|
2011-03-08 21:14:00 +08:00
|
|
|
atomic_t cancel_req;
|
2011-03-23 23:34:19 +08:00
|
|
|
int readonly;
|
2012-11-06 01:29:28 +08:00
|
|
|
|
2019-10-09 19:58:13 +08:00
|
|
|
/* State of IO submission throttling affecting the associated device */
|
|
|
|
ktime_t throttle_deadline;
|
|
|
|
u64 throttle_sent;
|
|
|
|
|
2012-11-06 01:29:28 +08:00
|
|
|
int is_dev_replace;
|
2021-02-04 18:22:13 +08:00
|
|
|
u64 write_pointer;
|
2017-05-17 01:10:32 +08:00
|
|
|
|
|
|
|
struct mutex wr_lock;
|
|
|
|
struct btrfs_device *wr_tgtdev;
|
2012-11-06 01:29:28 +08:00
|
|
|
|
2011-03-08 21:14:00 +08:00
|
|
|
/*
|
|
|
|
* statistics
|
|
|
|
*/
|
|
|
|
struct btrfs_scrub_progress stat;
|
|
|
|
spinlock_t stat_lock;
|
2015-02-10 05:14:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Use a ref counter to avoid use-after-free issues. Scrub workers
|
|
|
|
* decrement bios_in_flight and workers_pending and then do a wakeup
|
|
|
|
* on the list_wait wait queue. We must ensure the main scrub task
|
|
|
|
* doesn't free the scrub context before or while the workers are
|
|
|
|
* doing the wakeup() call.
|
|
|
|
*/
|
2017-03-03 16:55:25 +08:00
|
|
|
refcount_t refs;
|
2011-03-08 21:14:00 +08:00
|
|
|
};
|
|
|
|
|
2011-06-14 01:59:12 +08:00
|
|
|
struct scrub_warning {
|
|
|
|
struct btrfs_path *path;
|
|
|
|
u64 extent_item_size;
|
|
|
|
const char *errstr;
|
2017-10-04 23:07:07 +08:00
|
|
|
u64 physical;
|
2011-06-14 01:59:12 +08:00
|
|
|
u64 logical;
|
|
|
|
struct btrfs_device *dev;
|
|
|
|
};
|
|
|
|
|
2023-03-20 10:12:50 +08:00
|
|
|
static void release_scrub_stripe(struct scrub_stripe *stripe)
|
|
|
|
{
|
|
|
|
if (!stripe)
|
|
|
|
return;
|
|
|
|
|
|
|
|
for (int i = 0; i < SCRUB_STRIPE_PAGES; i++) {
|
|
|
|
if (stripe->pages[i])
|
|
|
|
__free_page(stripe->pages[i]);
|
|
|
|
stripe->pages[i] = NULL;
|
|
|
|
}
|
|
|
|
kfree(stripe->sectors);
|
|
|
|
kfree(stripe->csums);
|
|
|
|
stripe->sectors = NULL;
|
|
|
|
stripe->csums = NULL;
|
2023-03-20 10:12:56 +08:00
|
|
|
stripe->sctx = NULL;
|
2023-03-20 10:12:50 +08:00
|
|
|
stripe->state = 0;
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:57 +08:00
|
|
|
static int init_scrub_stripe(struct btrfs_fs_info *fs_info,
|
|
|
|
struct scrub_stripe *stripe)
|
2023-03-20 10:12:50 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
memset(stripe, 0, sizeof(*stripe));
|
|
|
|
|
|
|
|
stripe->nr_sectors = BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits;
|
|
|
|
stripe->state = 0;
|
|
|
|
|
|
|
|
init_waitqueue_head(&stripe->io_wait);
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
init_waitqueue_head(&stripe->repair_wait);
|
2023-03-20 10:12:50 +08:00
|
|
|
atomic_set(&stripe->pending_io, 0);
|
2023-03-20 10:12:55 +08:00
|
|
|
spin_lock_init(&stripe->write_error_lock);
|
2023-03-20 10:12:50 +08:00
|
|
|
|
2023-11-30 06:32:08 +08:00
|
|
|
ret = btrfs_alloc_page_array(SCRUB_STRIPE_PAGES, stripe->pages, 0);
|
2023-03-20 10:12:50 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
stripe->sectors = kcalloc(stripe->nr_sectors,
|
|
|
|
sizeof(struct scrub_sector_verification),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!stripe->sectors)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
stripe->csums = kcalloc(BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits,
|
|
|
|
fs_info->csum_size, GFP_KERNEL);
|
|
|
|
if (!stripe->csums)
|
|
|
|
goto error;
|
|
|
|
return 0;
|
|
|
|
error:
|
|
|
|
release_scrub_stripe(stripe);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
static void wait_scrub_stripe_io(struct scrub_stripe *stripe)
|
2023-03-20 10:12:50 +08:00
|
|
|
{
|
|
|
|
wait_event(stripe->io_wait, atomic_read(&stripe->pending_io) == 0);
|
|
|
|
}
|
|
|
|
|
2015-02-10 05:14:24 +08:00
|
|
|
static void scrub_put_ctx(struct scrub_ctx *sctx);
|
2012-03-28 02:21:26 +08:00
|
|
|
|
2013-12-04 21:16:53 +08:00
|
|
|
static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
|
2013-12-04 21:15:19 +08:00
|
|
|
{
|
|
|
|
while (atomic_read(&fs_info->scrub_pause_req)) {
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
wait_event(fs_info->scrub_pause_wait,
|
|
|
|
atomic_read(&fs_info->scrub_pause_req) == 0);
|
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-08-05 16:43:28 +08:00
|
|
|
static void scrub_pause_on(struct btrfs_fs_info *fs_info)
|
2013-12-04 21:16:53 +08:00
|
|
|
{
|
|
|
|
atomic_inc(&fs_info->scrubs_paused);
|
|
|
|
wake_up(&fs_info->scrub_pause_wait);
|
2015-08-05 16:43:28 +08:00
|
|
|
}
|
2013-12-04 21:16:53 +08:00
|
|
|
|
2015-08-05 16:43:28 +08:00
|
|
|
static void scrub_pause_off(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
2013-12-04 21:16:53 +08:00
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
|
|
|
__scrub_blocked_if_needed(fs_info);
|
|
|
|
atomic_dec(&fs_info->scrubs_paused);
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
|
|
|
|
wake_up(&fs_info->scrub_pause_wait);
|
|
|
|
}
|
|
|
|
|
2015-08-05 16:43:28 +08:00
|
|
|
static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
scrub_pause_on(fs_info);
|
|
|
|
scrub_pause_off(fs_info);
|
|
|
|
}
|
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
if (!sctx)
|
2011-03-08 21:14:00 +08:00
|
|
|
return;
|
|
|
|
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
for (i = 0; i < SCRUB_TOTAL_STRIPES; i++)
|
2023-03-20 10:12:57 +08:00
|
|
|
release_scrub_stripe(&sctx->stripes[i]);
|
|
|
|
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
kvfree(sctx);
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
|
|
|
|
2015-02-10 05:14:24 +08:00
|
|
|
static void scrub_put_ctx(struct scrub_ctx *sctx)
|
|
|
|
{
|
2017-03-03 16:55:25 +08:00
|
|
|
if (refcount_dec_and_test(&sctx->refs))
|
2015-02-10 05:14:24 +08:00
|
|
|
scrub_free_ctx(sctx);
|
|
|
|
}
|
|
|
|
|
2018-12-04 23:11:55 +08:00
|
|
|
static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
|
|
|
|
struct btrfs_fs_info *fs_info, int is_dev_replace)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
struct scrub_ctx *sctx;
|
2011-03-08 21:14:00 +08:00
|
|
|
int i;
|
|
|
|
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
/* Since sctx has inline 128 stripes, it can go beyond 64K easily. Use
|
|
|
|
* kvzalloc().
|
|
|
|
*/
|
|
|
|
sctx = kvzalloc(sizeof(*sctx), GFP_KERNEL);
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
if (!sctx)
|
2011-03-08 21:14:00 +08:00
|
|
|
goto nomem;
|
2017-03-03 16:55:25 +08:00
|
|
|
refcount_set(&sctx->refs, 1);
|
2012-11-06 01:29:28 +08:00
|
|
|
sctx->is_dev_replace = is_dev_replace;
|
2018-12-04 23:11:55 +08:00
|
|
|
sctx->fs_info = fs_info;
|
2023-08-03 14:33:29 +08:00
|
|
|
sctx->extent_path.search_commit_root = 1;
|
|
|
|
sctx->extent_path.skip_locking = 1;
|
2023-08-03 14:33:30 +08:00
|
|
|
sctx->csum_path.search_commit_root = 1;
|
|
|
|
sctx->csum_path.skip_locking = 1;
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
for (i = 0; i < SCRUB_TOTAL_STRIPES; i++) {
|
2023-03-20 10:12:57 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = init_scrub_stripe(fs_info, &sctx->stripes[i]);
|
|
|
|
if (ret < 0)
|
|
|
|
goto nomem;
|
|
|
|
sctx->stripes[i].sctx = sctx;
|
|
|
|
}
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
sctx->first_free = 0;
|
|
|
|
atomic_set(&sctx->cancel_req, 0);
|
|
|
|
|
|
|
|
spin_lock_init(&sctx->stat_lock);
|
2019-10-09 19:58:13 +08:00
|
|
|
sctx->throttle_deadline = 0;
|
2012-11-06 18:43:11 +08:00
|
|
|
|
2017-05-17 01:10:32 +08:00
|
|
|
mutex_init(&sctx->wr_lock);
|
2017-05-17 01:10:23 +08:00
|
|
|
if (is_dev_replace) {
|
2017-06-26 21:19:00 +08:00
|
|
|
WARN_ON(!fs_info->dev_replace.tgtdev);
|
|
|
|
sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
|
2012-11-06 18:43:11 +08:00
|
|
|
}
|
2017-05-17 01:10:23 +08:00
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
return sctx;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
nomem:
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
scrub_free_ctx(sctx);
|
2011-03-08 21:14:00 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
|
2022-11-02 00:15:45 +08:00
|
|
|
static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
|
|
|
|
u64 root, void *warn_ctx)
|
2011-06-14 01:59:12 +08:00
|
|
|
{
|
|
|
|
u32 nlink;
|
|
|
|
int ret;
|
|
|
|
int i;
|
2017-06-01 01:21:38 +08:00
|
|
|
unsigned nofs_flag;
|
2011-06-14 01:59:12 +08:00
|
|
|
struct extent_buffer *eb;
|
|
|
|
struct btrfs_inode_item *inode_item;
|
2012-11-06 18:43:11 +08:00
|
|
|
struct scrub_warning *swarn = warn_ctx;
|
2016-06-23 06:54:56 +08:00
|
|
|
struct btrfs_fs_info *fs_info = swarn->dev->fs_info;
|
2011-06-14 01:59:12 +08:00
|
|
|
struct inode_fs_paths *ipath = NULL;
|
|
|
|
struct btrfs_root *local_root;
|
2015-01-03 02:36:14 +08:00
|
|
|
struct btrfs_key key;
|
2011-06-14 01:59:12 +08:00
|
|
|
|
2020-05-16 01:35:55 +08:00
|
|
|
local_root = btrfs_get_fs_root(fs_info, root, true);
|
2011-06-14 01:59:12 +08:00
|
|
|
if (IS_ERR(local_root)) {
|
|
|
|
ret = PTR_ERR(local_root);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
2015-01-03 01:55:46 +08:00
|
|
|
/*
|
|
|
|
* this makes the path point to (inum INODE_ITEM ioff)
|
|
|
|
*/
|
2015-01-03 02:36:14 +08:00
|
|
|
key.objectid = inum;
|
|
|
|
key.type = BTRFS_INODE_ITEM_KEY;
|
|
|
|
key.offset = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, local_root, &key, swarn->path, 0, 0);
|
2011-06-14 01:59:12 +08:00
|
|
|
if (ret) {
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(local_root);
|
2011-06-14 01:59:12 +08:00
|
|
|
btrfs_release_path(swarn->path);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
eb = swarn->path->nodes[0];
|
|
|
|
inode_item = btrfs_item_ptr(eb, swarn->path->slots[0],
|
|
|
|
struct btrfs_inode_item);
|
|
|
|
nlink = btrfs_inode_nlink(eb, inode_item);
|
|
|
|
btrfs_release_path(swarn->path);
|
|
|
|
|
2017-06-01 01:21:38 +08:00
|
|
|
/*
|
|
|
|
* init_path might indirectly call vmalloc, or use GFP_KERNEL. Scrub
|
|
|
|
* uses GFP_NOFS in this context, so we keep it consistent but it does
|
|
|
|
* not seem to be strictly necessary.
|
|
|
|
*/
|
|
|
|
nofs_flag = memalloc_nofs_save();
|
2011-06-14 01:59:12 +08:00
|
|
|
ipath = init_ipath(4096, local_root, swarn->path);
|
2017-06-01 01:21:38 +08:00
|
|
|
memalloc_nofs_restore(nofs_flag);
|
2011-11-16 16:28:01 +08:00
|
|
|
if (IS_ERR(ipath)) {
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(local_root);
|
2011-11-16 16:28:01 +08:00
|
|
|
ret = PTR_ERR(ipath);
|
|
|
|
ipath = NULL;
|
|
|
|
goto err;
|
|
|
|
}
|
2011-06-14 01:59:12 +08:00
|
|
|
ret = paths_from_inode(inum, ipath);
|
|
|
|
|
|
|
|
if (ret < 0)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* we deliberately ignore the bit ipath might have been too small to
|
|
|
|
* hold all of the paths here
|
|
|
|
*/
|
|
|
|
for (i = 0; i < ipath->fspath->elem_cnt; ++i)
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE
[BUG]
For the following file layout, scrub will not be able to repair all
these two repairable error, but in fact make one corruption even
unrepairable:
inode offset 0 4k 8K
Mirror 1 |XXXXXX| |
Mirror 2 | |XXXXXX|
[CAUSE]
The root cause is the hard coded PAGE_SIZE, which makes scrub repair to
go crazy for subpage.
For above case, when reading the first sector, we use PAGE_SIZE other
than sectorsize to read, which makes us to read the full range [0, 64K).
In fact, after 8K there may be no data at all, we can just get some
garbage.
Then when doing the repair, we also writeback a full page from mirror 2,
this means, we will also writeback the corrupted data in mirror 2 back
to mirror 1, leaving the range [4K, 8K) unrepairable.
[FIX]
This patch will modify the following PAGE_SIZE use with sectorsize:
- scrub_print_warning_inode()
Remove the min() and replace PAGE_SIZE with sectorsize.
The min() makes no sense, as csum is done for the full sector with
padding.
This fixes a bug that subpage report extra length like:
checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
Where the error is only 1 sector.
- scrub_handle_errored_block()
Comments with PAGE|page involved, all changed to sector.
- scrub_setup_recheck_block()
- scrub_repair_page_from_good_copy()
- scrub_add_page_to_wr_bio()
- scrub_wr_submit()
- scrub_add_page_to_rd_bio()
- scrub_block_complete()
Replace PAGE_SIZE with sectorsize.
This solves several problems where we read/write extra range for
subpage case.
RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
usage, and is not really safe enough.
Thus we will reject RAID56 for subpage in later commit.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-22 19:02:46 +08:00
|
|
|
"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %u, links %u (path: %s)",
|
2016-09-20 22:05:00 +08:00
|
|
|
swarn->errstr, swarn->logical,
|
2022-11-13 09:32:07 +08:00
|
|
|
btrfs_dev_name(swarn->dev),
|
2017-10-04 23:07:07 +08:00
|
|
|
swarn->physical,
|
2016-09-20 22:05:00 +08:00
|
|
|
root, inum, offset,
|
btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE
[BUG]
For the following file layout, scrub will not be able to repair all
these two repairable error, but in fact make one corruption even
unrepairable:
inode offset 0 4k 8K
Mirror 1 |XXXXXX| |
Mirror 2 | |XXXXXX|
[CAUSE]
The root cause is the hard coded PAGE_SIZE, which makes scrub repair to
go crazy for subpage.
For above case, when reading the first sector, we use PAGE_SIZE other
than sectorsize to read, which makes us to read the full range [0, 64K).
In fact, after 8K there may be no data at all, we can just get some
garbage.
Then when doing the repair, we also writeback a full page from mirror 2,
this means, we will also writeback the corrupted data in mirror 2 back
to mirror 1, leaving the range [4K, 8K) unrepairable.
[FIX]
This patch will modify the following PAGE_SIZE use with sectorsize:
- scrub_print_warning_inode()
Remove the min() and replace PAGE_SIZE with sectorsize.
The min() makes no sense, as csum is done for the full sector with
padding.
This fixes a bug that subpage report extra length like:
checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
Where the error is only 1 sector.
- scrub_handle_errored_block()
Comments with PAGE|page involved, all changed to sector.
- scrub_setup_recheck_block()
- scrub_repair_page_from_good_copy()
- scrub_add_page_to_wr_bio()
- scrub_wr_submit()
- scrub_add_page_to_rd_bio()
- scrub_block_complete()
Replace PAGE_SIZE with sectorsize.
This solves several problems where we read/write extra range for
subpage case.
RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
usage, and is not really safe enough.
Thus we will reject RAID56 for subpage in later commit.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-22 19:02:46 +08:00
|
|
|
fs_info->sectorsize, nlink,
|
2016-09-20 22:05:00 +08:00
|
|
|
(char *)(unsigned long)ipath->fspath->val[i]);
|
2011-06-14 01:59:12 +08:00
|
|
|
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(local_root);
|
2011-06-14 01:59:12 +08:00
|
|
|
free_ipath(ipath);
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err:
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
2017-10-04 23:07:07 +08:00
|
|
|
"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu: path resolving failed with ret=%d",
|
2016-09-20 22:05:00 +08:00
|
|
|
swarn->errstr, swarn->logical,
|
2022-11-13 09:32:07 +08:00
|
|
|
btrfs_dev_name(swarn->dev),
|
2017-10-04 23:07:07 +08:00
|
|
|
swarn->physical,
|
2016-09-20 22:05:00 +08:00
|
|
|
root, inum, offset, ret);
|
2011-06-14 01:59:12 +08:00
|
|
|
|
|
|
|
free_ipath(ipath);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:56 +08:00
|
|
|
static void scrub_print_common_warning(const char *errstr, struct btrfs_device *dev,
|
|
|
|
bool is_super, u64 logical, u64 physical)
|
2011-06-14 01:59:12 +08:00
|
|
|
{
|
2023-03-20 10:12:56 +08:00
|
|
|
struct btrfs_fs_info *fs_info = dev->fs_info;
|
2011-06-14 01:59:12 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct scrub_warning swarn;
|
2012-09-08 10:01:28 +08:00
|
|
|
u64 flags = 0;
|
|
|
|
u32 item_size;
|
|
|
|
int ret;
|
2011-06-14 01:59:12 +08:00
|
|
|
|
btrfs: scrub: properly report super block errors in system log
[PROBLEM]
Unlike data/metadata corruption, if scrub detected some error in the
super block, the only error message is from the updated device status:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
This is not helpful at all.
[CAUSE]
Unlike data/metadata error reporting, there is no visible report in
kernel dmesg to report supper block errors.
In fact, return value of scrub_checksum_super() is intentionally
skipped, thus scrub_handle_errored_block() will never be called for
super blocks.
[FIX]
Make super block errors to output an error message, now the full
dmesg would looks like this:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
BTRFS info (device dm-1): scrub: started on devid 2
This fix involves:
- Move the super_errors reporting to scrub_handle_errored_block()
This allows the device status message to show after the super block
error message.
But now we no longer distinguish super block corruption and generation
mismatch, now all counted as corruption.
- Properly check the return value from scrub_checksum_super()
- Add extra super block error reporting for scrub_print_warning().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-02 14:53:02 +08:00
|
|
|
/* Super block error, no need to search extent tree. */
|
2023-03-20 10:12:56 +08:00
|
|
|
if (is_super) {
|
btrfs: scrub: properly report super block errors in system log
[PROBLEM]
Unlike data/metadata corruption, if scrub detected some error in the
super block, the only error message is from the updated device status:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
This is not helpful at all.
[CAUSE]
Unlike data/metadata error reporting, there is no visible report in
kernel dmesg to report supper block errors.
In fact, return value of scrub_checksum_super() is intentionally
skipped, thus scrub_handle_errored_block() will never be called for
super blocks.
[FIX]
Make super block errors to output an error message, now the full
dmesg would looks like this:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
BTRFS info (device dm-1): scrub: started on devid 2
This fix involves:
- Move the super_errors reporting to scrub_handle_errored_block()
This allows the device status message to show after the super block
error message.
But now we no longer distinguish super block corruption and generation
mismatch, now all counted as corruption.
- Properly check the return value from scrub_checksum_super()
- Add extra super block error reporting for scrub_print_warning().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-02 14:53:02 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info, "%s on device %s, physical %llu",
|
2023-03-20 10:12:56 +08:00
|
|
|
errstr, btrfs_dev_name(dev), physical);
|
btrfs: scrub: properly report super block errors in system log
[PROBLEM]
Unlike data/metadata corruption, if scrub detected some error in the
super block, the only error message is from the updated device status:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
This is not helpful at all.
[CAUSE]
Unlike data/metadata error reporting, there is no visible report in
kernel dmesg to report supper block errors.
In fact, return value of scrub_checksum_super() is intentionally
skipped, thus scrub_handle_errored_block() will never be called for
super blocks.
[FIX]
Make super block errors to output an error message, now the full
dmesg would looks like this:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
BTRFS info (device dm-1): scrub: started on devid 2
This fix involves:
- Move the super_errors reporting to scrub_handle_errored_block()
This allows the device status message to show after the super block
error message.
But now we no longer distinguish super block corruption and generation
mismatch, now all counted as corruption.
- Properly check the return value from scrub_checksum_super()
- Add extra super block error reporting for scrub_print_warning().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-02 14:53:02 +08:00
|
|
|
return;
|
|
|
|
}
|
2011-06-14 01:59:12 +08:00
|
|
|
path = btrfs_alloc_path();
|
2014-07-30 07:25:30 +08:00
|
|
|
if (!path)
|
|
|
|
return;
|
2011-06-14 01:59:12 +08:00
|
|
|
|
2023-03-20 10:12:56 +08:00
|
|
|
swarn.physical = physical;
|
|
|
|
swarn.logical = logical;
|
2011-06-14 01:59:12 +08:00
|
|
|
swarn.errstr = errstr;
|
2012-11-02 20:26:57 +08:00
|
|
|
swarn.dev = NULL;
|
2011-06-14 01:59:12 +08:00
|
|
|
|
2012-09-08 10:01:28 +08:00
|
|
|
ret = extent_from_logical(fs_info, swarn.logical, path, &found_key,
|
|
|
|
&flags);
|
2011-06-14 01:59:12 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
swarn.extent_item_size = found_key.offset;
|
|
|
|
|
|
|
|
eb = path->nodes[0];
|
|
|
|
ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item);
|
2021-10-22 02:58:35 +08:00
|
|
|
item_size = btrfs_item_size(eb, path->slots[0]);
|
2011-06-14 01:59:12 +08:00
|
|
|
|
2012-09-08 10:01:28 +08:00
|
|
|
if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
|
2023-05-11 16:01:44 +08:00
|
|
|
unsigned long ptr = 0;
|
|
|
|
u8 ref_level;
|
|
|
|
u64 ref_root;
|
|
|
|
|
|
|
|
while (true) {
|
2014-06-09 10:54:07 +08:00
|
|
|
ret = tree_backref_for_extent(&ptr, eb, &found_key, ei,
|
|
|
|
item_size, &ref_root,
|
|
|
|
&ref_level);
|
2023-05-11 16:01:44 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"failed to resolve tree backref for logical %llu: %d",
|
|
|
|
swarn.logical, ret);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (ret > 0)
|
|
|
|
break;
|
2015-10-08 15:01:03 +08:00
|
|
|
btrfs_warn_in_rcu(fs_info,
|
2017-10-04 23:07:07 +08:00
|
|
|
"%s at logical %llu on dev %s, physical %llu: metadata %s (level %d) in tree %llu",
|
2023-05-11 16:01:44 +08:00
|
|
|
errstr, swarn.logical, btrfs_dev_name(dev),
|
|
|
|
swarn.physical, (ref_level ? "node" : "leaf"),
|
|
|
|
ref_level, ref_root);
|
|
|
|
}
|
2013-03-29 22:09:34 +08:00
|
|
|
btrfs_release_path(path);
|
2011-06-14 01:59:12 +08:00
|
|
|
} else {
|
2022-11-02 00:15:47 +08:00
|
|
|
struct btrfs_backref_walk_ctx ctx = { 0 };
|
|
|
|
|
2013-03-29 22:09:34 +08:00
|
|
|
btrfs_release_path(path);
|
2022-11-02 00:15:47 +08:00
|
|
|
|
|
|
|
ctx.bytenr = found_key.objectid;
|
|
|
|
ctx.extent_item_pos = swarn.logical - found_key.objectid;
|
|
|
|
ctx.fs_info = fs_info;
|
|
|
|
|
2011-06-14 01:59:12 +08:00
|
|
|
swarn.path = path;
|
2012-11-02 20:26:57 +08:00
|
|
|
swarn.dev = dev;
|
2022-11-02 00:15:47 +08:00
|
|
|
|
|
|
|
iterate_extent_inodes(&ctx, true, scrub_print_warning_inode, &swarn);
|
2011-06-14 01:59:12 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
}
|
|
|
|
|
2021-02-04 18:22:13 +08:00
|
|
|
static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
u64 length;
|
|
|
|
|
|
|
|
if (!btrfs_is_zoned(sctx->fs_info))
|
|
|
|
return 0;
|
|
|
|
|
2021-02-04 18:22:14 +08:00
|
|
|
if (!btrfs_dev_is_sequential(sctx->wr_tgtdev, physical))
|
|
|
|
return 0;
|
|
|
|
|
2021-02-04 18:22:13 +08:00
|
|
|
if (sctx->write_pointer < physical) {
|
|
|
|
length = physical - sctx->write_pointer;
|
|
|
|
|
|
|
|
ret = btrfs_zoned_issue_zeroout(sctx->wr_tgtdev,
|
|
|
|
sctx->write_pointer, length);
|
|
|
|
if (!ret)
|
|
|
|
sctx->write_pointer = physical;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:52 +08:00
|
|
|
static struct page *scrub_stripe_get_page(struct scrub_stripe *stripe, int sector_nr)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
int page_index = (sector_nr << fs_info->sectorsize_bits) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
return stripe->pages[page_index];
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned int scrub_stripe_get_page_offset(struct scrub_stripe *stripe,
|
|
|
|
int sector_nr)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
|
|
|
|
return offset_in_page(sector_nr << fs_info->sectorsize_bits);
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:53 +08:00
|
|
|
static void scrub_verify_one_metadata(struct scrub_stripe *stripe, int sector_nr)
|
2023-03-20 10:12:52 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
const u32 sectors_per_tree = fs_info->nodesize >> fs_info->sectorsize_bits;
|
|
|
|
const u64 logical = stripe->logical + (sector_nr << fs_info->sectorsize_bits);
|
|
|
|
const struct page *first_page = scrub_stripe_get_page(stripe, sector_nr);
|
|
|
|
const unsigned int first_off = scrub_stripe_get_page_offset(stripe, sector_nr);
|
|
|
|
SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
|
|
|
|
u8 on_disk_csum[BTRFS_CSUM_SIZE];
|
|
|
|
u8 calculated_csum[BTRFS_CSUM_SIZE];
|
|
|
|
struct btrfs_header *header;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here we don't have a good way to attach the pages (and subpages)
|
|
|
|
* to a dummy extent buffer, thus we have to directly grab the members
|
|
|
|
* from pages.
|
|
|
|
*/
|
|
|
|
header = (struct btrfs_header *)(page_address(first_page) + first_off);
|
|
|
|
memcpy(on_disk_csum, header->csum, fs_info->csum_size);
|
|
|
|
|
|
|
|
if (logical != btrfs_stack_header_bytenr(header)) {
|
|
|
|
bitmap_set(&stripe->csum_error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
bitmap_set(&stripe->error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
btrfs_warn_rl(fs_info,
|
|
|
|
"tree block %llu mirror %u has bad bytenr, has %llu want %llu",
|
|
|
|
logical, stripe->mirror_num,
|
|
|
|
btrfs_stack_header_bytenr(header), logical);
|
|
|
|
return;
|
|
|
|
}
|
2023-07-28 14:48:13 +08:00
|
|
|
if (memcmp(header->fsid, fs_info->fs_devices->metadata_uuid,
|
|
|
|
BTRFS_FSID_SIZE) != 0) {
|
2023-03-20 10:12:52 +08:00
|
|
|
bitmap_set(&stripe->meta_error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
bitmap_set(&stripe->error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
btrfs_warn_rl(fs_info,
|
|
|
|
"tree block %llu mirror %u has bad fsid, has %pU want %pU",
|
|
|
|
logical, stripe->mirror_num,
|
|
|
|
header->fsid, fs_info->fs_devices->fsid);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (memcmp(header->chunk_tree_uuid, fs_info->chunk_tree_uuid,
|
|
|
|
BTRFS_UUID_SIZE) != 0) {
|
|
|
|
bitmap_set(&stripe->meta_error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
bitmap_set(&stripe->error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
btrfs_warn_rl(fs_info,
|
|
|
|
"tree block %llu mirror %u has bad chunk tree uuid, has %pU want %pU",
|
|
|
|
logical, stripe->mirror_num,
|
|
|
|
header->chunk_tree_uuid, fs_info->chunk_tree_uuid);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now check tree block csum. */
|
|
|
|
shash->tfm = fs_info->csum_shash;
|
|
|
|
crypto_shash_init(shash);
|
|
|
|
crypto_shash_update(shash, page_address(first_page) + first_off +
|
|
|
|
BTRFS_CSUM_SIZE, fs_info->sectorsize - BTRFS_CSUM_SIZE);
|
|
|
|
|
|
|
|
for (int i = sector_nr + 1; i < sector_nr + sectors_per_tree; i++) {
|
|
|
|
struct page *page = scrub_stripe_get_page(stripe, i);
|
|
|
|
unsigned int page_off = scrub_stripe_get_page_offset(stripe, i);
|
|
|
|
|
|
|
|
crypto_shash_update(shash, page_address(page) + page_off,
|
|
|
|
fs_info->sectorsize);
|
|
|
|
}
|
|
|
|
|
|
|
|
crypto_shash_final(shash, calculated_csum);
|
|
|
|
if (memcmp(calculated_csum, on_disk_csum, fs_info->csum_size) != 0) {
|
|
|
|
bitmap_set(&stripe->meta_error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
bitmap_set(&stripe->error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
btrfs_warn_rl(fs_info,
|
|
|
|
"tree block %llu mirror %u has bad csum, has " CSUM_FMT " want " CSUM_FMT,
|
|
|
|
logical, stripe->mirror_num,
|
|
|
|
CSUM_FMT_VALUE(fs_info->csum_size, on_disk_csum),
|
|
|
|
CSUM_FMT_VALUE(fs_info->csum_size, calculated_csum));
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (stripe->sectors[sector_nr].generation !=
|
|
|
|
btrfs_stack_header_generation(header)) {
|
|
|
|
bitmap_set(&stripe->meta_error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
bitmap_set(&stripe->error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
btrfs_warn_rl(fs_info,
|
|
|
|
"tree block %llu mirror %u has bad generation, has %llu want %llu",
|
|
|
|
logical, stripe->mirror_num,
|
|
|
|
btrfs_stack_header_generation(header),
|
|
|
|
stripe->sectors[sector_nr].generation);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
bitmap_clear(&stripe->error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
bitmap_clear(&stripe->csum_error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
bitmap_clear(&stripe->meta_error_bitmap, sector_nr, sectors_per_tree);
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:53 +08:00
|
|
|
static void scrub_verify_one_sector(struct scrub_stripe *stripe, int sector_nr)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
struct scrub_sector_verification *sector = &stripe->sectors[sector_nr];
|
|
|
|
const u32 sectors_per_tree = fs_info->nodesize >> fs_info->sectorsize_bits;
|
|
|
|
struct page *page = scrub_stripe_get_page(stripe, sector_nr);
|
|
|
|
unsigned int pgoff = scrub_stripe_get_page_offset(stripe, sector_nr);
|
|
|
|
u8 csum_buf[BTRFS_CSUM_SIZE];
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ASSERT(sector_nr >= 0 && sector_nr < stripe->nr_sectors);
|
|
|
|
|
|
|
|
/* Sector not utilized, skip it. */
|
|
|
|
if (!test_bit(sector_nr, &stripe->extent_sector_bitmap))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* IO error, no need to check. */
|
|
|
|
if (test_bit(sector_nr, &stripe->io_error_bitmap))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Metadata, verify the full tree block. */
|
|
|
|
if (sector->is_metadata) {
|
|
|
|
/*
|
2023-12-06 02:26:39 +08:00
|
|
|
* Check if the tree block crosses the stripe boundary. If
|
2023-03-20 10:12:53 +08:00
|
|
|
* crossed the boundary, we cannot verify it but only give a
|
|
|
|
* warning.
|
|
|
|
*
|
|
|
|
* This can only happen on a very old filesystem where chunks
|
|
|
|
* are not ensured to be stripe aligned.
|
|
|
|
*/
|
|
|
|
if (unlikely(sector_nr + sectors_per_tree > stripe->nr_sectors)) {
|
|
|
|
btrfs_warn_rl(fs_info,
|
|
|
|
"tree block at %llu crosses stripe boundary %llu",
|
|
|
|
stripe->logical +
|
|
|
|
(sector_nr << fs_info->sectorsize_bits),
|
|
|
|
stripe->logical);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
scrub_verify_one_metadata(stripe, sector_nr);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Data is easier, we just verify the data csum (if we have it). For
|
|
|
|
* cases without csum, we have no other choice but to trust it.
|
|
|
|
*/
|
|
|
|
if (!sector->csum) {
|
|
|
|
clear_bit(sector_nr, &stripe->error_bitmap);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_check_sector_csum(fs_info, page, pgoff, csum_buf, sector->csum);
|
|
|
|
if (ret < 0) {
|
|
|
|
set_bit(sector_nr, &stripe->csum_error_bitmap);
|
|
|
|
set_bit(sector_nr, &stripe->error_bitmap);
|
|
|
|
} else {
|
|
|
|
clear_bit(sector_nr, &stripe->csum_error_bitmap);
|
|
|
|
clear_bit(sector_nr, &stripe->error_bitmap);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Verify specified sectors of a stripe. */
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
static void scrub_verify_one_stripe(struct scrub_stripe *stripe, unsigned long bitmap)
|
2023-03-20 10:12:53 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
const u32 sectors_per_tree = fs_info->nodesize >> fs_info->sectorsize_bits;
|
|
|
|
int sector_nr;
|
|
|
|
|
|
|
|
for_each_set_bit(sector_nr, &bitmap, stripe->nr_sectors) {
|
|
|
|
scrub_verify_one_sector(stripe, sector_nr);
|
|
|
|
if (stripe->sectors[sector_nr].is_metadata)
|
|
|
|
sector_nr += sectors_per_tree - 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
static int calc_sector_number(struct scrub_stripe *stripe, struct bio_vec *first_bvec)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < stripe->nr_sectors; i++) {
|
|
|
|
if (scrub_stripe_get_page(stripe, i) == first_bvec->bv_page &&
|
|
|
|
scrub_stripe_get_page_offset(stripe, i) == first_bvec->bv_offset)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ASSERT(i < stripe->nr_sectors);
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Repair read is different to the regular read:
|
|
|
|
*
|
|
|
|
* - Only reads the failed sectors
|
|
|
|
* - May have extra blocksize limits
|
|
|
|
*/
|
|
|
|
static void scrub_repair_read_endio(struct btrfs_bio *bbio)
|
|
|
|
{
|
|
|
|
struct scrub_stripe *stripe = bbio->private;
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
struct bio_vec *bvec;
|
|
|
|
int sector_nr = calc_sector_number(stripe, bio_first_bvec_all(&bbio->bio));
|
|
|
|
u32 bio_size = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ASSERT(sector_nr < stripe->nr_sectors);
|
|
|
|
|
|
|
|
bio_for_each_bvec_all(bvec, &bbio->bio, i)
|
|
|
|
bio_size += bvec->bv_len;
|
|
|
|
|
|
|
|
if (bbio->bio.bi_status) {
|
|
|
|
bitmap_set(&stripe->io_error_bitmap, sector_nr,
|
|
|
|
bio_size >> fs_info->sectorsize_bits);
|
|
|
|
bitmap_set(&stripe->error_bitmap, sector_nr,
|
|
|
|
bio_size >> fs_info->sectorsize_bits);
|
|
|
|
} else {
|
|
|
|
bitmap_clear(&stripe->io_error_bitmap, sector_nr,
|
|
|
|
bio_size >> fs_info->sectorsize_bits);
|
|
|
|
}
|
|
|
|
bio_put(&bbio->bio);
|
|
|
|
if (atomic_dec_and_test(&stripe->pending_io))
|
|
|
|
wake_up(&stripe->io_wait);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int calc_next_mirror(int mirror, int num_copies)
|
|
|
|
{
|
|
|
|
ASSERT(mirror <= num_copies);
|
|
|
|
return (mirror + 1 > num_copies) ? 1 : mirror + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
|
|
|
|
int mirror, int blocksize, bool wait)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
struct btrfs_bio *bbio = NULL;
|
|
|
|
const unsigned long old_error_bitmap = stripe->error_bitmap;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ASSERT(stripe->mirror_num >= 1);
|
|
|
|
ASSERT(atomic_read(&stripe->pending_io) == 0);
|
|
|
|
|
|
|
|
for_each_set_bit(i, &old_error_bitmap, stripe->nr_sectors) {
|
|
|
|
struct page *page;
|
|
|
|
int pgoff;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
page = scrub_stripe_get_page(stripe, i);
|
|
|
|
pgoff = scrub_stripe_get_page_offset(stripe, i);
|
|
|
|
|
|
|
|
/* The current sector cannot be merged, submit the bio. */
|
|
|
|
if (bbio && ((i > 0 && !test_bit(i - 1, &stripe->error_bitmap)) ||
|
|
|
|
bbio->bio.bi_iter.bi_size >= blocksize)) {
|
|
|
|
ASSERT(bbio->bio.bi_iter.bi_size);
|
|
|
|
atomic_inc(&stripe->pending_io);
|
|
|
|
btrfs_submit_bio(bbio, mirror);
|
|
|
|
if (wait)
|
|
|
|
wait_scrub_stripe_io(stripe);
|
|
|
|
bbio = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!bbio) {
|
|
|
|
bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
|
|
|
|
fs_info, scrub_repair_read_endio, stripe);
|
|
|
|
bbio->bio.bi_iter.bi_sector = (stripe->logical +
|
|
|
|
(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
|
|
|
|
ASSERT(ret == fs_info->sectorsize);
|
|
|
|
}
|
|
|
|
if (bbio) {
|
|
|
|
ASSERT(bbio->bio.bi_iter.bi_size);
|
|
|
|
atomic_inc(&stripe->pending_io);
|
|
|
|
btrfs_submit_bio(bbio, mirror);
|
|
|
|
if (wait)
|
|
|
|
wait_scrub_stripe_io(stripe);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:56 +08:00
|
|
|
static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
|
|
|
|
struct scrub_stripe *stripe)
|
|
|
|
{
|
|
|
|
static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
|
|
|
|
DEFAULT_RATELIMIT_BURST);
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
struct btrfs_device *dev = NULL;
|
|
|
|
u64 physical = 0;
|
|
|
|
int nr_data_sectors = 0;
|
|
|
|
int nr_meta_sectors = 0;
|
|
|
|
int nr_nodatacsum_sectors = 0;
|
|
|
|
int nr_repaired_sectors = 0;
|
|
|
|
int sector_nr;
|
|
|
|
|
2023-03-28 16:57:35 +08:00
|
|
|
if (test_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state))
|
|
|
|
return;
|
|
|
|
|
2023-03-20 10:12:56 +08:00
|
|
|
/*
|
|
|
|
* Init needed infos for error reporting.
|
|
|
|
*
|
2023-12-06 02:26:39 +08:00
|
|
|
* Although our scrub_stripe infrastructure is mostly based on btrfs_submit_bio()
|
2023-03-20 10:12:56 +08:00
|
|
|
* thus no need for dev/physical, error reporting still needs dev and physical.
|
|
|
|
*/
|
|
|
|
if (!bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors)) {
|
|
|
|
u64 mapped_len = fs_info->sectorsize;
|
|
|
|
struct btrfs_io_context *bioc = NULL;
|
|
|
|
int stripe_index = stripe->mirror_num - 1;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* For scrub, our mirror_num should always start at 1. */
|
|
|
|
ASSERT(stripe->mirror_num >= 1);
|
2023-05-31 12:17:38 +08:00
|
|
|
ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
|
|
|
|
stripe->logical, &mapped_len, &bioc,
|
2023-09-17 18:06:21 +08:00
|
|
|
NULL, NULL);
|
2023-03-20 10:12:56 +08:00
|
|
|
/*
|
|
|
|
* If we failed, dev will be NULL, and later detailed reports
|
|
|
|
* will just be skipped.
|
|
|
|
*/
|
|
|
|
if (ret < 0)
|
|
|
|
goto skip;
|
|
|
|
physical = bioc->stripes[stripe_index].physical;
|
|
|
|
dev = bioc->stripes[stripe_index].dev;
|
|
|
|
btrfs_put_bioc(bioc);
|
|
|
|
}
|
|
|
|
|
|
|
|
skip:
|
|
|
|
for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
|
|
|
|
bool repaired = false;
|
|
|
|
|
|
|
|
if (stripe->sectors[sector_nr].is_metadata) {
|
|
|
|
nr_meta_sectors++;
|
|
|
|
} else {
|
|
|
|
nr_data_sectors++;
|
|
|
|
if (!stripe->sectors[sector_nr].csum)
|
|
|
|
nr_nodatacsum_sectors++;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (test_bit(sector_nr, &stripe->init_error_bitmap) &&
|
|
|
|
!test_bit(sector_nr, &stripe->error_bitmap)) {
|
|
|
|
nr_repaired_sectors++;
|
|
|
|
repaired = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Good sector from the beginning, nothing need to be done. */
|
|
|
|
if (!test_bit(sector_nr, &stripe->init_error_bitmap))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Report error for the corrupted sectors. If repaired, just
|
|
|
|
* output the message of repaired message.
|
|
|
|
*/
|
|
|
|
if (repaired) {
|
|
|
|
if (dev) {
|
|
|
|
btrfs_err_rl_in_rcu(fs_info,
|
|
|
|
"fixed up error at logical %llu on dev %s physical %llu",
|
|
|
|
stripe->logical, btrfs_dev_name(dev),
|
|
|
|
physical);
|
|
|
|
} else {
|
|
|
|
btrfs_err_rl_in_rcu(fs_info,
|
|
|
|
"fixed up error at logical %llu on mirror %u",
|
|
|
|
stripe->logical, stripe->mirror_num);
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The remaining are all for unrepaired. */
|
|
|
|
if (dev) {
|
|
|
|
btrfs_err_rl_in_rcu(fs_info,
|
|
|
|
"unable to fixup (regular) error at logical %llu on dev %s physical %llu",
|
|
|
|
stripe->logical, btrfs_dev_name(dev),
|
|
|
|
physical);
|
|
|
|
} else {
|
|
|
|
btrfs_err_rl_in_rcu(fs_info,
|
|
|
|
"unable to fixup (regular) error at logical %llu on mirror %u",
|
|
|
|
stripe->logical, stripe->mirror_num);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (test_bit(sector_nr, &stripe->io_error_bitmap))
|
|
|
|
if (__ratelimit(&rs) && dev)
|
|
|
|
scrub_print_common_warning("i/o error", dev, false,
|
|
|
|
stripe->logical, physical);
|
|
|
|
if (test_bit(sector_nr, &stripe->csum_error_bitmap))
|
|
|
|
if (__ratelimit(&rs) && dev)
|
|
|
|
scrub_print_common_warning("checksum error", dev, false,
|
|
|
|
stripe->logical, physical);
|
|
|
|
if (test_bit(sector_nr, &stripe->meta_error_bitmap))
|
|
|
|
if (__ratelimit(&rs) && dev)
|
|
|
|
scrub_print_common_warning("header error", dev, false,
|
|
|
|
stripe->logical, physical);
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock(&sctx->stat_lock);
|
|
|
|
sctx->stat.data_extents_scrubbed += stripe->nr_data_extents;
|
|
|
|
sctx->stat.tree_extents_scrubbed += stripe->nr_meta_extents;
|
|
|
|
sctx->stat.data_bytes_scrubbed += nr_data_sectors << fs_info->sectorsize_bits;
|
|
|
|
sctx->stat.tree_bytes_scrubbed += nr_meta_sectors << fs_info->sectorsize_bits;
|
|
|
|
sctx->stat.no_csum += nr_nodatacsum_sectors;
|
btrfs: scrub: also report errors hit during the initial read
[BUG]
After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:
# mkfs.btrfs -f $dev -d DUP
# mount $dev $mnt
# xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
# umount $dev
# xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
# mount $dev $mnt
# btrfs scrub start -BR $mnt
scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
Scrub started: Tue Jun 6 14:56:50 2023
Status: finished
Duration: 0:00:00
data_extents_scrubbed: 2
tree_extents_scrubbed: 18
data_bytes_scrubbed: 131072
tree_bytes_scrubbed: 294912
read_errors: 0
csum_errors: 0 <<< No errors here
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<< Only corrected errors
last_physical: 2723151872
This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.
While on v6.3.x kernels, the report is different:
csum_errors: 16 <<<
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<<
[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.
For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.
Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.
To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.
Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06 15:08:28 +08:00
|
|
|
sctx->stat.read_errors += stripe->init_nr_io_errors;
|
|
|
|
sctx->stat.csum_errors += stripe->init_nr_csum_errors;
|
|
|
|
sctx->stat.verify_errors += stripe->init_nr_meta_errors;
|
2023-03-20 10:12:56 +08:00
|
|
|
sctx->stat.uncorrectable_errors +=
|
|
|
|
bitmap_weight(&stripe->error_bitmap, stripe->nr_sectors);
|
|
|
|
sctx->stat.corrected_errors += nr_repaired_sectors;
|
|
|
|
spin_unlock(&sctx->stat_lock);
|
|
|
|
}
|
|
|
|
|
2023-08-03 14:33:33 +08:00
|
|
|
static void scrub_write_sectors(struct scrub_ctx *sctx, struct scrub_stripe *stripe,
|
|
|
|
unsigned long write_bitmap, bool dev_replace);
|
|
|
|
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
/*
|
|
|
|
* The main entrance for all read related scrub work, including:
|
|
|
|
*
|
|
|
|
* - Wait for the initial read to finish
|
|
|
|
* - Verify and locate any bad sectors
|
|
|
|
* - Go through the remaining mirrors and try to read as large blocksize as
|
|
|
|
* possible
|
|
|
|
* - Go through all mirrors (including the failed mirror) sector-by-sector
|
2023-08-03 14:33:33 +08:00
|
|
|
* - Submit writeback for repaired sectors
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
*
|
2023-08-03 14:33:33 +08:00
|
|
|
* Writeback for dev-replace does not happen here, it needs extra
|
|
|
|
* synchronization for zoned devices.
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
*/
|
|
|
|
static void scrub_stripe_read_repair_worker(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct scrub_stripe *stripe = container_of(work, struct scrub_stripe, work);
|
2023-08-03 14:33:33 +08:00
|
|
|
struct scrub_ctx *sctx = stripe->sctx;
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
int num_copies = btrfs_num_copies(fs_info, stripe->bg->start,
|
|
|
|
stripe->bg->length);
|
2024-04-09 22:18:52 +08:00
|
|
|
unsigned long repaired;
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
int mirror;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ASSERT(stripe->mirror_num > 0);
|
|
|
|
|
|
|
|
wait_scrub_stripe_io(stripe);
|
|
|
|
scrub_verify_one_stripe(stripe, stripe->extent_sector_bitmap);
|
|
|
|
/* Save the initial failed bitmap for later repair and report usage. */
|
|
|
|
stripe->init_error_bitmap = stripe->error_bitmap;
|
btrfs: scrub: also report errors hit during the initial read
[BUG]
After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:
# mkfs.btrfs -f $dev -d DUP
# mount $dev $mnt
# xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
# umount $dev
# xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
# mount $dev $mnt
# btrfs scrub start -BR $mnt
scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
Scrub started: Tue Jun 6 14:56:50 2023
Status: finished
Duration: 0:00:00
data_extents_scrubbed: 2
tree_extents_scrubbed: 18
data_bytes_scrubbed: 131072
tree_bytes_scrubbed: 294912
read_errors: 0
csum_errors: 0 <<< No errors here
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<< Only corrected errors
last_physical: 2723151872
This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.
While on v6.3.x kernels, the report is different:
csum_errors: 16 <<<
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<<
[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.
For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.
Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.
To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.
Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06 15:08:28 +08:00
|
|
|
stripe->init_nr_io_errors = bitmap_weight(&stripe->io_error_bitmap,
|
|
|
|
stripe->nr_sectors);
|
|
|
|
stripe->init_nr_csum_errors = bitmap_weight(&stripe->csum_error_bitmap,
|
|
|
|
stripe->nr_sectors);
|
|
|
|
stripe->init_nr_meta_errors = bitmap_weight(&stripe->meta_error_bitmap,
|
|
|
|
stripe->nr_sectors);
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
|
|
|
|
if (bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try all remaining mirrors.
|
|
|
|
*
|
|
|
|
* Here we still try to read as large block as possible, as this is
|
|
|
|
* faster and we have extra safety nets to rely on.
|
|
|
|
*/
|
|
|
|
for (mirror = calc_next_mirror(stripe->mirror_num, num_copies);
|
|
|
|
mirror != stripe->mirror_num;
|
|
|
|
mirror = calc_next_mirror(mirror, num_copies)) {
|
|
|
|
const unsigned long old_error_bitmap = stripe->error_bitmap;
|
|
|
|
|
|
|
|
scrub_stripe_submit_repair_read(stripe, mirror,
|
|
|
|
BTRFS_STRIPE_LEN, false);
|
|
|
|
wait_scrub_stripe_io(stripe);
|
|
|
|
scrub_verify_one_stripe(stripe, old_error_bitmap);
|
|
|
|
if (bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors))
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Last safety net, try re-checking all mirrors, including the failed
|
|
|
|
* one, sector-by-sector.
|
|
|
|
*
|
|
|
|
* As if one sector failed the drive's internal csum, the whole read
|
|
|
|
* containing the offending sector would be marked as error.
|
|
|
|
* Thus here we do sector-by-sector read.
|
|
|
|
*
|
|
|
|
* This can be slow, thus we only try it as the last resort.
|
|
|
|
*/
|
|
|
|
|
|
|
|
for (i = 0, mirror = stripe->mirror_num;
|
|
|
|
i < num_copies;
|
|
|
|
i++, mirror = calc_next_mirror(mirror, num_copies)) {
|
|
|
|
const unsigned long old_error_bitmap = stripe->error_bitmap;
|
|
|
|
|
|
|
|
scrub_stripe_submit_repair_read(stripe, mirror,
|
|
|
|
fs_info->sectorsize, true);
|
|
|
|
wait_scrub_stripe_io(stripe);
|
|
|
|
scrub_verify_one_stripe(stripe, old_error_bitmap);
|
|
|
|
if (bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors))
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
out:
|
2023-08-03 14:33:33 +08:00
|
|
|
/*
|
|
|
|
* Submit the repaired sectors. For zoned case, we cannot do repair
|
|
|
|
* in-place, but queue the bg to be relocated.
|
|
|
|
*/
|
2024-04-09 22:18:52 +08:00
|
|
|
bitmap_andnot(&repaired, &stripe->init_error_bitmap, &stripe->error_bitmap,
|
|
|
|
stripe->nr_sectors);
|
|
|
|
if (!sctx->readonly && !bitmap_empty(&repaired, stripe->nr_sectors)) {
|
|
|
|
if (btrfs_is_zoned(fs_info)) {
|
2023-08-03 14:33:33 +08:00
|
|
|
btrfs_repair_one_zone(fs_info, sctx->stripes[0].bg->start);
|
2024-04-09 22:18:52 +08:00
|
|
|
} else {
|
|
|
|
scrub_write_sectors(sctx, stripe, repaired, false);
|
|
|
|
wait_scrub_stripe_io(stripe);
|
|
|
|
}
|
2023-08-03 14:33:33 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
scrub_stripe_report_errors(sctx, stripe);
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
set_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state);
|
|
|
|
wake_up(&stripe->repair_wait);
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:57 +08:00
|
|
|
static void scrub_read_endio(struct btrfs_bio *bbio)
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
{
|
|
|
|
struct scrub_stripe *stripe = bbio->private;
|
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:
- "unable to find chunk map" errors
BTRFS info (device vdb): scrub: started on devid 1
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
This would lead to unrepariable errors.
- Use-after-free KASAN reports:
==================================================================
BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
Read of size 8 at addr ffff8881013c9040 by task btrfs/909
CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x43/0x60
print_report+0xcf/0x640
kasan_report+0xa6/0xd0
__blk_rq_map_sg+0x18f/0x7c0
virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
blk_mq_flush_plug_list.part.0+0x780/0x860
__blk_flush_plug+0x1ba/0x220
blk_finish_plug+0x3b/0x60
submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
__x64_sys_ioctl+0xbd/0x100
do_syscall_64+0x5d/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f47e5e0952b
- Crash, mostly due to above use-after-free
[CAUSE]
The converted fs has the following data chunk layout:
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
length 86016 owner 2 stripe_len 65536 type DATA|single
For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).
This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.
However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond chunk boundary.
- Make sure scrub_submit_initial_read() only to read the chunk range
This is done by calculating the real number of sectors we need to
read, and add sector-by-sector to the bio.
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With above fixes, we won't update error bit for range beyond chunk,
thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-17 08:32:25 +08:00
|
|
|
struct bio_vec *bvec;
|
|
|
|
int sector_nr = calc_sector_number(stripe, bio_first_bvec_all(&bbio->bio));
|
|
|
|
int num_sectors;
|
|
|
|
u32 bio_size = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ASSERT(sector_nr < stripe->nr_sectors);
|
|
|
|
bio_for_each_bvec_all(bvec, &bbio->bio, i)
|
|
|
|
bio_size += bvec->bv_len;
|
|
|
|
num_sectors = bio_size >> stripe->bg->fs_info->sectorsize_bits;
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
|
|
|
|
if (bbio->bio.bi_status) {
|
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:
- "unable to find chunk map" errors
BTRFS info (device vdb): scrub: started on devid 1
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
This would lead to unrepariable errors.
- Use-after-free KASAN reports:
==================================================================
BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
Read of size 8 at addr ffff8881013c9040 by task btrfs/909
CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x43/0x60
print_report+0xcf/0x640
kasan_report+0xa6/0xd0
__blk_rq_map_sg+0x18f/0x7c0
virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
blk_mq_flush_plug_list.part.0+0x780/0x860
__blk_flush_plug+0x1ba/0x220
blk_finish_plug+0x3b/0x60
submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
__x64_sys_ioctl+0xbd/0x100
do_syscall_64+0x5d/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f47e5e0952b
- Crash, mostly due to above use-after-free
[CAUSE]
The converted fs has the following data chunk layout:
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
length 86016 owner 2 stripe_len 65536 type DATA|single
For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).
This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.
However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond chunk boundary.
- Make sure scrub_submit_initial_read() only to read the chunk range
This is done by calculating the real number of sectors we need to
read, and add sector-by-sector to the bio.
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With above fixes, we won't update error bit for range beyond chunk,
thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-17 08:32:25 +08:00
|
|
|
bitmap_set(&stripe->io_error_bitmap, sector_nr, num_sectors);
|
|
|
|
bitmap_set(&stripe->error_bitmap, sector_nr, num_sectors);
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
} else {
|
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:
- "unable to find chunk map" errors
BTRFS info (device vdb): scrub: started on devid 1
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
This would lead to unrepariable errors.
- Use-after-free KASAN reports:
==================================================================
BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
Read of size 8 at addr ffff8881013c9040 by task btrfs/909
CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x43/0x60
print_report+0xcf/0x640
kasan_report+0xa6/0xd0
__blk_rq_map_sg+0x18f/0x7c0
virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
blk_mq_flush_plug_list.part.0+0x780/0x860
__blk_flush_plug+0x1ba/0x220
blk_finish_plug+0x3b/0x60
submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
__x64_sys_ioctl+0xbd/0x100
do_syscall_64+0x5d/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f47e5e0952b
- Crash, mostly due to above use-after-free
[CAUSE]
The converted fs has the following data chunk layout:
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
length 86016 owner 2 stripe_len 65536 type DATA|single
For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).
This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.
However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond chunk boundary.
- Make sure scrub_submit_initial_read() only to read the chunk range
This is done by calculating the real number of sectors we need to
read, and add sector-by-sector to the bio.
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With above fixes, we won't update error bit for range beyond chunk,
thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-17 08:32:25 +08:00
|
|
|
bitmap_clear(&stripe->io_error_bitmap, sector_nr, num_sectors);
|
btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-20 10:12:54 +08:00
|
|
|
}
|
|
|
|
bio_put(&bbio->bio);
|
|
|
|
if (atomic_dec_and_test(&stripe->pending_io)) {
|
|
|
|
wake_up(&stripe->io_wait);
|
|
|
|
INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
|
|
|
|
queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:55 +08:00
|
|
|
static void scrub_write_endio(struct btrfs_bio *bbio)
|
|
|
|
{
|
|
|
|
struct scrub_stripe *stripe = bbio->private;
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
struct bio_vec *bvec;
|
|
|
|
int sector_nr = calc_sector_number(stripe, bio_first_bvec_all(&bbio->bio));
|
|
|
|
u32 bio_size = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
bio_for_each_bvec_all(bvec, &bbio->bio, i)
|
|
|
|
bio_size += bvec->bv_len;
|
|
|
|
|
|
|
|
if (bbio->bio.bi_status) {
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&stripe->write_error_lock, flags);
|
|
|
|
bitmap_set(&stripe->write_error_bitmap, sector_nr,
|
|
|
|
bio_size >> fs_info->sectorsize_bits);
|
|
|
|
spin_unlock_irqrestore(&stripe->write_error_lock, flags);
|
|
|
|
}
|
|
|
|
bio_put(&bbio->bio);
|
|
|
|
|
|
|
|
if (atomic_dec_and_test(&stripe->pending_io))
|
|
|
|
wake_up(&stripe->io_wait);
|
|
|
|
}
|
|
|
|
|
btrfs: zoned: fix dev-replace after the scrub rework
[BUG]
After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer works for zoned device
at all.
Even an empty zoned btrfs cannot be replaced:
# mkfs.btrfs -f /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/btrfs
# btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
Resetting device zones /dev/nvme1n1 (160 zones) ...
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
And we can hit kernel crash related to that:
BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
BUG: kernel NULL pointer dereference, address: 00000000000000a8
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
Call Trace:
<IRQ>
btrfs_lookup_ordered_extent+0x31/0x190
btrfs_record_physical_zoned+0x18/0x40
btrfs_simple_end_io+0xaf/0xc0
blk_update_request+0x153/0x4c0
blk_mq_end_request+0x15/0xd0
nvme_poll_cq+0x1d3/0x360
nvme_irq+0x39/0x80
__handle_irq_event_percpu+0x3b/0x190
handle_irq_event+0x2f/0x70
handle_edge_irq+0x7c/0x210
__common_interrupt+0x34/0xa0
common_interrupt+0x7d/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
[CAUSE]
Dev-replace reuses scrub code to iterate all extents and write the
existing content back to the new device.
And for zoned devices, we call fill_writer_pointer_gap() to make sure
all the writes into the zoned device is sequential, even if there may be
some gaps between the writes.
However we have several different bugs all related to zoned dev-replace:
- We are using ZONE_APPEND operation for metadata style write back
For zoned devices, btrfs has two ways to write data:
* ZONE_APPEND for data
This allows higher queue depth, but will not be able to know where
the write would land.
Thus needs to grab the real on-disk physical location in it's endio.
* WRITE for metadata
This requires single queue depth (new writes can only be submitted
after previous one finished), and all writes must be sequential.
For scrub, we go single queue depth, but still goes with ZONE_APPEND,
which requires btrfs_bio::inode being populated.
This is the cause of that crash.
- No correct tracing of write_pointer
After a write finished, we should forward sctx->write_pointer, or
fill_writer_pointer_gap() would not work properly and cause more
than necessary zero out, and fill the whole zone prematurely.
- Incorrect physical bytenr passed to fill_writer_pointer_gap()
In scrub_write_sectors(), one call site passes logical address, which
is completely wrong.
The other call site passes physical address of current sector, but
we should pass the physical address of the btrfs_bio we're submitting.
This is the cause of the -EIO errors.
[FIX]
- Do not use ZONE_APPEND for btrfs_submit_repair_write().
- Manually forward sctx->write_pointer after successful writeback
- Use the physical address of the to-be-submitted btrfs_bio for
fill_writer_pointer_gap()
Now zoned device replace would work as expected.
Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-01 18:51:34 +08:00
|
|
|
static void scrub_submit_write_bio(struct scrub_ctx *sctx,
|
|
|
|
struct scrub_stripe *stripe,
|
|
|
|
struct btrfs_bio *bbio, bool dev_replace)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
u32 bio_len = bbio->bio.bi_iter.bi_size;
|
|
|
|
u32 bio_off = (bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT) -
|
|
|
|
stripe->logical;
|
|
|
|
|
|
|
|
fill_writer_pointer_gap(sctx, stripe->physical + bio_off);
|
|
|
|
atomic_inc(&stripe->pending_io);
|
|
|
|
btrfs_submit_repair_write(bbio, stripe->mirror_num, dev_replace);
|
|
|
|
if (!btrfs_is_zoned(fs_info))
|
|
|
|
return;
|
|
|
|
/*
|
|
|
|
* For zoned writeback, queue depth must be 1, thus we must wait for
|
|
|
|
* the write to finish before the next write.
|
|
|
|
*/
|
|
|
|
wait_scrub_stripe_io(stripe);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* And also need to update the write pointer if write finished
|
|
|
|
* successfully.
|
|
|
|
*/
|
|
|
|
if (!test_bit(bio_off >> fs_info->sectorsize_bits,
|
|
|
|
&stripe->write_error_bitmap))
|
|
|
|
sctx->write_pointer += bio_len;
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:55 +08:00
|
|
|
/*
|
|
|
|
* Submit the write bio(s) for the sectors specified by @write_bitmap.
|
|
|
|
*
|
|
|
|
* Here we utilize btrfs_submit_repair_write(), which has some extra benefits:
|
|
|
|
*
|
|
|
|
* - Only needs logical bytenr and mirror_num
|
|
|
|
* Just like the scrub read path
|
|
|
|
*
|
|
|
|
* - Would only result in writes to the specified mirror
|
|
|
|
* Unlike the regular writeback path, which would write back to all stripes
|
|
|
|
*
|
|
|
|
* - Handle dev-replace and read-repair writeback differently
|
|
|
|
*/
|
2023-03-20 10:12:57 +08:00
|
|
|
static void scrub_write_sectors(struct scrub_ctx *sctx, struct scrub_stripe *stripe,
|
|
|
|
unsigned long write_bitmap, bool dev_replace)
|
2023-03-20 10:12:55 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
struct btrfs_bio *bbio = NULL;
|
|
|
|
int sector_nr;
|
|
|
|
|
|
|
|
for_each_set_bit(sector_nr, &write_bitmap, stripe->nr_sectors) {
|
|
|
|
struct page *page = scrub_stripe_get_page(stripe, sector_nr);
|
|
|
|
unsigned int pgoff = scrub_stripe_get_page_offset(stripe, sector_nr);
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* We should only writeback sectors covered by an extent. */
|
|
|
|
ASSERT(test_bit(sector_nr, &stripe->extent_sector_bitmap));
|
|
|
|
|
|
|
|
/* Cannot merge with previous sector, submit the current one. */
|
|
|
|
if (bbio && sector_nr && !test_bit(sector_nr - 1, &write_bitmap)) {
|
btrfs: zoned: fix dev-replace after the scrub rework
[BUG]
After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer works for zoned device
at all.
Even an empty zoned btrfs cannot be replaced:
# mkfs.btrfs -f /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/btrfs
# btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
Resetting device zones /dev/nvme1n1 (160 zones) ...
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
And we can hit kernel crash related to that:
BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
BUG: kernel NULL pointer dereference, address: 00000000000000a8
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
Call Trace:
<IRQ>
btrfs_lookup_ordered_extent+0x31/0x190
btrfs_record_physical_zoned+0x18/0x40
btrfs_simple_end_io+0xaf/0xc0
blk_update_request+0x153/0x4c0
blk_mq_end_request+0x15/0xd0
nvme_poll_cq+0x1d3/0x360
nvme_irq+0x39/0x80
__handle_irq_event_percpu+0x3b/0x190
handle_irq_event+0x2f/0x70
handle_edge_irq+0x7c/0x210
__common_interrupt+0x34/0xa0
common_interrupt+0x7d/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
[CAUSE]
Dev-replace reuses scrub code to iterate all extents and write the
existing content back to the new device.
And for zoned devices, we call fill_writer_pointer_gap() to make sure
all the writes into the zoned device is sequential, even if there may be
some gaps between the writes.
However we have several different bugs all related to zoned dev-replace:
- We are using ZONE_APPEND operation for metadata style write back
For zoned devices, btrfs has two ways to write data:
* ZONE_APPEND for data
This allows higher queue depth, but will not be able to know where
the write would land.
Thus needs to grab the real on-disk physical location in it's endio.
* WRITE for metadata
This requires single queue depth (new writes can only be submitted
after previous one finished), and all writes must be sequential.
For scrub, we go single queue depth, but still goes with ZONE_APPEND,
which requires btrfs_bio::inode being populated.
This is the cause of that crash.
- No correct tracing of write_pointer
After a write finished, we should forward sctx->write_pointer, or
fill_writer_pointer_gap() would not work properly and cause more
than necessary zero out, and fill the whole zone prematurely.
- Incorrect physical bytenr passed to fill_writer_pointer_gap()
In scrub_write_sectors(), one call site passes logical address, which
is completely wrong.
The other call site passes physical address of current sector, but
we should pass the physical address of the btrfs_bio we're submitting.
This is the cause of the -EIO errors.
[FIX]
- Do not use ZONE_APPEND for btrfs_submit_repair_write().
- Manually forward sctx->write_pointer after successful writeback
- Use the physical address of the to-be-submitted btrfs_bio for
fill_writer_pointer_gap()
Now zoned device replace would work as expected.
Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-01 18:51:34 +08:00
|
|
|
scrub_submit_write_bio(sctx, stripe, bbio, dev_replace);
|
2023-03-20 10:12:55 +08:00
|
|
|
bbio = NULL;
|
|
|
|
}
|
|
|
|
if (!bbio) {
|
|
|
|
bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_WRITE,
|
|
|
|
fs_info, scrub_write_endio, stripe);
|
|
|
|
bbio->bio.bi_iter.bi_sector = (stripe->logical +
|
|
|
|
(sector_nr << fs_info->sectorsize_bits)) >>
|
|
|
|
SECTOR_SHIFT;
|
|
|
|
}
|
|
|
|
ret = bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
|
|
|
|
ASSERT(ret == fs_info->sectorsize);
|
|
|
|
}
|
btrfs: zoned: fix dev-replace after the scrub rework
[BUG]
After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer works for zoned device
at all.
Even an empty zoned btrfs cannot be replaced:
# mkfs.btrfs -f /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/btrfs
# btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
Resetting device zones /dev/nvme1n1 (160 zones) ...
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
And we can hit kernel crash related to that:
BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
BUG: kernel NULL pointer dereference, address: 00000000000000a8
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
Call Trace:
<IRQ>
btrfs_lookup_ordered_extent+0x31/0x190
btrfs_record_physical_zoned+0x18/0x40
btrfs_simple_end_io+0xaf/0xc0
blk_update_request+0x153/0x4c0
blk_mq_end_request+0x15/0xd0
nvme_poll_cq+0x1d3/0x360
nvme_irq+0x39/0x80
__handle_irq_event_percpu+0x3b/0x190
handle_irq_event+0x2f/0x70
handle_edge_irq+0x7c/0x210
__common_interrupt+0x34/0xa0
common_interrupt+0x7d/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
[CAUSE]
Dev-replace reuses scrub code to iterate all extents and write the
existing content back to the new device.
And for zoned devices, we call fill_writer_pointer_gap() to make sure
all the writes into the zoned device is sequential, even if there may be
some gaps between the writes.
However we have several different bugs all related to zoned dev-replace:
- We are using ZONE_APPEND operation for metadata style write back
For zoned devices, btrfs has two ways to write data:
* ZONE_APPEND for data
This allows higher queue depth, but will not be able to know where
the write would land.
Thus needs to grab the real on-disk physical location in it's endio.
* WRITE for metadata
This requires single queue depth (new writes can only be submitted
after previous one finished), and all writes must be sequential.
For scrub, we go single queue depth, but still goes with ZONE_APPEND,
which requires btrfs_bio::inode being populated.
This is the cause of that crash.
- No correct tracing of write_pointer
After a write finished, we should forward sctx->write_pointer, or
fill_writer_pointer_gap() would not work properly and cause more
than necessary zero out, and fill the whole zone prematurely.
- Incorrect physical bytenr passed to fill_writer_pointer_gap()
In scrub_write_sectors(), one call site passes logical address, which
is completely wrong.
The other call site passes physical address of current sector, but
we should pass the physical address of the btrfs_bio we're submitting.
This is the cause of the -EIO errors.
[FIX]
- Do not use ZONE_APPEND for btrfs_submit_repair_write().
- Manually forward sctx->write_pointer after successful writeback
- Use the physical address of the to-be-submitted btrfs_bio for
fill_writer_pointer_gap()
Now zoned device replace would work as expected.
Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-01 18:51:34 +08:00
|
|
|
if (bbio)
|
|
|
|
scrub_submit_write_bio(sctx, stripe, bbio, dev_replace);
|
2023-03-20 10:12:55 +08:00
|
|
|
}
|
|
|
|
|
2023-03-29 14:54:57 +08:00
|
|
|
/*
|
|
|
|
* Throttling of IO submission, bandwidth-limit based, the timeslice is 1
|
|
|
|
* second. Limit can be set via /sys/fs/UUID/devinfo/devid/scrub_speed_max.
|
|
|
|
*/
|
2023-03-20 10:12:58 +08:00
|
|
|
static void scrub_throttle_dev_io(struct scrub_ctx *sctx, struct btrfs_device *device,
|
|
|
|
unsigned int bio_size)
|
2019-10-09 19:58:13 +08:00
|
|
|
{
|
|
|
|
const int time_slice = 1000;
|
|
|
|
s64 delta;
|
|
|
|
ktime_t now;
|
|
|
|
u32 div;
|
|
|
|
u64 bwlimit;
|
|
|
|
|
|
|
|
bwlimit = READ_ONCE(device->scrub_speed_max);
|
|
|
|
if (bwlimit == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Slice is divided into intervals when the IO is submitted, adjust by
|
|
|
|
* bwlimit and maximum of 64 intervals.
|
|
|
|
*/
|
|
|
|
div = max_t(u32, 1, (u32)(bwlimit / (16 * 1024 * 1024)));
|
|
|
|
div = min_t(u32, 64, div);
|
|
|
|
|
|
|
|
/* Start new epoch, set deadline */
|
|
|
|
now = ktime_get();
|
|
|
|
if (sctx->throttle_deadline == 0) {
|
|
|
|
sctx->throttle_deadline = ktime_add_ms(now, time_slice / div);
|
|
|
|
sctx->throttle_sent = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Still in the time to send? */
|
|
|
|
if (ktime_before(now, sctx->throttle_deadline)) {
|
|
|
|
/* If current bio is within the limit, send it */
|
2023-03-20 10:12:58 +08:00
|
|
|
sctx->throttle_sent += bio_size;
|
2019-10-09 19:58:13 +08:00
|
|
|
if (sctx->throttle_sent <= div_u64(bwlimit, div))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* We're over the limit, sleep until the rest of the slice */
|
|
|
|
delta = ktime_ms_delta(sctx->throttle_deadline, now);
|
|
|
|
} else {
|
|
|
|
/* New request after deadline, start new epoch */
|
|
|
|
delta = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (delta) {
|
|
|
|
long timeout;
|
|
|
|
|
|
|
|
timeout = div_u64(delta * HZ, 1000);
|
|
|
|
schedule_timeout_interruptible(timeout);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Next call will start the deadline period */
|
|
|
|
sctx->throttle_deadline = 0;
|
|
|
|
}
|
|
|
|
|
2014-04-01 18:01:43 +08:00
|
|
|
/*
|
|
|
|
* Given a physical address, this will calculate it's
|
|
|
|
* logical offset. if this is a parity stripe, it will return
|
|
|
|
* the most left data stripe's logical offset.
|
|
|
|
*
|
|
|
|
* return 0 if it is a data stripe, 1 means parity stripe.
|
|
|
|
*/
|
|
|
|
static int get_raid56_logic_offset(u64 physical, int num,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map, u64 *offset,
|
Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksum in the same stripe.
All the data which has checksum is COW data, and we are sure
that it is not changed though we don't lock the stripe. because
the space of that data just can be reclaimed after the current
transction is committed, and then the fs can use it to store the
other data, but when doing scrub, we hold the current transaction,
that is that data can not be recovered, it is safe that read and check
it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
The data without checksum and the parity may be changed if we don't
lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
is not right
- Unlock the stripe
If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.
And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-11-06 17:20:58 +08:00
|
|
|
u64 *stripe_start)
|
2014-04-01 18:01:43 +08:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
int j = 0;
|
|
|
|
u64 last_offset;
|
2019-05-17 17:43:45 +08:00
|
|
|
const int data_stripes = nr_data_stripes(map);
|
2014-04-01 18:01:43 +08:00
|
|
|
|
2019-05-17 17:43:45 +08:00
|
|
|
last_offset = (physical - map->stripes[num].physical) * data_stripes;
|
Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksum in the same stripe.
All the data which has checksum is COW data, and we are sure
that it is not changed though we don't lock the stripe. because
the space of that data just can be reclaimed after the current
transction is committed, and then the fs can use it to store the
other data, but when doing scrub, we hold the current transaction,
that is that data can not be recovered, it is safe that read and check
it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
The data without checksum and the parity may be changed if we don't
lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
is not right
- Unlock the stripe
If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.
And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-11-06 17:20:58 +08:00
|
|
|
if (stripe_start)
|
|
|
|
*stripe_start = last_offset;
|
|
|
|
|
2014-04-01 18:01:43 +08:00
|
|
|
*offset = last_offset;
|
2019-05-17 17:43:45 +08:00
|
|
|
for (i = 0; i < data_stripes; i++) {
|
2023-02-17 13:36:59 +08:00
|
|
|
u32 stripe_nr;
|
|
|
|
u32 stripe_index;
|
|
|
|
u32 rot;
|
|
|
|
|
2023-06-22 14:42:40 +08:00
|
|
|
*offset = last_offset + btrfs_stripe_nr_to_offset(i);
|
2014-04-01 18:01:43 +08:00
|
|
|
|
2023-02-17 13:36:59 +08:00
|
|
|
stripe_nr = (u32)(*offset >> BTRFS_STRIPE_LEN_SHIFT) / data_stripes;
|
2014-04-01 18:01:43 +08:00
|
|
|
|
|
|
|
/* Work out the disk rotation on this stripe-set */
|
2023-02-17 13:36:59 +08:00
|
|
|
rot = stripe_nr % map->num_stripes;
|
2014-04-01 18:01:43 +08:00
|
|
|
/* calculate which stripe this data locates */
|
|
|
|
rot += i;
|
2014-04-11 18:32:25 +08:00
|
|
|
stripe_index = rot % map->num_stripes;
|
2014-04-01 18:01:43 +08:00
|
|
|
if (stripe_index == num)
|
|
|
|
return 0;
|
|
|
|
if (stripe_index < num)
|
|
|
|
j++;
|
|
|
|
}
|
2023-06-22 14:42:40 +08:00
|
|
|
*offset = last_offset + btrfs_stripe_nr_to_offset(j);
|
2014-04-01 18:01:43 +08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2022-03-11 15:38:42 +08:00
|
|
|
/*
|
|
|
|
* Return 0 if the extent item range covers any byte of the range.
|
|
|
|
* Return <0 if the extent item is before @search_start.
|
|
|
|
* Return >0 if the extent item is after @start_start + @search_len.
|
|
|
|
*/
|
|
|
|
static int compare_extent_item_range(struct btrfs_path *path,
|
|
|
|
u64 search_start, u64 search_len)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = path->nodes[0]->fs_info;
|
|
|
|
u64 len;
|
|
|
|
struct btrfs_key key;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
|
|
|
ASSERT(key.type == BTRFS_EXTENT_ITEM_KEY ||
|
|
|
|
key.type == BTRFS_METADATA_ITEM_KEY);
|
|
|
|
if (key.type == BTRFS_METADATA_ITEM_KEY)
|
|
|
|
len = fs_info->nodesize;
|
|
|
|
else
|
|
|
|
len = key.offset;
|
|
|
|
|
|
|
|
if (key.objectid + len <= search_start)
|
|
|
|
return -1;
|
|
|
|
if (key.objectid >= search_start + search_len)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Locate one extent item which covers any byte in range
|
|
|
|
* [@search_start, @search_start + @search_length)
|
|
|
|
*
|
|
|
|
* If the path is not initialized, we will initialize the search by doing
|
|
|
|
* a btrfs_search_slot().
|
|
|
|
* If the path is already initialized, we will use the path as the initial
|
|
|
|
* slot, to avoid duplicated btrfs_search_slot() calls.
|
|
|
|
*
|
|
|
|
* NOTE: If an extent item starts before @search_start, we will still
|
|
|
|
* return the extent item. This is for data extent crossing stripe boundary.
|
|
|
|
*
|
|
|
|
* Return 0 if we found such extent item, and @path will point to the extent item.
|
|
|
|
* Return >0 if no such extent item can be found, and @path will be released.
|
|
|
|
* Return <0 if hit fatal error, and @path will be released.
|
|
|
|
*/
|
|
|
|
static int find_first_extent_item(struct btrfs_root *extent_root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 search_start, u64 search_len)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = extent_root->fs_info;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* Continue using the existing path */
|
|
|
|
if (path->nodes[0])
|
|
|
|
goto search_forward;
|
|
|
|
|
|
|
|
if (btrfs_fs_incompat(fs_info, SKINNY_METADATA))
|
|
|
|
key.type = BTRFS_METADATA_ITEM_KEY;
|
|
|
|
else
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.objectid = search_start;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
2024-01-25 05:49:02 +08:00
|
|
|
if (ret == 0) {
|
|
|
|
/*
|
|
|
|
* Key with offset -1 found, there would have to exist an extent
|
|
|
|
* item with such offset, but this is out of the valid range.
|
|
|
|
*/
|
|
|
|
btrfs_release_path(path);
|
|
|
|
return -EUCLEAN;
|
|
|
|
}
|
2022-03-11 15:38:42 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Here we intentionally pass 0 as @min_objectid, as there could be
|
|
|
|
* an extent item starting before @search_start.
|
|
|
|
*/
|
|
|
|
ret = btrfs_previous_extent_item(extent_root, path, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
/*
|
|
|
|
* No matter whether we have found an extent item, the next loop will
|
|
|
|
* properly do every check on the key.
|
|
|
|
*/
|
|
|
|
search_forward:
|
|
|
|
while (true) {
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
|
|
|
if (key.objectid >= search_start + search_len)
|
|
|
|
break;
|
|
|
|
if (key.type != BTRFS_METADATA_ITEM_KEY &&
|
|
|
|
key.type != BTRFS_EXTENT_ITEM_KEY)
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
ret = compare_extent_item_range(path, search_start, search_len);
|
|
|
|
if (ret == 0)
|
|
|
|
return ret;
|
|
|
|
if (ret > 0)
|
|
|
|
break;
|
|
|
|
next:
|
2023-11-21 21:38:37 +08:00
|
|
|
ret = btrfs_next_item(extent_root, path);
|
|
|
|
if (ret) {
|
|
|
|
/* Either no more items or a fatal error. */
|
|
|
|
btrfs_release_path(path);
|
|
|
|
return ret;
|
2022-03-11 15:38:42 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
btrfs_release_path(path);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2022-03-11 15:38:43 +08:00
|
|
|
static void get_extent_info(struct btrfs_path *path, u64 *extent_start_ret,
|
|
|
|
u64 *size_ret, u64 *flags_ret, u64 *generation_ret)
|
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
|
|
|
|
ASSERT(key.type == BTRFS_METADATA_ITEM_KEY ||
|
|
|
|
key.type == BTRFS_EXTENT_ITEM_KEY);
|
|
|
|
*extent_start_ret = key.objectid;
|
|
|
|
if (key.type == BTRFS_METADATA_ITEM_KEY)
|
|
|
|
*size_ret = path->nodes[0]->fs_info->nodesize;
|
|
|
|
else
|
|
|
|
*size_ret = key.offset;
|
|
|
|
ei = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_extent_item);
|
|
|
|
*flags_ret = btrfs_extent_flags(path->nodes[0], ei);
|
|
|
|
*generation_ret = btrfs_extent_generation(path->nodes[0], ei);
|
|
|
|
}
|
|
|
|
|
2021-02-04 18:22:14 +08:00
|
|
|
static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical,
|
|
|
|
u64 physical, u64 physical_end)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (!btrfs_is_zoned(fs_info))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
mutex_lock(&sctx->wr_lock);
|
|
|
|
if (sctx->write_pointer < physical_end) {
|
|
|
|
ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical,
|
|
|
|
physical,
|
|
|
|
sctx->write_pointer);
|
|
|
|
if (ret)
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"zoned: failed to recover write pointer");
|
|
|
|
}
|
|
|
|
mutex_unlock(&sctx->wr_lock);
|
|
|
|
btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:51 +08:00
|
|
|
static void fill_one_extent_info(struct btrfs_fs_info *fs_info,
|
|
|
|
struct scrub_stripe *stripe,
|
|
|
|
u64 extent_start, u64 extent_len,
|
|
|
|
u64 extent_flags, u64 extent_gen)
|
|
|
|
{
|
|
|
|
for (u64 cur_logical = max(stripe->logical, extent_start);
|
|
|
|
cur_logical < min(stripe->logical + BTRFS_STRIPE_LEN,
|
|
|
|
extent_start + extent_len);
|
|
|
|
cur_logical += fs_info->sectorsize) {
|
|
|
|
const int nr_sector = (cur_logical - stripe->logical) >>
|
|
|
|
fs_info->sectorsize_bits;
|
|
|
|
struct scrub_sector_verification *sector =
|
|
|
|
&stripe->sectors[nr_sector];
|
|
|
|
|
|
|
|
set_bit(nr_sector, &stripe->extent_sector_bitmap);
|
|
|
|
if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
|
|
|
|
sector->is_metadata = true;
|
|
|
|
sector->generation = extent_gen;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void scrub_stripe_reset_bitmaps(struct scrub_stripe *stripe)
|
|
|
|
{
|
|
|
|
stripe->extent_sector_bitmap = 0;
|
|
|
|
stripe->init_error_bitmap = 0;
|
btrfs: scrub: also report errors hit during the initial read
[BUG]
After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:
# mkfs.btrfs -f $dev -d DUP
# mount $dev $mnt
# xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
# umount $dev
# xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
# mount $dev $mnt
# btrfs scrub start -BR $mnt
scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
Scrub started: Tue Jun 6 14:56:50 2023
Status: finished
Duration: 0:00:00
data_extents_scrubbed: 2
tree_extents_scrubbed: 18
data_bytes_scrubbed: 131072
tree_bytes_scrubbed: 294912
read_errors: 0
csum_errors: 0 <<< No errors here
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<< Only corrected errors
last_physical: 2723151872
This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.
While on v6.3.x kernels, the report is different:
csum_errors: 16 <<<
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<<
[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.
For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.
Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.
To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.
Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06 15:08:28 +08:00
|
|
|
stripe->init_nr_io_errors = 0;
|
|
|
|
stripe->init_nr_csum_errors = 0;
|
|
|
|
stripe->init_nr_meta_errors = 0;
|
2023-03-20 10:12:51 +08:00
|
|
|
stripe->error_bitmap = 0;
|
|
|
|
stripe->io_error_bitmap = 0;
|
|
|
|
stripe->csum_error_bitmap = 0;
|
|
|
|
stripe->meta_error_bitmap = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Locate one stripe which has at least one extent in its range.
|
|
|
|
*
|
|
|
|
* Return 0 if found such stripe, and store its info into @stripe.
|
|
|
|
* Return >0 if there is no such stripe in the specified range.
|
|
|
|
* Return <0 for error.
|
|
|
|
*/
|
2023-03-20 10:12:57 +08:00
|
|
|
static int scrub_find_fill_first_stripe(struct btrfs_block_group *bg,
|
2023-08-03 14:33:29 +08:00
|
|
|
struct btrfs_path *extent_path,
|
2023-08-03 14:33:30 +08:00
|
|
|
struct btrfs_path *csum_path,
|
2023-03-20 10:12:57 +08:00
|
|
|
struct btrfs_device *dev, u64 physical,
|
|
|
|
int mirror_num, u64 logical_start,
|
|
|
|
u32 logical_len,
|
|
|
|
struct scrub_stripe *stripe)
|
2023-03-20 10:12:51 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = bg->fs_info;
|
|
|
|
struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bg->start);
|
|
|
|
struct btrfs_root *csum_root = btrfs_csum_root(fs_info, bg->start);
|
|
|
|
const u64 logical_end = logical_start + logical_len;
|
|
|
|
u64 cur_logical = logical_start;
|
|
|
|
u64 stripe_end;
|
|
|
|
u64 extent_start;
|
|
|
|
u64 extent_len;
|
|
|
|
u64 extent_flags;
|
|
|
|
u64 extent_gen;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
memset(stripe->sectors, 0, sizeof(struct scrub_sector_verification) *
|
|
|
|
stripe->nr_sectors);
|
|
|
|
scrub_stripe_reset_bitmaps(stripe);
|
|
|
|
|
|
|
|
/* The range must be inside the bg. */
|
|
|
|
ASSERT(logical_start >= bg->start && logical_end <= bg->start + bg->length);
|
|
|
|
|
2023-08-03 14:33:29 +08:00
|
|
|
ret = find_first_extent_item(extent_root, extent_path, logical_start,
|
|
|
|
logical_len);
|
2023-03-20 10:12:51 +08:00
|
|
|
/* Either error or not found. */
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
2023-08-03 14:33:29 +08:00
|
|
|
get_extent_info(extent_path, &extent_start, &extent_len, &extent_flags,
|
|
|
|
&extent_gen);
|
2023-03-20 10:12:56 +08:00
|
|
|
if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
|
|
|
|
stripe->nr_meta_extents++;
|
|
|
|
if (extent_flags & BTRFS_EXTENT_FLAG_DATA)
|
|
|
|
stripe->nr_data_extents++;
|
2023-03-20 10:12:51 +08:00
|
|
|
cur_logical = max(extent_start, cur_logical);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Round down to stripe boundary.
|
|
|
|
*
|
|
|
|
* The extra calculation against bg->start is to handle block groups
|
|
|
|
* whose logical bytenr is not BTRFS_STRIPE_LEN aligned.
|
|
|
|
*/
|
|
|
|
stripe->logical = round_down(cur_logical - bg->start, BTRFS_STRIPE_LEN) +
|
|
|
|
bg->start;
|
|
|
|
stripe->physical = physical + stripe->logical - logical_start;
|
|
|
|
stripe->dev = dev;
|
|
|
|
stripe->bg = bg;
|
|
|
|
stripe->mirror_num = mirror_num;
|
|
|
|
stripe_end = stripe->logical + BTRFS_STRIPE_LEN - 1;
|
|
|
|
|
|
|
|
/* Fill the first extent info into stripe->sectors[] array. */
|
|
|
|
fill_one_extent_info(fs_info, stripe, extent_start, extent_len,
|
|
|
|
extent_flags, extent_gen);
|
|
|
|
cur_logical = extent_start + extent_len;
|
|
|
|
|
|
|
|
/* Fill the extent info for the remaining sectors. */
|
|
|
|
while (cur_logical <= stripe_end) {
|
2023-08-03 14:33:29 +08:00
|
|
|
ret = find_first_extent_item(extent_root, extent_path, cur_logical,
|
2023-03-20 10:12:51 +08:00
|
|
|
stripe_end - cur_logical + 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
2023-08-03 14:33:29 +08:00
|
|
|
get_extent_info(extent_path, &extent_start, &extent_len,
|
2023-03-20 10:12:51 +08:00
|
|
|
&extent_flags, &extent_gen);
|
2023-03-20 10:12:56 +08:00
|
|
|
if (extent_flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
|
|
|
|
stripe->nr_meta_extents++;
|
|
|
|
if (extent_flags & BTRFS_EXTENT_FLAG_DATA)
|
|
|
|
stripe->nr_data_extents++;
|
2023-03-20 10:12:51 +08:00
|
|
|
fill_one_extent_info(fs_info, stripe, extent_start, extent_len,
|
|
|
|
extent_flags, extent_gen);
|
|
|
|
cur_logical = extent_start + extent_len;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now fill the data csum. */
|
|
|
|
if (bg->flags & BTRFS_BLOCK_GROUP_DATA) {
|
|
|
|
int sector_nr;
|
|
|
|
unsigned long csum_bitmap = 0;
|
|
|
|
|
|
|
|
/* Csum space should have already been allocated. */
|
|
|
|
ASSERT(stripe->csums);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Our csum bitmap should be large enough, as BTRFS_STRIPE_LEN
|
|
|
|
* should contain at most 16 sectors.
|
|
|
|
*/
|
|
|
|
ASSERT(BITS_PER_LONG >= BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits);
|
|
|
|
|
2023-08-03 14:33:30 +08:00
|
|
|
ret = btrfs_lookup_csums_bitmap(csum_root, csum_path,
|
|
|
|
stripe->logical, stripe_end,
|
|
|
|
stripe->csums, &csum_bitmap);
|
2023-03-20 10:12:51 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0)
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
for_each_set_bit(sector_nr, &csum_bitmap, stripe->nr_sectors) {
|
|
|
|
stripe->sectors[sector_nr].csum = stripe->csums +
|
|
|
|
sector_nr * fs_info->csum_size;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state);
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:57 +08:00
|
|
|
static void scrub_reset_stripe(struct scrub_stripe *stripe)
|
|
|
|
{
|
|
|
|
scrub_stripe_reset_bitmaps(stripe);
|
|
|
|
|
|
|
|
stripe->nr_meta_extents = 0;
|
|
|
|
stripe->nr_data_extents = 0;
|
|
|
|
stripe->state = 0;
|
|
|
|
|
|
|
|
for (int i = 0; i < stripe->nr_sectors; i++) {
|
|
|
|
stripe->sectors[i].is_metadata = false;
|
|
|
|
stripe->sectors[i].csum = NULL;
|
|
|
|
stripe->sectors[i].generation = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-09-15 00:07:01 +08:00
|
|
|
static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
|
|
|
|
struct scrub_stripe *stripe)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
struct btrfs_bio *bbio = NULL;
|
2024-01-17 08:32:26 +08:00
|
|
|
unsigned int nr_sectors = min(BTRFS_STRIPE_LEN, stripe->bg->start +
|
|
|
|
stripe->bg->length - stripe->logical) >>
|
|
|
|
fs_info->sectorsize_bits;
|
2023-09-15 00:07:01 +08:00
|
|
|
u64 stripe_len = BTRFS_STRIPE_LEN;
|
|
|
|
int mirror = stripe->mirror_num;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
atomic_inc(&stripe->pending_io);
|
|
|
|
|
|
|
|
for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
|
|
|
|
struct page *page = scrub_stripe_get_page(stripe, i);
|
|
|
|
unsigned int pgoff = scrub_stripe_get_page_offset(stripe, i);
|
|
|
|
|
2024-01-17 08:32:26 +08:00
|
|
|
/* We're beyond the chunk boundary, no need to read anymore. */
|
|
|
|
if (i >= nr_sectors)
|
|
|
|
break;
|
|
|
|
|
2023-09-15 00:07:01 +08:00
|
|
|
/* The current sector cannot be merged, submit the bio. */
|
|
|
|
if (bbio &&
|
|
|
|
((i > 0 &&
|
|
|
|
!test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
|
|
|
|
bbio->bio.bi_iter.bi_size >= stripe_len)) {
|
|
|
|
ASSERT(bbio->bio.bi_iter.bi_size);
|
|
|
|
atomic_inc(&stripe->pending_io);
|
|
|
|
btrfs_submit_bio(bbio, mirror);
|
|
|
|
bbio = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!bbio) {
|
|
|
|
struct btrfs_io_stripe io_stripe = {};
|
|
|
|
struct btrfs_io_context *bioc = NULL;
|
|
|
|
const u64 logical = stripe->logical +
|
|
|
|
(i << fs_info->sectorsize_bits);
|
|
|
|
int err;
|
|
|
|
|
|
|
|
bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
|
|
|
|
fs_info, scrub_read_endio, stripe);
|
|
|
|
bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT;
|
|
|
|
|
|
|
|
io_stripe.is_scrub = true;
|
|
|
|
err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
|
|
|
|
&stripe_len, &bioc, &io_stripe,
|
|
|
|
&mirror);
|
|
|
|
btrfs_put_bioc(bioc);
|
|
|
|
if (err) {
|
|
|
|
btrfs_bio_end_io(bbio,
|
|
|
|
errno_to_blk_status(err));
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
__bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bbio) {
|
|
|
|
ASSERT(bbio->bio.bi_iter.bi_size);
|
|
|
|
atomic_inc(&stripe->pending_io);
|
|
|
|
btrfs_submit_bio(bbio, mirror);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (atomic_dec_and_test(&stripe->pending_io)) {
|
|
|
|
wake_up(&stripe->io_wait);
|
|
|
|
INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
|
|
|
|
queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:57 +08:00
|
|
|
static void scrub_submit_initial_read(struct scrub_ctx *sctx,
|
|
|
|
struct scrub_stripe *stripe)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
struct btrfs_bio *bbio;
|
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:
- "unable to find chunk map" errors
BTRFS info (device vdb): scrub: started on devid 1
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
This would lead to unrepariable errors.
- Use-after-free KASAN reports:
==================================================================
BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
Read of size 8 at addr ffff8881013c9040 by task btrfs/909
CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x43/0x60
print_report+0xcf/0x640
kasan_report+0xa6/0xd0
__blk_rq_map_sg+0x18f/0x7c0
virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
blk_mq_flush_plug_list.part.0+0x780/0x860
__blk_flush_plug+0x1ba/0x220
blk_finish_plug+0x3b/0x60
submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
__x64_sys_ioctl+0xbd/0x100
do_syscall_64+0x5d/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f47e5e0952b
- Crash, mostly due to above use-after-free
[CAUSE]
The converted fs has the following data chunk layout:
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
length 86016 owner 2 stripe_len 65536 type DATA|single
For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).
This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.
However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond chunk boundary.
- Make sure scrub_submit_initial_read() only to read the chunk range
This is done by calculating the real number of sectors we need to
read, and add sector-by-sector to the bio.
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With above fixes, we won't update error bit for range beyond chunk,
thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-17 08:32:25 +08:00
|
|
|
unsigned int nr_sectors = min(BTRFS_STRIPE_LEN, stripe->bg->start +
|
|
|
|
stripe->bg->length - stripe->logical) >>
|
|
|
|
fs_info->sectorsize_bits;
|
2023-03-20 10:12:57 +08:00
|
|
|
int mirror = stripe->mirror_num;
|
|
|
|
|
|
|
|
ASSERT(stripe->bg);
|
|
|
|
ASSERT(stripe->mirror_num > 0);
|
|
|
|
ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
|
|
|
|
|
2023-09-15 00:07:01 +08:00
|
|
|
if (btrfs_need_stripe_tree_update(fs_info, stripe->bg->flags)) {
|
|
|
|
scrub_submit_extent_sector_read(sctx, stripe);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:57 +08:00
|
|
|
bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
|
|
|
|
scrub_read_endio, stripe);
|
|
|
|
|
|
|
|
bbio->bio.bi_iter.bi_sector = stripe->logical >> SECTOR_SHIFT;
|
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:
- "unable to find chunk map" errors
BTRFS info (device vdb): scrub: started on devid 1
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
This would lead to unrepariable errors.
- Use-after-free KASAN reports:
==================================================================
BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
Read of size 8 at addr ffff8881013c9040 by task btrfs/909
CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x43/0x60
print_report+0xcf/0x640
kasan_report+0xa6/0xd0
__blk_rq_map_sg+0x18f/0x7c0
virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
blk_mq_flush_plug_list.part.0+0x780/0x860
__blk_flush_plug+0x1ba/0x220
blk_finish_plug+0x3b/0x60
submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
__x64_sys_ioctl+0xbd/0x100
do_syscall_64+0x5d/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f47e5e0952b
- Crash, mostly due to above use-after-free
[CAUSE]
The converted fs has the following data chunk layout:
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
length 86016 owner 2 stripe_len 65536 type DATA|single
For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).
This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.
However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond chunk boundary.
- Make sure scrub_submit_initial_read() only to read the chunk range
This is done by calculating the real number of sectors we need to
read, and add sector-by-sector to the bio.
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With above fixes, we won't update error bit for range beyond chunk,
thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-17 08:32:25 +08:00
|
|
|
/* Read the whole range inside the chunk boundary. */
|
|
|
|
for (unsigned int cur = 0; cur < nr_sectors; cur++) {
|
|
|
|
struct page *page = scrub_stripe_get_page(stripe, cur);
|
|
|
|
unsigned int pgoff = scrub_stripe_get_page_offset(stripe, cur);
|
2023-03-20 10:12:57 +08:00
|
|
|
int ret;
|
|
|
|
|
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:
- "unable to find chunk map" errors
BTRFS info (device vdb): scrub: started on devid 1
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
This would lead to unrepariable errors.
- Use-after-free KASAN reports:
==================================================================
BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
Read of size 8 at addr ffff8881013c9040 by task btrfs/909
CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x43/0x60
print_report+0xcf/0x640
kasan_report+0xa6/0xd0
__blk_rq_map_sg+0x18f/0x7c0
virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
blk_mq_flush_plug_list.part.0+0x780/0x860
__blk_flush_plug+0x1ba/0x220
blk_finish_plug+0x3b/0x60
submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
__x64_sys_ioctl+0xbd/0x100
do_syscall_64+0x5d/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f47e5e0952b
- Crash, mostly due to above use-after-free
[CAUSE]
The converted fs has the following data chunk layout:
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
length 86016 owner 2 stripe_len 65536 type DATA|single
For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).
This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.
However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond chunk boundary.
- Make sure scrub_submit_initial_read() only to read the chunk range
This is done by calculating the real number of sectors we need to
read, and add sector-by-sector to the bio.
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With above fixes, we won't update error bit for range beyond chunk,
thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-17 08:32:25 +08:00
|
|
|
ret = bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
|
2023-03-20 10:12:57 +08:00
|
|
|
/* We should have allocated enough bio vectors. */
|
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:
- "unable to find chunk map" errors
BTRFS info (device vdb): scrub: started on devid 1
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056
This would lead to unrepariable errors.
- Use-after-free KASAN reports:
==================================================================
BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
Read of size 8 at addr ffff8881013c9040 by task btrfs/909
CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x43/0x60
print_report+0xcf/0x640
kasan_report+0xa6/0xd0
__blk_rq_map_sg+0x18f/0x7c0
virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
blk_mq_flush_plug_list.part.0+0x780/0x860
__blk_flush_plug+0x1ba/0x220
blk_finish_plug+0x3b/0x60
submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
__x64_sys_ioctl+0xbd/0x100
do_syscall_64+0x5d/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f47e5e0952b
- Crash, mostly due to above use-after-free
[CAUSE]
The converted fs has the following data chunk layout:
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
length 86016 owner 2 stripe_len 65536 type DATA|single
For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).
This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.
However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond chunk boundary.
- Make sure scrub_submit_initial_read() only to read the chunk range
This is done by calculating the real number of sectors we need to
read, and add sector-by-sector to the bio.
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With above fixes, we won't update error bit for range beyond chunk,
thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-17 08:32:25 +08:00
|
|
|
ASSERT(ret == fs_info->sectorsize);
|
2023-03-20 10:12:57 +08:00
|
|
|
}
|
|
|
|
atomic_inc(&stripe->pending_io);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For dev-replace, either user asks to avoid the source dev, or
|
|
|
|
* the device is missing, we try the next mirror instead.
|
|
|
|
*/
|
|
|
|
if (sctx->is_dev_replace &&
|
|
|
|
(fs_info->dev_replace.cont_reading_from_srcdev_mode ==
|
|
|
|
BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID ||
|
|
|
|
!stripe->dev->bdev)) {
|
|
|
|
int num_copies = btrfs_num_copies(fs_info, stripe->bg->start,
|
|
|
|
stripe->bg->length);
|
|
|
|
|
|
|
|
mirror = calc_next_mirror(mirror, num_copies);
|
|
|
|
}
|
|
|
|
btrfs_submit_bio(bbio, mirror);
|
|
|
|
}
|
|
|
|
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
static bool stripe_has_metadata_error(struct scrub_stripe *stripe)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for_each_set_bit(i, &stripe->error_bitmap, stripe->nr_sectors) {
|
|
|
|
if (stripe->sectors[i].is_metadata) {
|
|
|
|
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
|
|
|
|
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"stripe %llu has unrepaired metadata sector at %llu",
|
|
|
|
stripe->logical,
|
|
|
|
stripe->logical + (i << fs_info->sectorsize_bits));
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
static void submit_initial_group_read(struct scrub_ctx *sctx,
|
|
|
|
unsigned int first_slot,
|
|
|
|
unsigned int nr_stripes)
|
|
|
|
{
|
|
|
|
struct blk_plug plug;
|
|
|
|
|
|
|
|
ASSERT(first_slot < SCRUB_TOTAL_STRIPES);
|
|
|
|
ASSERT(first_slot + nr_stripes <= SCRUB_TOTAL_STRIPES);
|
|
|
|
|
|
|
|
scrub_throttle_dev_io(sctx, sctx->stripes[0].dev,
|
|
|
|
btrfs_stripe_nr_to_offset(nr_stripes));
|
|
|
|
blk_start_plug(&plug);
|
|
|
|
for (int i = 0; i < nr_stripes; i++) {
|
|
|
|
struct scrub_stripe *stripe = &sctx->stripes[first_slot + i];
|
|
|
|
|
|
|
|
/* Those stripes should be initialized. */
|
|
|
|
ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
|
|
|
|
scrub_submit_initial_read(sctx, stripe);
|
|
|
|
}
|
|
|
|
blk_finish_plug(&plug);
|
|
|
|
}
|
|
|
|
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
static int flush_scrub_stripes(struct scrub_ctx *sctx)
|
2023-03-20 10:12:57 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
struct scrub_stripe *stripe;
|
|
|
|
const int nr_stripes = sctx->cur_stripe;
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
int ret = 0;
|
2023-03-20 10:12:57 +08:00
|
|
|
|
|
|
|
if (!nr_stripes)
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
return 0;
|
2023-03-20 10:12:57 +08:00
|
|
|
|
|
|
|
ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state));
|
2023-03-20 10:12:58 +08:00
|
|
|
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
/* Submit the stripes which are populated but not submitted. */
|
|
|
|
if (nr_stripes % SCRUB_STRIPES_PER_GROUP) {
|
|
|
|
const int first_slot = round_down(nr_stripes, SCRUB_STRIPES_PER_GROUP);
|
|
|
|
|
|
|
|
submit_initial_group_read(sctx, first_slot, nr_stripes - first_slot);
|
2023-03-20 10:12:57 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
for (int i = 0; i < nr_stripes; i++) {
|
|
|
|
stripe = &sctx->stripes[i];
|
|
|
|
|
|
|
|
wait_event(stripe->repair_wait,
|
|
|
|
test_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Submit for dev-replace. */
|
|
|
|
if (sctx->is_dev_replace) {
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
/*
|
|
|
|
* For dev-replace, if we know there is something wrong with
|
2023-12-06 02:26:39 +08:00
|
|
|
* metadata, we should immediately abort.
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
*/
|
|
|
|
for (int i = 0; i < nr_stripes; i++) {
|
|
|
|
if (stripe_has_metadata_error(&sctx->stripes[i])) {
|
|
|
|
ret = -EIO;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
2023-03-20 10:12:57 +08:00
|
|
|
for (int i = 0; i < nr_stripes; i++) {
|
|
|
|
unsigned long good;
|
|
|
|
|
|
|
|
stripe = &sctx->stripes[i];
|
|
|
|
|
|
|
|
ASSERT(stripe->dev == fs_info->dev_replace.srcdev);
|
|
|
|
|
|
|
|
bitmap_andnot(&good, &stripe->extent_sector_bitmap,
|
|
|
|
&stripe->error_bitmap, stripe->nr_sectors);
|
|
|
|
scrub_write_sectors(sctx, stripe, good, true);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Wait for the above writebacks to finish. */
|
|
|
|
for (int i = 0; i < nr_stripes; i++) {
|
|
|
|
stripe = &sctx->stripes[i];
|
|
|
|
|
|
|
|
wait_scrub_stripe_io(stripe);
|
|
|
|
scrub_reset_stripe(stripe);
|
|
|
|
}
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
out:
|
2023-03-20 10:12:57 +08:00
|
|
|
sctx->cur_stripe = 0;
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
return ret;
|
2023-03-20 10:12:57 +08:00
|
|
|
}
|
|
|
|
|
2023-03-28 16:57:35 +08:00
|
|
|
static void raid56_scrub_wait_endio(struct bio *bio)
|
|
|
|
{
|
|
|
|
complete(bio->bi_private);
|
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:58 +08:00
|
|
|
static int queue_scrub_stripe(struct scrub_ctx *sctx, struct btrfs_block_group *bg,
|
|
|
|
struct btrfs_device *dev, int mirror_num,
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
u64 logical, u32 length, u64 physical,
|
|
|
|
u64 *found_logical_ret)
|
2023-03-20 10:12:57 +08:00
|
|
|
{
|
|
|
|
struct scrub_stripe *stripe;
|
|
|
|
int ret;
|
|
|
|
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
/*
|
|
|
|
* There should always be one slot left, as caller filling the last
|
|
|
|
* slot should flush them all.
|
|
|
|
*/
|
|
|
|
ASSERT(sctx->cur_stripe < SCRUB_TOTAL_STRIPES);
|
2023-03-20 10:12:57 +08:00
|
|
|
|
2023-10-28 10:58:45 +08:00
|
|
|
/* @found_logical_ret must be specified. */
|
|
|
|
ASSERT(found_logical_ret);
|
|
|
|
|
2023-03-20 10:12:57 +08:00
|
|
|
stripe = &sctx->stripes[sctx->cur_stripe];
|
|
|
|
scrub_reset_stripe(stripe);
|
2023-08-03 14:33:30 +08:00
|
|
|
ret = scrub_find_fill_first_stripe(bg, &sctx->extent_path,
|
|
|
|
&sctx->csum_path, dev, physical,
|
|
|
|
mirror_num, logical, length, stripe);
|
2023-03-20 10:12:57 +08:00
|
|
|
/* Either >0 as no more extents or <0 for error. */
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2023-10-28 10:58:45 +08:00
|
|
|
*found_logical_ret = stripe->logical;
|
2023-03-20 10:12:57 +08:00
|
|
|
sctx->cur_stripe++;
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
|
|
|
|
/* We filled one group, submit it. */
|
|
|
|
if (sctx->cur_stripe % SCRUB_STRIPES_PER_GROUP == 0) {
|
|
|
|
const int first_slot = sctx->cur_stripe - SCRUB_STRIPES_PER_GROUP;
|
|
|
|
|
|
|
|
submit_initial_group_read(sctx, first_slot, SCRUB_STRIPES_PER_GROUP);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Last slot used, flush them all. */
|
|
|
|
if (sctx->cur_stripe == SCRUB_TOTAL_STRIPES)
|
|
|
|
return flush_scrub_stripes(sctx);
|
2023-03-20 10:12:57 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-03-28 16:57:35 +08:00
|
|
|
static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
|
|
|
|
struct btrfs_device *scrub_dev,
|
|
|
|
struct btrfs_block_group *bg,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map,
|
2023-03-28 16:57:35 +08:00
|
|
|
u64 full_stripe_start)
|
|
|
|
{
|
|
|
|
DECLARE_COMPLETION_ONSTACK(io_done);
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
struct btrfs_raid_bio *rbio;
|
|
|
|
struct btrfs_io_context *bioc = NULL;
|
2023-08-03 14:33:29 +08:00
|
|
|
struct btrfs_path extent_path = { 0 };
|
2023-08-03 14:33:30 +08:00
|
|
|
struct btrfs_path csum_path = { 0 };
|
2023-03-28 16:57:35 +08:00
|
|
|
struct bio *bio;
|
|
|
|
struct scrub_stripe *stripe;
|
|
|
|
bool all_empty = true;
|
|
|
|
const int data_stripes = nr_data_stripes(map);
|
|
|
|
unsigned long extent_bitmap = 0;
|
2023-06-22 14:42:40 +08:00
|
|
|
u64 length = btrfs_stripe_nr_to_offset(data_stripes);
|
2023-03-28 16:57:35 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ASSERT(sctx->raid56_data_stripes);
|
|
|
|
|
2023-08-03 14:33:29 +08:00
|
|
|
/*
|
2023-08-03 14:33:30 +08:00
|
|
|
* For data stripe search, we cannot re-use the same extent/csum paths,
|
|
|
|
* as the data stripe bytenr may be smaller than previous extent. Thus
|
|
|
|
* we have to use our own extent/csum paths.
|
2023-08-03 14:33:29 +08:00
|
|
|
*/
|
|
|
|
extent_path.search_commit_root = 1;
|
|
|
|
extent_path.skip_locking = 1;
|
2023-08-03 14:33:30 +08:00
|
|
|
csum_path.search_commit_root = 1;
|
|
|
|
csum_path.skip_locking = 1;
|
2023-08-03 14:33:29 +08:00
|
|
|
|
2023-03-28 16:57:35 +08:00
|
|
|
for (int i = 0; i < data_stripes; i++) {
|
|
|
|
int stripe_index;
|
|
|
|
int rot;
|
|
|
|
u64 physical;
|
|
|
|
|
|
|
|
stripe = &sctx->raid56_data_stripes[i];
|
|
|
|
rot = div_u64(full_stripe_start - bg->start,
|
|
|
|
data_stripes) >> BTRFS_STRIPE_LEN_SHIFT;
|
|
|
|
stripe_index = (i + rot) % map->num_stripes;
|
|
|
|
physical = map->stripes[stripe_index].physical +
|
2023-06-22 14:42:40 +08:00
|
|
|
btrfs_stripe_nr_to_offset(rot);
|
2023-03-28 16:57:35 +08:00
|
|
|
|
|
|
|
scrub_reset_stripe(stripe);
|
|
|
|
set_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state);
|
2023-08-03 14:33:30 +08:00
|
|
|
ret = scrub_find_fill_first_stripe(bg, &extent_path, &csum_path,
|
2023-03-28 16:57:35 +08:00
|
|
|
map->stripes[stripe_index].dev, physical, 1,
|
2023-06-22 14:42:40 +08:00
|
|
|
full_stripe_start + btrfs_stripe_nr_to_offset(i),
|
2023-03-28 16:57:35 +08:00
|
|
|
BTRFS_STRIPE_LEN, stripe);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
/*
|
|
|
|
* No extent in this data stripe, need to manually mark them
|
|
|
|
* initialized to make later read submission happy.
|
|
|
|
*/
|
|
|
|
if (ret > 0) {
|
|
|
|
stripe->logical = full_stripe_start +
|
2023-06-22 14:42:40 +08:00
|
|
|
btrfs_stripe_nr_to_offset(i);
|
2023-03-28 16:57:35 +08:00
|
|
|
stripe->dev = map->stripes[stripe_index].dev;
|
|
|
|
stripe->mirror_num = 1;
|
|
|
|
set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Check if all data stripes are empty. */
|
|
|
|
for (int i = 0; i < data_stripes; i++) {
|
|
|
|
stripe = &sctx->raid56_data_stripes[i];
|
|
|
|
if (!bitmap_empty(&stripe->extent_sector_bitmap, stripe->nr_sectors)) {
|
|
|
|
all_empty = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (all_empty) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (int i = 0; i < data_stripes; i++) {
|
|
|
|
stripe = &sctx->raid56_data_stripes[i];
|
|
|
|
scrub_submit_initial_read(sctx, stripe);
|
|
|
|
}
|
|
|
|
for (int i = 0; i < data_stripes; i++) {
|
|
|
|
stripe = &sctx->raid56_data_stripes[i];
|
|
|
|
|
|
|
|
wait_event(stripe->repair_wait,
|
|
|
|
test_bit(SCRUB_STRIPE_FLAG_REPAIR_DONE, &stripe->state));
|
|
|
|
}
|
|
|
|
/* For now, no zoned support for RAID56. */
|
|
|
|
ASSERT(!btrfs_is_zoned(sctx->fs_info));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now all data stripes are properly verified. Check if we have any
|
|
|
|
* unrepaired, if so abort immediately or we could further corrupt the
|
|
|
|
* P/Q stripes.
|
|
|
|
*
|
|
|
|
* During the loop, also populate extent_bitmap.
|
|
|
|
*/
|
|
|
|
for (int i = 0; i < data_stripes; i++) {
|
|
|
|
unsigned long error;
|
|
|
|
|
|
|
|
stripe = &sctx->raid56_data_stripes[i];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We should only check the errors where there is an extent.
|
|
|
|
* As we may hit an empty data stripe while it's missing.
|
|
|
|
*/
|
|
|
|
bitmap_and(&error, &stripe->error_bitmap,
|
|
|
|
&stripe->extent_sector_bitmap, stripe->nr_sectors);
|
|
|
|
if (!bitmap_empty(&error, stripe->nr_sectors)) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"unrepaired sectors detected, full stripe %llu data stripe %u errors %*pbl",
|
|
|
|
full_stripe_start, i, stripe->nr_sectors,
|
|
|
|
&error);
|
|
|
|
ret = -EIO;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
bitmap_or(&extent_bitmap, &extent_bitmap,
|
|
|
|
&stripe->extent_sector_bitmap, stripe->nr_sectors);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now we can check and regenerate the P/Q stripe. */
|
|
|
|
bio = bio_alloc(NULL, 1, REQ_OP_READ, GFP_NOFS);
|
|
|
|
bio->bi_iter.bi_sector = full_stripe_start >> SECTOR_SHIFT;
|
|
|
|
bio->bi_private = &io_done;
|
|
|
|
bio->bi_end_io = raid56_scrub_wait_endio;
|
|
|
|
|
|
|
|
btrfs_bio_counter_inc_blocked(fs_info);
|
2023-05-31 12:17:38 +08:00
|
|
|
ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, full_stripe_start,
|
2023-09-17 18:06:21 +08:00
|
|
|
&length, &bioc, NULL, NULL);
|
2023-03-28 16:57:35 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_put_bioc(bioc);
|
|
|
|
btrfs_bio_counter_dec(fs_info);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
rbio = raid56_parity_alloc_scrub_rbio(bio, bioc, scrub_dev, &extent_bitmap,
|
|
|
|
BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits);
|
|
|
|
btrfs_put_bioc(bioc);
|
|
|
|
if (!rbio) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
btrfs_bio_counter_dec(fs_info);
|
|
|
|
goto out;
|
|
|
|
}
|
btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read
For P/Q stripe scrub, we have quite some duplicated read IO:
- Data stripes read for verification
This is triggered by the scrub_submit_initial_read() inside
scrub_raid56_parity_stripe().
- Data stripes read (again) for P/Q stripe verification
This is triggered by scrub_assemble_read_bios() from scrub_rbio().
Although we can have hit rbio cache and avoid unnecessary read, the
chance is very low, as scrub would easily flush the whole rbio cache.
This means, even we're just scrubbing a single P/Q stripe, we would read
the data stripes twice for the best case scenario. If we need to
recover some data stripes, it would cause more reads on the same data
stripes, again and again.
However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.
To avoid such cache miss, this patch would:
- Introduce a new helper, raid56_parity_cache_data_pages()
This function would grab the pages from an array, and copy the content
to the rbio, marking all the involved sectors uptodate.
The page copy is unavoidable because of the cache pages of rbio are all
self managed, thus can not utilize outside pages without screwing up
the lifespan.
- Use the repaired data stripes as cache inside
scrub_raid56_parity_stripe()
By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-13 13:57:18 +08:00
|
|
|
/* Use the recovered stripes as cache to avoid read them from disk again. */
|
|
|
|
for (int i = 0; i < data_stripes; i++) {
|
|
|
|
stripe = &sctx->raid56_data_stripes[i];
|
|
|
|
|
|
|
|
raid56_parity_cache_data_pages(rbio, stripe->pages,
|
|
|
|
full_stripe_start + (i << BTRFS_STRIPE_LEN_SHIFT));
|
|
|
|
}
|
2023-03-28 16:57:35 +08:00
|
|
|
raid56_parity_submit_scrub_rbio(rbio);
|
|
|
|
wait_for_completion_io(&io_done);
|
|
|
|
ret = blk_status_to_errno(bio->bi_status);
|
|
|
|
bio_put(bio);
|
|
|
|
btrfs_bio_counter_dec(fs_info);
|
|
|
|
|
2023-08-03 14:33:29 +08:00
|
|
|
btrfs_release_path(&extent_path);
|
2023-08-03 14:33:30 +08:00
|
|
|
btrfs_release_path(&csum_path);
|
2023-03-28 16:57:35 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-03-11 15:38:43 +08:00
|
|
|
/*
|
|
|
|
* Scrub one range which can only has simple mirror based profile.
|
|
|
|
* (Including all range in SINGLE/DUP/RAID1/RAID1C*, and each stripe in
|
|
|
|
* RAID0/RAID10).
|
|
|
|
*
|
|
|
|
* Since we may need to handle a subset of block group, we need @logical_start
|
|
|
|
* and @logical_length parameter.
|
|
|
|
*/
|
|
|
|
static int scrub_simple_mirror(struct scrub_ctx *sctx,
|
|
|
|
struct btrfs_block_group *bg,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map,
|
2022-03-11 15:38:43 +08:00
|
|
|
u64 logical_start, u64 logical_length,
|
|
|
|
struct btrfs_device *device,
|
|
|
|
u64 physical, int mirror_num)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
const u64 logical_end = logical_start + logical_length;
|
|
|
|
u64 cur_logical = logical_start;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* The range must be inside the bg */
|
|
|
|
ASSERT(logical_start >= bg->start && logical_end <= bg->start + bg->length);
|
|
|
|
|
|
|
|
/* Go through each extent items inside the logical range */
|
|
|
|
while (cur_logical < logical_end) {
|
2023-10-28 10:58:45 +08:00
|
|
|
u64 found_logical = U64_MAX;
|
2023-03-20 10:12:58 +08:00
|
|
|
u64 cur_physical = physical + cur_logical - logical_start;
|
2022-03-11 15:38:43 +08:00
|
|
|
|
|
|
|
/* Canceled? */
|
|
|
|
if (atomic_read(&fs_info->scrub_cancel_req) ||
|
|
|
|
atomic_read(&sctx->cancel_req)) {
|
|
|
|
ret = -ECANCELED;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
/* Paused? */
|
|
|
|
if (atomic_read(&fs_info->scrub_pause_req)) {
|
|
|
|
/* Push queued extents */
|
|
|
|
scrub_blocked_if_needed(fs_info);
|
|
|
|
}
|
|
|
|
/* Block group removed? */
|
|
|
|
spin_lock(&bg->lock);
|
2022-07-16 03:45:24 +08:00
|
|
|
if (test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags)) {
|
2022-03-11 15:38:43 +08:00
|
|
|
spin_unlock(&bg->lock);
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&bg->lock);
|
|
|
|
|
2023-03-20 10:12:58 +08:00
|
|
|
ret = queue_scrub_stripe(sctx, bg, device, mirror_num,
|
|
|
|
cur_logical, logical_end - cur_logical,
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
cur_physical, &found_logical);
|
2022-03-11 15:38:43 +08:00
|
|
|
if (ret > 0) {
|
|
|
|
/* No more extent, just update the accounting */
|
|
|
|
sctx->stat.last_physical = physical + logical_length;
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
|
2023-10-28 10:58:45 +08:00
|
|
|
/* queue_scrub_stripe() returned 0, @found_logical must be updated. */
|
|
|
|
ASSERT(found_logical != U64_MAX);
|
btrfs: scrub: fix grouping of read IO
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-03 14:33:31 +08:00
|
|
|
cur_logical = found_logical + BTRFS_STRIPE_LEN;
|
2023-03-20 10:12:58 +08:00
|
|
|
|
2022-03-11 15:38:43 +08:00
|
|
|
/* Don't hold CPU for too long time */
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-03-11 15:38:44 +08:00
|
|
|
/* Calculate the full stripe length for simple stripe based profiles */
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
static u64 simple_stripe_full_stripe_len(const struct btrfs_chunk_map *map)
|
2022-03-11 15:38:44 +08:00
|
|
|
{
|
|
|
|
ASSERT(map->type & (BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10));
|
|
|
|
|
2023-06-22 14:42:40 +08:00
|
|
|
return btrfs_stripe_nr_to_offset(map->num_stripes / map->sub_stripes);
|
2022-03-11 15:38:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Get the logical bytenr for the stripe */
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
static u64 simple_stripe_get_logical(struct btrfs_chunk_map *map,
|
2022-03-11 15:38:44 +08:00
|
|
|
struct btrfs_block_group *bg,
|
|
|
|
int stripe_index)
|
|
|
|
{
|
|
|
|
ASSERT(map->type & (BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10));
|
|
|
|
ASSERT(stripe_index < map->num_stripes);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* (stripe_index / sub_stripes) gives how many data stripes we need to
|
|
|
|
* skip.
|
|
|
|
*/
|
2023-06-22 14:42:40 +08:00
|
|
|
return btrfs_stripe_nr_to_offset(stripe_index / map->sub_stripes) +
|
2023-02-17 13:36:58 +08:00
|
|
|
bg->start;
|
2022-03-11 15:38:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Get the mirror number for the stripe */
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
static int simple_stripe_mirror_num(struct btrfs_chunk_map *map, int stripe_index)
|
2022-03-11 15:38:44 +08:00
|
|
|
{
|
|
|
|
ASSERT(map->type & (BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10));
|
|
|
|
ASSERT(stripe_index < map->num_stripes);
|
|
|
|
|
|
|
|
/* For RAID0, it's fixed to 1, for RAID10 it's 0,1,0,1... */
|
|
|
|
return stripe_index % map->sub_stripes + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int scrub_simple_stripe(struct scrub_ctx *sctx,
|
|
|
|
struct btrfs_block_group *bg,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map,
|
2022-03-11 15:38:44 +08:00
|
|
|
struct btrfs_device *device,
|
|
|
|
int stripe_index)
|
|
|
|
{
|
|
|
|
const u64 logical_increment = simple_stripe_full_stripe_len(map);
|
|
|
|
const u64 orig_logical = simple_stripe_get_logical(map, bg, stripe_index);
|
|
|
|
const u64 orig_physical = map->stripes[stripe_index].physical;
|
|
|
|
const int mirror_num = simple_stripe_mirror_num(map, stripe_index);
|
|
|
|
u64 cur_logical = orig_logical;
|
|
|
|
u64 cur_physical = orig_physical;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
while (cur_logical < bg->start + bg->length) {
|
|
|
|
/*
|
|
|
|
* Inside each stripe, RAID0 is just SINGLE, and RAID10 is
|
|
|
|
* just RAID1, so we can reuse scrub_simple_mirror() to scrub
|
|
|
|
* this stripe.
|
|
|
|
*/
|
2023-01-16 15:04:13 +08:00
|
|
|
ret = scrub_simple_mirror(sctx, bg, map, cur_logical,
|
|
|
|
BTRFS_STRIPE_LEN, device, cur_physical,
|
|
|
|
mirror_num);
|
2022-03-11 15:38:44 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
/* Skip to next stripe which belongs to the target device */
|
|
|
|
cur_logical += logical_increment;
|
|
|
|
/* For physical offset, we just go to next stripe */
|
2023-02-17 13:36:58 +08:00
|
|
|
cur_physical += BTRFS_STRIPE_LEN;
|
2022-03-11 15:38:44 +08:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
|
2021-12-15 14:59:42 +08:00
|
|
|
struct btrfs_block_group *bg,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map,
|
2012-11-02 20:26:57 +08:00
|
|
|
struct btrfs_device *scrub_dev,
|
2022-05-13 16:34:28 +08:00
|
|
|
int stripe_index)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
2016-06-23 06:54:56 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
2022-03-11 15:38:43 +08:00
|
|
|
const u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
|
2021-12-15 14:59:42 +08:00
|
|
|
const u64 chunk_logical = bg->start;
|
2011-03-08 21:14:00 +08:00
|
|
|
int ret;
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
int ret2;
|
2022-03-11 15:38:41 +08:00
|
|
|
u64 physical = map->stripes[stripe_index].physical;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
const u64 dev_stripe_len = btrfs_calc_stripe_length(map);
|
2022-05-13 16:34:28 +08:00
|
|
|
const u64 physical_end = physical + dev_stripe_len;
|
2011-03-08 21:14:00 +08:00
|
|
|
u64 logical;
|
2013-04-27 10:56:57 +08:00
|
|
|
u64 logic_end;
|
2022-03-11 15:38:46 +08:00
|
|
|
/* The logical increment after finishing one stripe */
|
2022-01-21 19:42:24 +08:00
|
|
|
u64 increment;
|
2022-03-11 15:38:46 +08:00
|
|
|
/* Offset inside the chunk */
|
2011-03-08 21:14:00 +08:00
|
|
|
u64 offset;
|
Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksum in the same stripe.
All the data which has checksum is COW data, and we are sure
that it is not changed though we don't lock the stripe. because
the space of that data just can be reclaimed after the current
transction is committed, and then the fs can use it to store the
other data, but when doing scrub, we hold the current transaction,
that is that data can not be recovered, it is safe that read and check
it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
The data without checksum and the parity may be changed if we don't
lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
is not right
- Unlock the stripe
If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.
And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-11-06 17:20:58 +08:00
|
|
|
u64 stripe_logical;
|
2014-04-01 18:01:43 +08:00
|
|
|
int stop_loop = 0;
|
2013-01-30 07:40:14 +08:00
|
|
|
|
2023-08-03 14:33:29 +08:00
|
|
|
/* Extent_path should be released by now. */
|
|
|
|
ASSERT(sctx->extent_path.nodes[0] == NULL);
|
|
|
|
|
2013-12-04 21:16:53 +08:00
|
|
|
scrub_blocked_if_needed(fs_info);
|
2011-06-10 18:39:23 +08:00
|
|
|
|
2021-02-04 18:22:13 +08:00
|
|
|
if (sctx->is_dev_replace &&
|
|
|
|
btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
|
|
|
|
mutex_lock(&sctx->wr_lock);
|
|
|
|
sctx->write_pointer = physical;
|
|
|
|
mutex_unlock(&sctx->wr_lock);
|
|
|
|
}
|
|
|
|
|
2023-03-28 16:57:35 +08:00
|
|
|
/* Prepare the extra data stripes used by RAID56. */
|
|
|
|
if (profile & BTRFS_BLOCK_GROUP_RAID56_MASK) {
|
|
|
|
ASSERT(sctx->raid56_data_stripes == NULL);
|
|
|
|
|
|
|
|
sctx->raid56_data_stripes = kcalloc(nr_data_stripes(map),
|
|
|
|
sizeof(struct scrub_stripe),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!sctx->raid56_data_stripes) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
for (int i = 0; i < nr_data_stripes(map); i++) {
|
|
|
|
ret = init_scrub_stripe(fs_info,
|
|
|
|
&sctx->raid56_data_stripes[i]);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
sctx->raid56_data_stripes[i].bg = bg;
|
|
|
|
sctx->raid56_data_stripes[i].sctx = sctx;
|
|
|
|
}
|
|
|
|
}
|
2022-03-11 15:38:43 +08:00
|
|
|
/*
|
|
|
|
* There used to be a big double loop to handle all profiles using the
|
|
|
|
* same routine, which grows larger and more gross over time.
|
|
|
|
*
|
|
|
|
* So here we handle each profile differently, so simpler profiles
|
|
|
|
* have simpler scrubbing function.
|
|
|
|
*/
|
|
|
|
if (!(profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID56_MASK))) {
|
|
|
|
/*
|
|
|
|
* Above check rules out all complex profile, the remaining
|
|
|
|
* profiles are SINGLE|DUP|RAID1|RAID1C*, which is simple
|
|
|
|
* mirrored duplication without stripe.
|
|
|
|
*
|
|
|
|
* Only @physical and @mirror_num needs to calculated using
|
|
|
|
* @stripe_index.
|
|
|
|
*/
|
2023-01-16 15:04:13 +08:00
|
|
|
ret = scrub_simple_mirror(sctx, bg, map, bg->start, bg->length,
|
|
|
|
scrub_dev, map->stripes[stripe_index].physical,
|
2022-03-11 15:38:43 +08:00
|
|
|
stripe_index + 1);
|
2022-03-11 15:38:45 +08:00
|
|
|
offset = 0;
|
2022-03-11 15:38:43 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2022-03-11 15:38:44 +08:00
|
|
|
if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10)) {
|
2023-01-16 15:04:13 +08:00
|
|
|
ret = scrub_simple_stripe(sctx, bg, map, scrub_dev, stripe_index);
|
2023-06-22 14:42:40 +08:00
|
|
|
offset = btrfs_stripe_nr_to_offset(stripe_index / map->sub_stripes);
|
2022-03-11 15:38:44 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Only RAID56 goes through the old code */
|
|
|
|
ASSERT(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK);
|
2011-03-08 21:14:00 +08:00
|
|
|
ret = 0;
|
2022-03-11 15:38:45 +08:00
|
|
|
|
|
|
|
/* Calculate the logical end of the stripe */
|
|
|
|
get_raid56_logic_offset(physical_end, stripe_index,
|
|
|
|
map, &logic_end, NULL);
|
|
|
|
logic_end += chunk_logical;
|
|
|
|
|
|
|
|
/* Initialize @offset in case we need to go to out: label */
|
|
|
|
get_raid56_logic_offset(physical, stripe_index, map, &offset, NULL);
|
2023-06-22 14:42:40 +08:00
|
|
|
increment = btrfs_stripe_nr_to_offset(nr_data_stripes(map));
|
2022-03-11 15:38:45 +08:00
|
|
|
|
2022-03-11 15:38:46 +08:00
|
|
|
/*
|
|
|
|
* Due to the rotation, for RAID56 it's better to iterate each stripe
|
|
|
|
* using their physical offset.
|
|
|
|
*/
|
2014-04-01 18:01:43 +08:00
|
|
|
while (physical < physical_end) {
|
2022-03-11 15:38:46 +08:00
|
|
|
ret = get_raid56_logic_offset(physical, stripe_index, map,
|
|
|
|
&logical, &stripe_logical);
|
2022-03-11 15:38:45 +08:00
|
|
|
logical += chunk_logical;
|
|
|
|
if (ret) {
|
|
|
|
/* it is parity strip */
|
|
|
|
stripe_logical += chunk_logical;
|
2023-03-28 16:57:35 +08:00
|
|
|
ret = scrub_raid56_parity_stripe(sctx, scrub_dev, bg,
|
|
|
|
map, stripe_logical);
|
2022-03-11 15:38:45 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
2022-03-11 15:38:46 +08:00
|
|
|
goto next;
|
2015-07-21 12:22:29 +08:00
|
|
|
}
|
|
|
|
|
2022-03-11 15:38:46 +08:00
|
|
|
/*
|
|
|
|
* Now we're at a data stripe, scrub each extents in the range.
|
|
|
|
*
|
|
|
|
* At this stage, if we ignore the repair part, inside each data
|
|
|
|
* stripe it is no different than SINGLE profile.
|
|
|
|
* We can reuse scrub_simple_mirror() here, as the repair part
|
|
|
|
* is still based on @mirror_num.
|
|
|
|
*/
|
2023-01-16 15:04:13 +08:00
|
|
|
ret = scrub_simple_mirror(sctx, bg, map, logical, BTRFS_STRIPE_LEN,
|
2022-03-11 15:38:46 +08:00
|
|
|
scrub_dev, physical, 1);
|
2011-03-08 21:14:00 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
next:
|
|
|
|
logical += increment;
|
2023-02-17 13:36:58 +08:00
|
|
|
physical += BTRFS_STRIPE_LEN;
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
spin_lock(&sctx->stat_lock);
|
2013-04-27 10:56:57 +08:00
|
|
|
if (stop_loop)
|
2022-05-13 16:34:28 +08:00
|
|
|
sctx->stat.last_physical =
|
|
|
|
map->stripes[stripe_index].physical + dev_stripe_len;
|
2013-04-27 10:56:57 +08:00
|
|
|
else
|
|
|
|
sctx->stat.last_physical = physical;
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
spin_unlock(&sctx->stat_lock);
|
2013-04-27 10:56:57 +08:00
|
|
|
if (stop_loop)
|
|
|
|
break;
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
2012-11-06 18:43:11 +08:00
|
|
|
out:
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
ret2 = flush_scrub_stripes(sctx);
|
2023-06-14 14:49:35 +08:00
|
|
|
if (!ret)
|
btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-06 15:26:29 +08:00
|
|
|
ret = ret2;
|
2023-08-03 14:33:29 +08:00
|
|
|
btrfs_release_path(&sctx->extent_path);
|
2023-08-03 14:33:30 +08:00
|
|
|
btrfs_release_path(&sctx->csum_path);
|
2023-08-03 14:33:29 +08:00
|
|
|
|
2023-03-28 16:57:35 +08:00
|
|
|
if (sctx->raid56_data_stripes) {
|
|
|
|
for (int i = 0; i < nr_data_stripes(map); i++)
|
|
|
|
release_scrub_stripe(&sctx->raid56_data_stripes[i]);
|
|
|
|
kfree(sctx->raid56_data_stripes);
|
|
|
|
sctx->raid56_data_stripes = NULL;
|
|
|
|
}
|
2021-02-04 18:22:14 +08:00
|
|
|
|
|
|
|
if (sctx->is_dev_replace && ret >= 0) {
|
|
|
|
int ret2;
|
|
|
|
|
2021-12-15 14:59:42 +08:00
|
|
|
ret2 = sync_write_pointer_for_zoned(sctx,
|
|
|
|
chunk_logical + offset,
|
|
|
|
map->stripes[stripe_index].physical,
|
|
|
|
physical_end);
|
2021-02-04 18:22:14 +08:00
|
|
|
if (ret2)
|
|
|
|
ret = ret2;
|
|
|
|
}
|
|
|
|
|
2011-03-08 21:14:00 +08:00
|
|
|
return ret < 0 ? ret : 0;
|
|
|
|
}
|
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
|
2021-12-15 14:59:41 +08:00
|
|
|
struct btrfs_block_group *bg,
|
2012-11-02 20:26:57 +08:00
|
|
|
struct btrfs_device *scrub_dev,
|
2015-11-19 18:57:20 +08:00
|
|
|
u64 dev_offset,
|
2021-12-15 14:59:41 +08:00
|
|
|
u64 dev_extent_len)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
2016-06-23 06:54:56 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
struct btrfs_chunk_map *map;
|
2011-03-08 21:14:00 +08:00
|
|
|
int i;
|
2012-11-06 18:43:11 +08:00
|
|
|
int ret = 0;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
map = btrfs_find_chunk_map(fs_info, bg->start, bg->length);
|
|
|
|
if (!map) {
|
2015-11-19 18:57:20 +08:00
|
|
|
/*
|
|
|
|
* Might have been an unused block group deleted by the cleaner
|
|
|
|
* kthread or relocation.
|
|
|
|
*/
|
2021-12-15 14:59:41 +08:00
|
|
|
spin_lock(&bg->lock);
|
2022-07-16 03:45:24 +08:00
|
|
|
if (!test_bit(BLOCK_GROUP_FLAG_REMOVED, &bg->runtime_flags))
|
2015-11-19 18:57:20 +08:00
|
|
|
ret = -EINVAL;
|
2021-12-15 14:59:41 +08:00
|
|
|
spin_unlock(&bg->lock);
|
2015-11-19 18:57:20 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (map->start != bg->start)
|
2011-03-08 21:14:00 +08:00
|
|
|
goto out;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
if (map->chunk_len < dev_extent_len)
|
2011-03-08 21:14:00 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
for (i = 0; i < map->num_stripes; ++i) {
|
2012-11-02 20:26:57 +08:00
|
|
|
if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
|
2012-02-09 22:09:02 +08:00
|
|
|
map->stripes[i].physical == dev_offset) {
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
ret = scrub_stripe(sctx, bg, map, scrub_dev, i);
|
2011-03-08 21:14:00 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
out:
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 21:38:38 +08:00
|
|
|
btrfs_free_chunk_map(map);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-02-04 18:22:13 +08:00
|
|
|
static int finish_extent_writes_for_zoned(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group *cache)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = cache->fs_info;
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
|
|
|
|
if (!btrfs_is_zoned(fs_info))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
btrfs_wait_block_group_reservations(cache);
|
|
|
|
btrfs_wait_nocow_writers(cache);
|
|
|
|
btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
|
|
|
|
|
|
|
|
trans = btrfs_join_transaction(root);
|
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
return btrfs_commit_transaction(trans);
|
|
|
|
}
|
|
|
|
|
2011-03-08 21:14:00 +08:00
|
|
|
static noinline_for_stack
|
2012-11-02 20:26:57 +08:00
|
|
|
int scrub_enumerate_chunks(struct scrub_ctx *sctx,
|
2018-08-15 02:09:52 +08:00
|
|
|
struct btrfs_device *scrub_dev, u64 start, u64 end)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
|
|
|
struct btrfs_dev_extent *dev_extent = NULL;
|
|
|
|
struct btrfs_path *path;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
2011-03-08 21:14:00 +08:00
|
|
|
u64 chunk_offset;
|
2015-08-05 16:43:30 +08:00
|
|
|
int ret = 0;
|
btrfs: Continue replace when set_block_ro failed
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-17 18:46:17 +08:00
|
|
|
int ro_set;
|
2011-03-08 21:14:00 +08:00
|
|
|
int slot;
|
|
|
|
struct extent_buffer *l;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
2019-10-30 02:20:18 +08:00
|
|
|
struct btrfs_block_group *cache;
|
2012-11-06 18:43:11 +08:00
|
|
|
struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2015-11-27 23:31:35 +08:00
|
|
|
path->reada = READA_FORWARD;
|
2011-03-08 21:14:00 +08:00
|
|
|
path->search_commit_root = 1;
|
|
|
|
path->skip_locking = 1;
|
|
|
|
|
2012-11-02 20:26:57 +08:00
|
|
|
key.objectid = scrub_dev->devid;
|
2011-03-08 21:14:00 +08:00
|
|
|
key.offset = 0ull;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
|
|
|
|
|
|
|
while (1) {
|
2021-12-15 14:59:41 +08:00
|
|
|
u64 dev_extent_len;
|
|
|
|
|
2011-03-08 21:14:00 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
2011-06-03 16:09:26 +08:00
|
|
|
break;
|
|
|
|
if (ret > 0) {
|
|
|
|
if (path->slots[0] >=
|
|
|
|
btrfs_header_nritems(path->nodes[0])) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
2015-08-05 16:43:30 +08:00
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
if (ret > 0) {
|
|
|
|
ret = 0;
|
2011-06-03 16:09:26 +08:00
|
|
|
break;
|
2015-08-05 16:43:30 +08:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
ret = 0;
|
2011-06-03 16:09:26 +08:00
|
|
|
}
|
|
|
|
}
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
l = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(l, &found_key, slot);
|
|
|
|
|
2012-11-02 20:26:57 +08:00
|
|
|
if (found_key.objectid != scrub_dev->devid)
|
2011-03-08 21:14:00 +08:00
|
|
|
break;
|
|
|
|
|
2014-06-05 00:41:45 +08:00
|
|
|
if (found_key.type != BTRFS_DEV_EXTENT_KEY)
|
2011-03-08 21:14:00 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (found_key.offset >= end)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (found_key.offset < key.offset)
|
|
|
|
break;
|
|
|
|
|
|
|
|
dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
|
2021-12-15 14:59:41 +08:00
|
|
|
dev_extent_len = btrfs_dev_extent_length(l, dev_extent);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2021-12-15 14:59:41 +08:00
|
|
|
if (found_key.offset + dev_extent_len <= start)
|
2014-06-19 10:42:51 +08:00
|
|
|
goto skip;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* get a reference on the corresponding block group to prevent
|
|
|
|
* the chunk from going away while we scrub it
|
|
|
|
*/
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, chunk_offset);
|
2014-06-19 10:42:51 +08:00
|
|
|
|
|
|
|
/* some chunks are removed but not committed to disk yet,
|
|
|
|
* continue scrubbing */
|
|
|
|
if (!cache)
|
|
|
|
goto skip;
|
|
|
|
|
btrfs: fix assertion failure during scrub due to block group reallocation
During a scrub, or device replace, we can race with block group removal
and allocation and trigger the following assertion failure:
[7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
[7526.387351] ------------[ cut here ]------------
[7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
[7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
[7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
[7526.393520] Code: f3 48 c7 c7 20 (...)
[7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
[7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
[7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
[7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
[7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
[7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
[7526.402874] FS: 00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
[7526.404038] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
[7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7526.408169] Call Trace:
[7526.408529] <TASK>
[7526.408839] scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
[7526.409690] ? do_wait_intr_irq+0xb0/0xb0
[7526.410276] btrfs_scrub_dev+0x226/0x620 [btrfs]
[7526.410995] ? preempt_count_add+0x49/0xa0
[7526.411592] btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
[7526.412278] ? __fget_files+0xc9/0x1b0
[7526.412825] ? kvm_sched_clock_read+0x14/0x40
[7526.413459] ? lock_release+0x155/0x4a0
[7526.414022] ? __x64_sys_ioctl+0x83/0xb0
[7526.414601] __x64_sys_ioctl+0x83/0xb0
[7526.415150] do_syscall_64+0x3b/0xc0
[7526.415675] entry_SYSCALL_64_after_hwframe+0x44/0xae
[7526.416408] RIP: 0033:0x7f6c99d34397
[7526.416931] Code: 3c 1c e8 1c ff (...)
[7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
[7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
[7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
[7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
[7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
[7526.425950] </TASK>
That assertion is relatively new, introduced with commit d04fbe19aefd2
("btrfs: scrub: cleanup the argument list of scrub_chunk()").
The block group we get at scrub_enumerate_chunks() can actually have a
start address that is smaller then the chunk offset we extracted from a
device extent item we got from the commit root of the device tree.
This is very rare, but it can happen due to a race with block group
removal and allocation. For example, the following steps show how this
can happen:
1) We are at transaction T, and we have the following blocks groups,
sorted by their logical start address:
[ bg A, start address A, length 1G (data) ]
[ bg B, start address B, length 1G (data) ]
(...)
[ bg W, start address W, length 1G (data) ]
--> logical address space hole of 256M,
there used to be a 256M metadata block group here
[ bg Y, start address Y, length 256M (metadata) ]
--> Y matches W's end offset + 256M
Block group Y is the block group with the highest logical address in
the whole filesystem;
2) Block group Y is deleted and its extent mapping is removed by the call
to remove_extent_mapping() made from btrfs_remove_block_group().
So after this point, the last element of the mapping red black tree,
its rightmost node, is the mapping for block group W;
3) While still at transaction T, a new data block group is allocated,
with a length of 1G. When creating the block group we do a call to
find_next_chunk(), which returns the logical start address for the
new block group. This calls returns X, which corresponds to the
end offset of the last block group, the rightmost node in the mapping
red black tree (fs_info->mapping_tree), plus one.
So we get a new block group that starts at logical address X and with
a length of 1G. It spans over the whole logical range of the old block
group Y, that was previously removed in the same transaction.
However the device extent allocated to block group X is not the same
device extent that was used by block group Y, and it also does not
overlap that extent, which must be always the case because we allocate
extents by searching through the commit root of the device tree
(otherwise it could corrupt a filesystem after a power failure or
an unclean shutdown in general), so the extent allocator is behaving
as expected;
4) We have a task running scrub, currently at scrub_enumerate_chunks().
There it searches for device extent items in the device tree, using
its commit root. It finds a device extent item that was used by
block group Y, and it extracts the value Y from that item into the
local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();
It then calls btrfs_lookup_block_group() to find block group for
the logical address Y - since there's currently no block group that
starts at that logical address, it returns block group X, because
its range contains Y.
This results in triggering the assertion:
ASSERT(cache->start == chunk_offset);
right before calling scrub_chunk(), as cache->start is X and
chunk_offset is Y.
This is more likely to happen of filesystems not larger than 50G, because
for these filesystems we use a 256M size for metadata block groups and
a 1G size for data block groups, while for filesystems larger than 50G,
we use a 1G size for both data and metadata block groups (except for
zoned filesystems). It could also happen on any filesystem size due to
the fact that system block groups are always smaller (32M) than both
data and metadata block groups, but these are not frequently deleted, so
much less likely to trigger the race.
So make scrub skip any block group with a start offset that is less than
the value we expect, as that means it's a new block group that was created
in the current transaction. It's pointless to continue and try to scrub
its extents, because scrub searches for extents using the commit root, so
it won't find any. For a device replace, skip it as well for the same
reasons, and we don't need to worry about the possibility of extents of
the new block group not being to the new device, because we have the write
duplication setup done through btrfs_map_block().
Fixes: d04fbe19aefd ("btrfs: scrub: cleanup the argument list of scrub_chunk()")
CC: stable@vger.kernel.org # 5.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19 21:23:57 +08:00
|
|
|
ASSERT(cache->start <= chunk_offset);
|
|
|
|
/*
|
|
|
|
* We are using the commit root to search for device extents, so
|
|
|
|
* that means we could have found a device extent item from a
|
|
|
|
* block group that was deleted in the current transaction. The
|
|
|
|
* logical start offset of the deleted block group, stored at
|
|
|
|
* @chunk_offset, might be part of the logical address range of
|
|
|
|
* a new block group (which uses different physical extents).
|
|
|
|
* In this case btrfs_lookup_block_group() has returned the new
|
|
|
|
* block group, and its start address is less than @chunk_offset.
|
|
|
|
*
|
|
|
|
* We skip such new block groups, because it's pointless to
|
|
|
|
* process them, as we won't find their extents because we search
|
|
|
|
* for them using the commit root of the extent tree. For a device
|
|
|
|
* replace it's also fine to skip it, we won't miss copying them
|
|
|
|
* to the target device because we have the write duplication
|
|
|
|
* setup through the regular write path (by btrfs_map_block()),
|
|
|
|
* and we have committed a transaction when we started the device
|
|
|
|
* replace, right after setting up the device replace state.
|
|
|
|
*/
|
|
|
|
if (cache->start < chunk_offset) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
goto skip;
|
|
|
|
}
|
|
|
|
|
2021-02-04 18:22:11 +08:00
|
|
|
if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
|
2022-07-16 03:45:24 +08:00
|
|
|
if (!test_bit(BLOCK_GROUP_FLAG_TO_COPY, &cache->runtime_flags)) {
|
2021-04-14 21:05:26 +08:00
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
goto skip;
|
2021-02-04 18:22:11 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: fix a race between scrub and block group removal/allocation
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-08 18:01:10 +08:00
|
|
|
/*
|
|
|
|
* Make sure that while we are scrubbing the corresponding block
|
|
|
|
* group doesn't get its logical address and its device extents
|
|
|
|
* reused for another block group, which can possibly be of a
|
|
|
|
* different type and different profile. We do this to prevent
|
|
|
|
* false error detections and crashes due to bogus attempts to
|
|
|
|
* repair extents.
|
|
|
|
*/
|
|
|
|
spin_lock(&cache->lock);
|
2022-07-16 03:45:24 +08:00
|
|
|
if (test_bit(BLOCK_GROUP_FLAG_REMOVED, &cache->runtime_flags)) {
|
btrfs: fix a race between scrub and block group removal/allocation
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-08 18:01:10 +08:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
goto skip;
|
|
|
|
}
|
2020-05-08 18:01:47 +08:00
|
|
|
btrfs_freeze_block_group(cache);
|
btrfs: fix a race between scrub and block group removal/allocation
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-08 18:01:10 +08:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
|
2015-08-05 16:43:30 +08:00
|
|
|
/*
|
|
|
|
* we need call btrfs_inc_block_group_ro() with scrubs_paused,
|
|
|
|
* to avoid deadlock caused by:
|
|
|
|
* btrfs_inc_block_group_ro()
|
|
|
|
* -> btrfs_wait_for_commit()
|
|
|
|
* -> btrfs_commit_transaction()
|
|
|
|
* -> btrfs_scrub_pause()
|
|
|
|
*/
|
|
|
|
scrub_pause_on(fs_info);
|
btrfs: scrub: Don't check free space before marking a block group RO
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-15 10:09:00 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Don't do chunk preallocation for scrub.
|
|
|
|
*
|
|
|
|
* This is especially important for SYSTEM bgs, or we can hit
|
|
|
|
* -EFBIG from btrfs_finish_chunk_alloc() like:
|
|
|
|
* 1. The only SYSTEM bg is marked RO.
|
|
|
|
* Since SYSTEM bg is small, that's pretty common.
|
|
|
|
* 2. New SYSTEM bg will be allocated
|
|
|
|
* Due to regular version will allocate new chunk.
|
|
|
|
* 3. New SYSTEM bg is empty and will get cleaned up
|
|
|
|
* Before cleanup really happens, it's marked RO again.
|
|
|
|
* 4. Empty SYSTEM bg get scrubbed
|
|
|
|
* We go back to 2.
|
|
|
|
*
|
|
|
|
* This can easily boost the amount of SYSTEM chunks if cleaner
|
|
|
|
* thread can't be triggered fast enough, and use up all space
|
|
|
|
* of btrfs_super_block::sys_chunk_array
|
btrfs: scrub: Require mandatory block group RO for dev-replace
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.
The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.
The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")
[CAUSE]
Dev-replace has two source of writes:
- Write duplication
All writes to source device will also be duplicated to target device.
Content: Not yet persisted data/meta
- Scrub copy
Dev-replace reused scrub code to iterate through existing extents, and
copy the verified data to target device.
Content: Previously persisted data and metadata
The difference in contents makes the following race possible:
Regular Writer | Dev-replace
-----------------------------------------------------------------
^ |
| Preallocate one data extent |
| at bytenr X, len 1M |
v |
^ Commit transaction |
| Now extent [X, X+1M) is in |
v commit root |
================== Dev replace starts =========================
| ^
| | Scrub extent [X, X+1M)
| | Read [X, X+1M)
| | (The content are mostly garbage
| | since it's preallocated)
^ | v
| Write back happens for |
| extent [X, X+512K) |
| New data writes to both |
| source and target dev. |
v |
| ^
| | Scrub writes back extent [X, X+1M)
| | to target device.
| | This will over write the new data in
| | [X, X+512K)
| v
This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).
This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.
*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
-f collapse=0 -f punch=0 -f resvsp=0"
I didn't expect resvsp ioctl will fallback to fallocate in VFS...
[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().
This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.
The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-24 07:58:20 +08:00
|
|
|
*
|
|
|
|
* While for dev replace, we need to try our best to mark block
|
|
|
|
* group RO, to prevent race between:
|
|
|
|
* - Write duplication
|
|
|
|
* Contains latest data
|
|
|
|
* - Scrub copy
|
|
|
|
* Contains data from commit tree
|
|
|
|
*
|
|
|
|
* If target block group is not marked RO, nocow writes can
|
|
|
|
* be overwritten by scrub copy, causing data corruption.
|
|
|
|
* So for dev-replace, it's not allowed to continue if a block
|
|
|
|
* group is not RO.
|
btrfs: scrub: Don't check free space before marking a block group RO
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-15 10:09:00 +08:00
|
|
|
*/
|
btrfs: scrub: Require mandatory block group RO for dev-replace
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.
The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.
The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")
[CAUSE]
Dev-replace has two source of writes:
- Write duplication
All writes to source device will also be duplicated to target device.
Content: Not yet persisted data/meta
- Scrub copy
Dev-replace reused scrub code to iterate through existing extents, and
copy the verified data to target device.
Content: Previously persisted data and metadata
The difference in contents makes the following race possible:
Regular Writer | Dev-replace
-----------------------------------------------------------------
^ |
| Preallocate one data extent |
| at bytenr X, len 1M |
v |
^ Commit transaction |
| Now extent [X, X+1M) is in |
v commit root |
================== Dev replace starts =========================
| ^
| | Scrub extent [X, X+1M)
| | Read [X, X+1M)
| | (The content are mostly garbage
| | since it's preallocated)
^ | v
| Write back happens for |
| extent [X, X+512K) |
| New data writes to both |
| source and target dev. |
v |
| ^
| | Scrub writes back extent [X, X+1M)
| | to target device.
| | This will over write the new data in
| | [X, X+512K)
| v
This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).
This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.
*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
-f collapse=0 -f punch=0 -f resvsp=0"
I didn't expect resvsp ioctl will fallback to fallocate in VFS...
[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().
This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.
The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-24 07:58:20 +08:00
|
|
|
ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
|
2021-02-04 18:22:13 +08:00
|
|
|
if (!ret && sctx->is_dev_replace) {
|
|
|
|
ret = finish_extent_writes_for_zoned(root, cache);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_dec_block_group_ro(cache);
|
|
|
|
scrub_pause_off(fs_info);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: Continue replace when set_block_ro failed
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-17 18:46:17 +08:00
|
|
|
if (ret == 0) {
|
|
|
|
ro_set = 1;
|
2023-04-13 13:57:17 +08:00
|
|
|
} else if (ret == -ENOSPC && !sctx->is_dev_replace &&
|
|
|
|
!(cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
|
btrfs: Continue replace when set_block_ro failed
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-17 18:46:17 +08:00
|
|
|
/*
|
|
|
|
* btrfs_inc_block_group_ro return -ENOSPC when it
|
|
|
|
* failed in creating new chunk for metadata.
|
btrfs: scrub: Require mandatory block group RO for dev-replace
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.
The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.
The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")
[CAUSE]
Dev-replace has two source of writes:
- Write duplication
All writes to source device will also be duplicated to target device.
Content: Not yet persisted data/meta
- Scrub copy
Dev-replace reused scrub code to iterate through existing extents, and
copy the verified data to target device.
Content: Previously persisted data and metadata
The difference in contents makes the following race possible:
Regular Writer | Dev-replace
-----------------------------------------------------------------
^ |
| Preallocate one data extent |
| at bytenr X, len 1M |
v |
^ Commit transaction |
| Now extent [X, X+1M) is in |
v commit root |
================== Dev replace starts =========================
| ^
| | Scrub extent [X, X+1M)
| | Read [X, X+1M)
| | (The content are mostly garbage
| | since it's preallocated)
^ | v
| Write back happens for |
| extent [X, X+512K) |
| New data writes to both |
| source and target dev. |
v |
| ^
| | Scrub writes back extent [X, X+1M)
| | to target device.
| | This will over write the new data in
| | [X, X+512K)
| v
This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).
This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.
*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
-f collapse=0 -f punch=0 -f resvsp=0"
I didn't expect resvsp ioctl will fallback to fallocate in VFS...
[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().
This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.
The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-24 07:58:20 +08:00
|
|
|
* It is not a problem for scrub, because
|
btrfs: Continue replace when set_block_ro failed
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-17 18:46:17 +08:00
|
|
|
* metadata are always cowed, and our scrub paused
|
|
|
|
* commit_transactions.
|
2023-04-13 13:57:17 +08:00
|
|
|
*
|
|
|
|
* For RAID56 chunks, we have to mark them read-only
|
|
|
|
* for scrub, as later we would use our own cache
|
|
|
|
* out of RAID56 realm.
|
|
|
|
* Thus we want the RAID56 bg to be marked RO to
|
|
|
|
* prevent RMW from screwing up out cache.
|
btrfs: Continue replace when set_block_ro failed
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-17 18:46:17 +08:00
|
|
|
*/
|
|
|
|
ro_set = 0;
|
btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-05 20:55:37 +08:00
|
|
|
} else if (ret == -ETXTBSY) {
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"skipping scrub of block group %llu due to active swapfile",
|
|
|
|
cache->start);
|
|
|
|
scrub_pause_off(fs_info);
|
|
|
|
ret = 0;
|
|
|
|
goto skip_unfreeze;
|
btrfs: Continue replace when set_block_ro failed
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-17 18:46:17 +08:00
|
|
|
} else {
|
2016-09-20 22:05:00 +08:00
|
|
|
btrfs_warn(fs_info,
|
2017-07-13 21:32:18 +08:00
|
|
|
"failed setting block group ro: %d", ret);
|
2020-05-08 18:01:47 +08:00
|
|
|
btrfs_unfreeze_block_group(cache);
|
2015-08-05 16:43:30 +08:00
|
|
|
btrfs_put_block_group(cache);
|
btrfs: scrub: Require mandatory block group RO for dev-replace
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.
The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.
The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")
[CAUSE]
Dev-replace has two source of writes:
- Write duplication
All writes to source device will also be duplicated to target device.
Content: Not yet persisted data/meta
- Scrub copy
Dev-replace reused scrub code to iterate through existing extents, and
copy the verified data to target device.
Content: Previously persisted data and metadata
The difference in contents makes the following race possible:
Regular Writer | Dev-replace
-----------------------------------------------------------------
^ |
| Preallocate one data extent |
| at bytenr X, len 1M |
v |
^ Commit transaction |
| Now extent [X, X+1M) is in |
v commit root |
================== Dev replace starts =========================
| ^
| | Scrub extent [X, X+1M)
| | Read [X, X+1M)
| | (The content are mostly garbage
| | since it's preallocated)
^ | v
| Write back happens for |
| extent [X, X+512K) |
| New data writes to both |
| source and target dev. |
v |
| ^
| | Scrub writes back extent [X, X+1M)
| | to target device.
| | This will over write the new data in
| | [X, X+512K)
| v
This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).
This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.
*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
-f collapse=0 -f punch=0 -f resvsp=0"
I didn't expect resvsp ioctl will fallback to fallocate in VFS...
[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().
This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.
The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-24 07:58:20 +08:00
|
|
|
scrub_pause_off(fs_info);
|
2015-08-05 16:43:30 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
btrfs: scrub: Require mandatory block group RO for dev-replace
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.
The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.
The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
check free space before marking a block group RO")
[CAUSE]
Dev-replace has two source of writes:
- Write duplication
All writes to source device will also be duplicated to target device.
Content: Not yet persisted data/meta
- Scrub copy
Dev-replace reused scrub code to iterate through existing extents, and
copy the verified data to target device.
Content: Previously persisted data and metadata
The difference in contents makes the following race possible:
Regular Writer | Dev-replace
-----------------------------------------------------------------
^ |
| Preallocate one data extent |
| at bytenr X, len 1M |
v |
^ Commit transaction |
| Now extent [X, X+1M) is in |
v commit root |
================== Dev replace starts =========================
| ^
| | Scrub extent [X, X+1M)
| | Read [X, X+1M)
| | (The content are mostly garbage
| | since it's preallocated)
^ | v
| Write back happens for |
| extent [X, X+512K) |
| New data writes to both |
| source and target dev. |
v |
| ^
| | Scrub writes back extent [X, X+1M)
| | to target device.
| | This will over write the new data in
| | [X, X+512K)
| v
This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).
This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.
*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
-f collapse=0 -f punch=0 -f resvsp=0"
I didn't expect resvsp ioctl will fallback to fallocate in VFS...
[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().
This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.
The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-24 07:58:20 +08:00
|
|
|
/*
|
|
|
|
* Now the target block is marked RO, wait for nocow writes to
|
|
|
|
* finish before dev-replace.
|
|
|
|
* COW is fine, as COW never overwrites extents in commit tree.
|
|
|
|
*/
|
|
|
|
if (sctx->is_dev_replace) {
|
|
|
|
btrfs_wait_nocow_writers(cache);
|
|
|
|
btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
|
|
|
|
cache->length);
|
|
|
|
}
|
|
|
|
|
|
|
|
scrub_pause_off(fs_info);
|
2019-10-31 18:55:01 +08:00
|
|
|
down_write(&dev_replace->rwsem);
|
2021-12-15 14:59:41 +08:00
|
|
|
dev_replace->cursor_right = found_key.offset + dev_extent_len;
|
2012-11-06 18:43:11 +08:00
|
|
|
dev_replace->cursor_left = found_key.offset;
|
|
|
|
dev_replace->item_needs_writeback = 1;
|
2018-09-07 22:11:23 +08:00
|
|
|
up_write(&dev_replace->rwsem);
|
|
|
|
|
2021-12-15 14:59:41 +08:00
|
|
|
ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
|
|
|
|
dev_extent_len);
|
2021-02-04 18:22:11 +08:00
|
|
|
if (sctx->is_dev_replace &&
|
|
|
|
!btrfs_finish_block_group_to_copy(dev_replace->srcdev,
|
|
|
|
cache, found_key.offset))
|
|
|
|
ro_set = 0;
|
|
|
|
|
2019-10-31 18:55:01 +08:00
|
|
|
down_write(&dev_replace->rwsem);
|
2016-05-15 02:44:40 +08:00
|
|
|
dev_replace->cursor_left = dev_replace->cursor_right;
|
|
|
|
dev_replace->item_needs_writeback = 1;
|
2019-10-31 18:55:01 +08:00
|
|
|
up_write(&dev_replace->rwsem);
|
2016-05-15 02:44:40 +08:00
|
|
|
|
btrfs: Continue replace when set_block_ro failed
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-17 18:46:17 +08:00
|
|
|
if (ro_set)
|
2016-06-23 06:54:24 +08:00
|
|
|
btrfs_dec_block_group_ro(cache);
|
2012-11-06 18:43:11 +08:00
|
|
|
|
2015-11-19 19:45:48 +08:00
|
|
|
/*
|
|
|
|
* We might have prevented the cleaner kthread from deleting
|
|
|
|
* this block group if it was already unused because we raced
|
|
|
|
* and set it to RO mode first. So add it back to the unused
|
|
|
|
* list, otherwise it might not ever be deleted unless a manual
|
|
|
|
* balance is triggered or it becomes used and unused again.
|
|
|
|
*/
|
|
|
|
spin_lock(&cache->lock);
|
2022-07-16 03:45:24 +08:00
|
|
|
if (!test_bit(BLOCK_GROUP_FLAG_REMOVED, &cache->runtime_flags) &&
|
|
|
|
!cache->ro && cache->reserved == 0 && cache->used == 0) {
|
2015-11-19 19:45:48 +08:00
|
|
|
spin_unlock(&cache->lock);
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-14 08:22:15 +08:00
|
|
|
if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
|
|
|
|
btrfs_discard_queue_work(&fs_info->discard_ctl,
|
|
|
|
cache);
|
|
|
|
else
|
|
|
|
btrfs_mark_bg_unused(cache);
|
2015-11-19 19:45:48 +08:00
|
|
|
} else {
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
}
|
btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-05 20:55:37 +08:00
|
|
|
skip_unfreeze:
|
2020-05-08 18:01:47 +08:00
|
|
|
btrfs_unfreeze_block_group(cache);
|
2011-03-08 21:14:00 +08:00
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
if (ret)
|
|
|
|
break;
|
2018-08-15 02:09:52 +08:00
|
|
|
if (sctx->is_dev_replace &&
|
2012-11-28 01:39:51 +08:00
|
|
|
atomic64_read(&dev_replace->num_write_errors) > 0) {
|
2012-11-06 18:43:11 +08:00
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (sctx->stat.malloc_errors > 0) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
2014-06-19 10:42:51 +08:00
|
|
|
skip:
|
2021-12-15 14:59:41 +08:00
|
|
|
key.offset = found_key.offset + dev_extent_len;
|
2011-05-23 18:30:52 +08:00
|
|
|
btrfs_release_path(path);
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_free_path(path);
|
2011-06-03 16:09:26 +08:00
|
|
|
|
2015-08-05 16:43:30 +08:00
|
|
|
return ret;
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
|
|
|
|
2023-03-20 10:12:47 +08:00
|
|
|
static int scrub_one_super(struct scrub_ctx *sctx, struct btrfs_device *dev,
|
|
|
|
struct page *page, u64 physical, u64 generation)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
|
|
|
struct bio_vec bvec;
|
|
|
|
struct bio bio;
|
|
|
|
struct btrfs_super_block *sb = page_address(page);
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
bio_init(&bio, dev->bdev, &bvec, 1, REQ_OP_READ);
|
|
|
|
bio.bi_iter.bi_sector = physical >> SECTOR_SHIFT;
|
|
|
|
__bio_add_page(&bio, page, BTRFS_SUPER_INFO_SIZE, 0);
|
|
|
|
ret = submit_bio_wait(&bio);
|
|
|
|
bio_uninit(&bio);
|
|
|
|
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
ret = btrfs_check_super_csum(fs_info, sb);
|
|
|
|
if (ret != 0) {
|
|
|
|
btrfs_err_rl(fs_info,
|
|
|
|
"super block at physical %llu devid %llu has bad csum",
|
|
|
|
physical, dev->devid);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
if (btrfs_super_generation(sb) != generation) {
|
|
|
|
btrfs_err_rl(fs_info,
|
|
|
|
"super block at physical %llu devid %llu has bad generation %llu expect %llu",
|
|
|
|
physical, dev->devid,
|
|
|
|
btrfs_super_generation(sb), generation);
|
|
|
|
return -EUCLEAN;
|
|
|
|
}
|
|
|
|
|
|
|
|
return btrfs_validate_super(fs_info, sb, -1);
|
|
|
|
}
|
|
|
|
|
2012-11-02 20:26:57 +08:00
|
|
|
static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
|
|
|
|
struct btrfs_device *scrub_dev)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
u64 bytenr;
|
|
|
|
u64 gen;
|
2023-03-20 10:12:47 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct page *page;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = sctx->fs_info;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2021-10-06 04:35:25 +08:00
|
|
|
if (BTRFS_FS_ERROR(fs_info))
|
btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases
Eric reported seeing this message while running generic/475
BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
Full stack trace:
BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-0): forced readonly
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
------------[ cut here ]------------
BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
BTRFS: Transaction aborted (error -117)
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
Call Trace:
? finish_wait+0x90/0x90
? __mutex_unlock_slowpath+0x45/0x2a0
? lock_acquire+0xa3/0x440
? lockref_put_or_lock+0x9/0x30
? dput+0x20/0x4a0
? dput+0x20/0x4a0
? do_raw_spin_unlock+0x4b/0xc0
? _raw_spin_unlock+0x1f/0x30
btrfs_sync_file+0x335/0x490 [btrfs]
do_fsync+0x38/0x70
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x50/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6a6ef1b6e3
Code: Bad RIP value.
RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace af146e0e38433456 ]---
BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
This ret came from btrfs_write_marked_extents(). If we get an aborted
transaction via EIO before, we'll see it in btree_write_cache_pages()
and return EUCLEAN, which gets printed as "Filesystem corrupted".
Except we shouldn't be returning EUCLEAN here, we need to be returning
EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
elsewhere, but we want to use EROFS for this particular case. The
original transaction abort has the real error code for why we ended up
with an aborted transaction, all subsequent actions just need to return
EROFS because they may not have a trans handle and have no idea about
the original cause of the abort.
After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
stacktrace will not be dumped either.
Reported-by: Eric Sandeen <esandeen@redhat.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add full test stacktrace ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-21 22:38:37 +08:00
|
|
|
return -EROFS;
|
2012-03-12 23:03:00 +08:00
|
|
|
|
2023-03-20 10:12:47 +08:00
|
|
|
page = alloc_page(GFP_KERNEL);
|
|
|
|
if (!page) {
|
|
|
|
spin_lock(&sctx->stat_lock);
|
|
|
|
sctx->stat.malloc_errors++;
|
|
|
|
spin_unlock(&sctx->stat_lock);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2014-07-24 11:37:09 +08:00
|
|
|
/* Seed devices of a new filesystem has their own generation. */
|
2016-06-23 06:54:23 +08:00
|
|
|
if (scrub_dev->fs_devices != fs_info->fs_devices)
|
2014-07-24 11:37:09 +08:00
|
|
|
gen = scrub_dev->generation;
|
|
|
|
else
|
2023-10-04 18:38:51 +08:00
|
|
|
gen = btrfs_get_last_trans_committed(fs_info);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
|
2024-02-26 23:39:13 +08:00
|
|
|
ret = btrfs_sb_log_location(scrub_dev, i, 0, &bytenr);
|
|
|
|
if (ret == -ENOENT)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (ret) {
|
|
|
|
spin_lock(&sctx->stat_lock);
|
|
|
|
sctx->stat.super_errors++;
|
|
|
|
spin_unlock(&sctx->stat_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2014-09-03 21:35:33 +08:00
|
|
|
if (bytenr + BTRFS_SUPER_INFO_SIZE >
|
|
|
|
scrub_dev->commit_total_bytes)
|
2011-03-08 21:14:00 +08:00
|
|
|
break;
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 19:26:14 +08:00
|
|
|
if (!btrfs_check_super_location(scrub_dev, bytenr))
|
|
|
|
continue;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2023-03-20 10:12:47 +08:00
|
|
|
ret = scrub_one_super(sctx, scrub_dev, page, bytenr, gen);
|
|
|
|
if (ret) {
|
|
|
|
spin_lock(&sctx->stat_lock);
|
|
|
|
sctx->stat.super_errors++;
|
|
|
|
spin_unlock(&sctx->stat_lock);
|
|
|
|
}
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
2023-03-20 10:12:47 +08:00
|
|
|
__free_page(page);
|
2011-03-08 21:14:00 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
static void scrub_workers_put(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
if (refcount_dec_and_mutex_lock(&fs_info->scrub_workers_refcnt,
|
|
|
|
&fs_info->scrub_lock)) {
|
2022-04-18 12:43:10 +08:00
|
|
|
struct workqueue_struct *scrub_workers = fs_info->scrub_workers;
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
|
|
|
|
fs_info->scrub_workers = NULL;
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
|
2022-04-18 12:43:10 +08:00
|
|
|
if (scrub_workers)
|
|
|
|
destroy_workqueue(scrub_workers);
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-03-08 21:14:00 +08:00
|
|
|
/*
|
|
|
|
* get a reference count on fs_info->scrub_workers. start worker if necessary
|
|
|
|
*/
|
2023-08-03 14:33:32 +08:00
|
|
|
static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
2022-04-18 12:43:10 +08:00
|
|
|
struct workqueue_struct *scrub_workers = NULL;
|
2015-02-17 01:34:01 +08:00
|
|
|
unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND;
|
2014-02-28 10:46:17 +08:00
|
|
|
int max_active = fs_info->thread_pool_size;
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
int ret = -ENOMEM;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt))
|
|
|
|
return 0;
|
2019-01-30 14:45:01 +08:00
|
|
|
|
2023-08-03 14:33:32 +08:00
|
|
|
scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active);
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
if (!scrub_workers)
|
2023-06-12 15:23:29 +08:00
|
|
|
return -ENOMEM;
|
2019-01-30 14:45:02 +08:00
|
|
|
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
|
|
|
if (refcount_read(&fs_info->scrub_workers_refcnt) == 0) {
|
2023-06-12 15:23:29 +08:00
|
|
|
ASSERT(fs_info->scrub_workers == NULL);
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
fs_info->scrub_workers = scrub_workers;
|
2019-01-30 14:45:02 +08:00
|
|
|
refcount_set(&fs_info->scrub_workers_refcnt, 1);
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
return 0;
|
2011-06-10 18:07:07 +08:00
|
|
|
}
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
/* Other thread raced in and created the workers for us */
|
|
|
|
refcount_inc(&fs_info->scrub_workers_refcnt);
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
2015-06-12 20:36:58 +08:00
|
|
|
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
ret = 0;
|
2023-03-29 13:55:20 +08:00
|
|
|
|
2022-04-18 12:43:10 +08:00
|
|
|
destroy_workqueue(scrub_workers);
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
return ret;
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
|
|
|
|
2012-11-06 00:03:39 +08:00
|
|
|
int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
|
|
|
|
u64 end, struct btrfs_scrub_progress *progress,
|
2012-11-06 01:29:28 +08:00
|
|
|
int readonly, int is_dev_replace)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
struct btrfs_dev_lookup_args args = { .devid = devid };
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
struct scrub_ctx *sctx;
|
2011-03-08 21:14:00 +08:00
|
|
|
int ret;
|
|
|
|
struct btrfs_device *dev;
|
Btrfs: fix deadlock with memory reclaim during scrub
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.
Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().
We also can not have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).
So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.
Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable@vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-27 04:07:17 +08:00
|
|
|
unsigned int nofs_flag;
|
2022-08-02 14:53:03 +08:00
|
|
|
bool need_commit = false;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2012-11-06 00:03:39 +08:00
|
|
|
if (btrfs_fs_closing(fs_info))
|
2019-02-26 02:57:41 +08:00
|
|
|
return -EAGAIN;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2022-08-10 13:58:57 +08:00
|
|
|
/* At mount time we have ensured nodesize is in the range of [4K, 64K]. */
|
|
|
|
ASSERT(fs_info->nodesize <= BTRFS_STRIPE_LEN);
|
2012-03-28 02:21:27 +08:00
|
|
|
|
2022-08-10 13:58:57 +08:00
|
|
|
/*
|
|
|
|
* SCRUB_MAX_SECTORS_PER_BLOCK is calculated using the largest possible
|
|
|
|
* value (max nodesize / min sectorsize), thus nodesize should always
|
|
|
|
* be fine.
|
|
|
|
*/
|
|
|
|
ASSERT(fs_info->nodesize <=
|
|
|
|
SCRUB_MAX_SECTORS_PER_BLOCK << fs_info->sectorsize_bits);
|
2012-11-02 21:58:04 +08:00
|
|
|
|
2018-12-04 23:11:56 +08:00
|
|
|
/* Allocate outside of device_list_mutex */
|
|
|
|
sctx = scrub_setup_ctx(fs_info, is_dev_replace);
|
|
|
|
if (IS_ERR(sctx))
|
|
|
|
return PTR_ERR(sctx);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2023-08-03 14:33:32 +08:00
|
|
|
ret = scrub_workers_get(fs_info);
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
if (ret)
|
|
|
|
goto out_free_ctx;
|
|
|
|
|
2012-11-06 00:03:39 +08:00
|
|
|
mutex_lock(&fs_info->fs_devices->device_list_mutex);
|
2021-10-06 04:12:42 +08:00
|
|
|
dev = btrfs_find_device(fs_info->fs_devices, &args);
|
2017-12-04 12:54:54 +08:00
|
|
|
if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
|
|
|
|
!is_dev_replace)) {
|
2012-11-06 00:03:39 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2018-12-04 23:11:56 +08:00
|
|
|
ret = -ENODEV;
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
goto out;
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!is_dev_replace && !readonly &&
|
|
|
|
!test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
|
2014-07-24 11:37:07 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2020-07-09 17:25:40 +08:00
|
|
|
btrfs_err_in_rcu(fs_info,
|
|
|
|
"scrub on devid %llu: filesystem on %s is not writable",
|
2022-11-13 09:32:07 +08:00
|
|
|
devid, btrfs_dev_name(dev));
|
2018-12-04 23:11:56 +08:00
|
|
|
ret = -EROFS;
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
goto out;
|
2014-07-24 11:37:07 +08:00
|
|
|
}
|
|
|
|
|
2013-10-12 02:11:12 +08:00
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
2017-12-04 12:54:53 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
|
2017-12-04 12:54:55 +08:00
|
|
|
test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &dev->dev_state)) {
|
2011-03-08 21:14:00 +08:00
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
2012-11-06 00:03:39 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2018-12-04 23:11:56 +08:00
|
|
|
ret = -EIO;
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
goto out;
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
|
|
|
|
2018-09-07 22:11:23 +08:00
|
|
|
down_read(&fs_info->dev_replace.rwsem);
|
2018-01-03 16:08:30 +08:00
|
|
|
if (dev->scrub_ctx ||
|
2012-11-06 20:15:27 +08:00
|
|
|
(!is_dev_replace &&
|
|
|
|
btrfs_dev_replace_is_ongoing(&fs_info->dev_replace))) {
|
2018-09-07 22:11:23 +08:00
|
|
|
up_read(&fs_info->dev_replace.rwsem);
|
2011-03-08 21:14:00 +08:00
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
2012-11-06 00:03:39 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2018-12-04 23:11:56 +08:00
|
|
|
ret = -EINPROGRESS;
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
goto out;
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|
2018-09-07 22:11:23 +08:00
|
|
|
up_read(&fs_info->dev_replace.rwsem);
|
2013-10-12 02:11:12 +08:00
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
sctx->readonly = readonly;
|
2018-01-03 16:08:30 +08:00
|
|
|
dev->scrub_ctx = sctx;
|
2013-12-04 21:15:19 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2013-12-04 21:15:19 +08:00
|
|
|
/*
|
|
|
|
* checking @scrub_pause_req here, we can avoid
|
|
|
|
* race between committing transaction and scrubbing.
|
|
|
|
*/
|
2013-12-04 21:16:53 +08:00
|
|
|
__scrub_blocked_if_needed(fs_info);
|
2011-03-08 21:14:00 +08:00
|
|
|
atomic_inc(&fs_info->scrubs_running);
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
|
Btrfs: fix deadlock with memory reclaim during scrub
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.
Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().
We also can not have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).
So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.
Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable@vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-27 04:07:17 +08:00
|
|
|
/*
|
|
|
|
* In order to avoid deadlock with reclaim when there is a transaction
|
|
|
|
* trying to pause scrub, make sure we use GFP_NOFS for all the
|
2022-03-13 18:40:01 +08:00
|
|
|
* allocations done at btrfs_scrub_sectors() and scrub_sectors_for_parity()
|
Btrfs: fix deadlock with memory reclaim during scrub
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.
Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().
We also can not have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).
So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.
Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable@vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-27 04:07:17 +08:00
|
|
|
* invoked by our callees. The pausing request is done when the
|
|
|
|
* transaction commit starts, and it blocks the transaction until scrub
|
|
|
|
* is paused (done at specific points at scrub_stripe() or right above
|
|
|
|
* before incrementing fs_info->scrubs_running).
|
|
|
|
*/
|
|
|
|
nofs_flag = memalloc_nofs_save();
|
2012-11-06 18:43:11 +08:00
|
|
|
if (!is_dev_replace) {
|
2022-08-02 14:53:03 +08:00
|
|
|
u64 old_super_errors;
|
|
|
|
|
|
|
|
spin_lock(&sctx->stat_lock);
|
|
|
|
old_super_errors = sctx->stat.super_errors;
|
|
|
|
spin_unlock(&sctx->stat_lock);
|
|
|
|
|
2019-01-03 16:17:40 +08:00
|
|
|
btrfs_info(fs_info, "scrub: started on devid %llu", devid);
|
2013-10-25 19:12:02 +08:00
|
|
|
/*
|
|
|
|
* by holding device list mutex, we can
|
|
|
|
* kick off writing super in log tree sync.
|
|
|
|
*/
|
2013-12-04 21:15:19 +08:00
|
|
|
mutex_lock(&fs_info->fs_devices->device_list_mutex);
|
2012-11-06 18:43:11 +08:00
|
|
|
ret = scrub_supers(sctx, dev);
|
2013-12-04 21:15:19 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2022-08-02 14:53:03 +08:00
|
|
|
|
|
|
|
spin_lock(&sctx->stat_lock);
|
|
|
|
/*
|
|
|
|
* Super block errors found, but we can not commit transaction
|
|
|
|
* at current context, since btrfs_commit_transaction() needs
|
|
|
|
* to pause the current running scrub (hold by ourselves).
|
|
|
|
*/
|
|
|
|
if (sctx->stat.super_errors > old_super_errors && !sctx->readonly)
|
|
|
|
need_commit = true;
|
|
|
|
spin_unlock(&sctx->stat_lock);
|
2012-11-06 18:43:11 +08:00
|
|
|
}
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
if (!ret)
|
2018-08-15 02:09:52 +08:00
|
|
|
ret = scrub_enumerate_chunks(sctx, dev, start, end);
|
Btrfs: fix deadlock with memory reclaim during scrub
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.
Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().
We also can not have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).
So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.
Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable@vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-27 04:07:17 +08:00
|
|
|
memalloc_nofs_restore(nofs_flag);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
atomic_dec(&fs_info->scrubs_running);
|
|
|
|
wake_up(&fs_info->scrub_pause_wait);
|
|
|
|
|
|
|
|
if (progress)
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
memcpy(progress, &sctx->stat, sizeof(*progress));
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2019-01-03 16:17:40 +08:00
|
|
|
if (!is_dev_replace)
|
|
|
|
btrfs_info(fs_info, "scrub: %s on devid %llu with status: %d",
|
|
|
|
ret ? "not finished" : "finished", devid, ret);
|
|
|
|
|
2011-03-08 21:14:00 +08:00
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
2018-01-03 16:08:30 +08:00
|
|
|
dev->scrub_ctx = NULL;
|
2011-03-08 21:14:00 +08:00
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
scrub_workers_put(fs_info);
|
2015-02-10 05:14:24 +08:00
|
|
|
scrub_put_ctx(sctx);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2022-08-02 14:53:03 +08:00
|
|
|
/*
|
|
|
|
* We found some super block errors before, now try to force a
|
|
|
|
* transaction commit, as scrub has finished.
|
|
|
|
*/
|
|
|
|
if (need_commit) {
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
|
|
|
|
trans = btrfs_start_transaction(fs_info->tree_root, 0);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"scrub: failed to start transaction to fix super block errors: %d", ret);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
ret = btrfs_commit_transaction(trans);
|
|
|
|
if (ret < 0)
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"scrub: failed to commit transaction to fix super block errors: %d", ret);
|
|
|
|
}
|
2018-12-04 23:11:56 +08:00
|
|
|
return ret;
|
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_scrub_dev+0x11c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_run_dev_stats+0x49/0x480
commit_cowonly_roots+0xb5/0x2a0
btrfs_commit_transaction+0x516/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_commit_transaction+0x4bb/0xa60
sync_filesystem+0x6b/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
__prepare_exit_to_usermode+0x1cc/0x1e0
do_syscall_64+0x5c/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_record_root_in_trans+0x43/0x70
start_transaction+0xd1/0x5d0
btrfs_dirty_inode+0x42/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
perf_read+0x141/0x2c0
vfs_read+0xad/0x1b0
ksys_read+0x5f/0xe0
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x88/0x150
perf_event_init+0x1db/0x20b
start_kernel+0x3ae/0x53c
secondary_startup_64+0xa4/0xb0
-> #1 (pmus_lock){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
perf_event_init_cpu+0x4f/0x150
cpuhp_invoke_callback+0xb1/0x900
_cpu_up.constprop.26+0x9f/0x130
cpu_up+0x7b/0xc0
bringup_nonboot_cpus+0x4f/0x60
smp_init+0x26/0x71
kernel_init_freeable+0x110/0x258
kernel_init+0xa/0x103
ret_from_fork+0x1f/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
cpus_read_lock+0x39/0xb0
alloc_workqueue+0x378/0x450
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->scrub_lock);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->scrub_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by btrfs/229626:
#0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
#1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? alloc_workqueue+0x378/0x450
cpus_read_lock+0x39/0xb0
? alloc_workqueue+0x378/0x450
alloc_workqueue+0x378/0x450
? rcu_read_lock_sched_held+0x52/0x80
__btrfs_alloc_workqueue+0x15d/0x200
btrfs_alloc_workqueue+0x51/0x160
scrub_workers_get+0x5a/0x170
btrfs_scrub_dev+0x18c/0x630
? start_transaction+0xd1/0x5d0
btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
btrfs_ioctl+0x2799/0x30a0
? do_sigaction+0x102/0x250
? lockdep_hardirqs_on_prepare+0xca/0x160
? _raw_spin_unlock_irq+0x24/0x30
? trace_hardirqs_on+0x1c/0xe0
? _raw_spin_unlock_irq+0x24/0x30
? do_sigaction+0x102/0x250
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.
Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.
Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info. We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-10 23:42:29 +08:00
|
|
|
out:
|
|
|
|
scrub_workers_put(fs_info);
|
2018-12-04 23:11:56 +08:00
|
|
|
out_free_ctx:
|
|
|
|
scrub_free_ctx(sctx);
|
|
|
|
|
2011-03-08 21:14:00 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
void btrfs_scrub_pause(struct btrfs_fs_info *fs_info)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
|
|
|
atomic_inc(&fs_info->scrub_pause_req);
|
|
|
|
while (atomic_read(&fs_info->scrubs_paused) !=
|
|
|
|
atomic_read(&fs_info->scrubs_running)) {
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
wait_event(fs_info->scrub_pause_wait,
|
|
|
|
atomic_read(&fs_info->scrubs_paused) ==
|
|
|
|
atomic_read(&fs_info->scrubs_running));
|
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
|
|
|
}
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
void btrfs_scrub_continue(struct btrfs_fs_info *fs_info)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
|
|
|
atomic_dec(&fs_info->scrub_pause_req);
|
|
|
|
wake_up(&fs_info->scrub_pause_wait);
|
|
|
|
}
|
|
|
|
|
2012-11-06 00:03:39 +08:00
|
|
|
int btrfs_scrub_cancel(struct btrfs_fs_info *fs_info)
|
2011-03-08 21:14:00 +08:00
|
|
|
{
|
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
|
|
|
if (!atomic_read(&fs_info->scrubs_running)) {
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
return -ENOTCONN;
|
|
|
|
}
|
|
|
|
|
|
|
|
atomic_inc(&fs_info->scrub_cancel_req);
|
|
|
|
while (atomic_read(&fs_info->scrubs_running)) {
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
wait_event(fs_info->scrub_pause_wait,
|
|
|
|
atomic_read(&fs_info->scrubs_running) == 0);
|
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
|
|
|
}
|
|
|
|
atomic_dec(&fs_info->scrub_cancel_req);
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-03-20 23:32:55 +08:00
|
|
|
int btrfs_scrub_cancel_dev(struct btrfs_device *dev)
|
2012-03-02 00:24:58 +08:00
|
|
|
{
|
2019-03-20 23:32:55 +08:00
|
|
|
struct btrfs_fs_info *fs_info = dev->fs_info;
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
struct scrub_ctx *sctx;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
2018-01-03 16:08:30 +08:00
|
|
|
sctx = dev->scrub_ctx;
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
if (!sctx) {
|
2011-03-08 21:14:00 +08:00
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
return -ENOTCONN;
|
|
|
|
}
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
atomic_inc(&sctx->cancel_req);
|
2018-01-03 16:08:30 +08:00
|
|
|
while (dev->scrub_ctx) {
|
2011-03-08 21:14:00 +08:00
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
wait_event(fs_info->scrub_pause_wait,
|
2018-01-03 16:08:30 +08:00
|
|
|
dev->scrub_ctx == NULL);
|
2011-03-08 21:14:00 +08:00
|
|
|
mutex_lock(&fs_info->scrub_lock);
|
|
|
|
}
|
|
|
|
mutex_unlock(&fs_info->scrub_lock);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2012-03-28 02:21:26 +08:00
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid,
|
2011-03-08 21:14:00 +08:00
|
|
|
struct btrfs_scrub_progress *progress)
|
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
struct btrfs_dev_lookup_args args = { .devid = devid };
|
2011-03-08 21:14:00 +08:00
|
|
|
struct btrfs_device *dev;
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
struct scrub_ctx *sctx = NULL;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_lock(&fs_info->fs_devices->device_list_mutex);
|
2021-10-06 04:12:42 +08:00
|
|
|
dev = btrfs_find_device(fs_info->fs_devices, &args);
|
2011-03-08 21:14:00 +08:00
|
|
|
if (dev)
|
2018-01-03 16:08:30 +08:00
|
|
|
sctx = dev->scrub_ctx;
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
if (sctx)
|
|
|
|
memcpy(progress, &sctx->stat, sizeof(*progress));
|
2016-06-23 06:54:23 +08:00
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
Btrfs: rename the scrub context structure
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-11-02 16:58:09 +08:00
|
|
|
return dev ? (sctx ? 0 : -ENOTCONN) : -ENODEV;
|
2011-03-08 21:14:00 +08:00
|
|
|
}
|