linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-11 12:28:41 +08:00

Zygo Blaxell e7acd6209f

btrfs: fix stripe length calculation for non-zoned data chunk allocation

commit `8a540e990d` upstream.

Commit `f6fca3917b` "btrfs: store chunk size in space-info struct" broke data chunk allocations on non-zoned multi-device filesystems when using default chunk_size. Commit `5da431b71d` "btrfs: fix the max chunk size and stripe length calculation" partially fixed that, and this patch completes the fix for that case.

After commit `f6fca3917b` and `5da431b71d`, the sequence of events for a data chunk allocation on a non-zoned filesystem is:

1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len unmodified. Before `f6fca3917b`, ctl->max_stripe_len value was 1 GiB for non-zoned data chunks and not configurable.

2. btrfs_create_chunk calls gather_device_info which consumes and produces more fields of chunk_ctl.

3. gather_device_info multiplies ctl->max_stripe_len by ctl->dev_stripes (which is 1 in all cases except dup) and calls find_free_dev_extent with that number as num_bytes.

4. find_free_dev_extent locates the first dev_extent hole on a device which is at least as large as num_bytes. With default max_chunk_size from `f6fca3917b`, it finds the first hole which is longer than 10 GiB, or the largest hole if that hole is shorter than 10 GiB. This is different from the pre-f6fca3917b4d behavior, where num_bytes is 1 GiB, and find_free_dev_extent may choose a different hole.

5. gather_device_info repeats step 4 with all devices to find the first or largest dev_extent hole that can be allocated on each device.

6. gather_device_info sorts the device list by the hole size on each device, using total unallocated space on each device to break ties, then returns to btrfs_create_chunk with the list.

7. btrfs_create_chunk calls decide_stripe_size_regular.

8. decide_stripe_size_regular finds the largest stripe_len that fits across the first nr_devs device dev_extent holes that were found by gather_device_info (and satisfies other constraints on stripe_len that are not relevant here).

9. decide_stripe_size_regular caps the length of the stripe it computed at 1 GiB. This cap appeared in `5da431b71d` to correct one of the other regressions introduced in `f6fca3917b`.

10. btrfs_create_chunk creates a new chunk with the above computed size and number of devices.

At step 4, gather_device_info() has found a location where stripe up to 10 GiB in length could be allocated on several devices, and selected which devices should have a dev_extent allocated on them, but at step 9, only 1 GiB of the space that was found on each device can be used. This mismatch causes new suboptimal chunk allocation cases that did not occur in pre-f6fca3917b4d kernels.

Consider a filesystem using raid1 profile with 3 devices. After some balances, device 1 has 10x 1 GiB unallocated space, while devices 2 and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of space, but distributed across different numbers of dev_extent holes. For visualization, let's ignore all the chunks that were allocated before this point, and focus on the remaining holes:

    Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
    Device 2:  [__________] (10 GiB contig unallocated)
    Device 3:  [__________] (10 GiB contig unallocated)

Before `f6fca3917b`, the allocator would fill these optimally by allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3 ([13]), or 2 and 3 ([23]):

    [after 0 chunk allocations]
    Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
    Device 2:  [__________] (10 GiB)
    Device 3:  [__________] (10 GiB)

    [after 1 chunk allocation]
    Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
    Device 2:  [12] [_________] (9 GiB)
    Device 3:  [__________] (10 GiB)

    [after 2 chunk allocations]
    Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
    Device 2:  [12] [_________] (9 GiB)
    Device 3:  [13] [_________] (9 GiB)

    [after 3 chunk allocations]
    Device 1:  [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
    Device 2:  [12] [12] [________] (8 GiB)
    Device 3:  [13] [_________] (9 GiB)

    [...]

    [after 12 chunk allocations]
    Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
    Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
    Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)

    [after 13 chunk allocations]
    Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
    Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
    Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)

    [after 14 chunk allocations]
    Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
    Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
    Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)

    [after 15 chunk allocations]
    Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
    Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
    Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)

This allocates all of the space with no waste. The sorting function used by gather_device_info considers free space holes above 1 GiB in length to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently long hole on each device, all the holes appear equal in the sort, and the comparison falls back to sorting devices by total free space. This keeps usable space on each device equal so they can all be filled completely.

After `f6fca3917b`, the allocator prefers the devices with larger holes over the devices with more free space, so it makes bad allocation choices:

    [after 1 chunk allocation]
    Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
    Device 2:  [23] [_________] (9 GiB)
    Device 3:  [23] [_________] (9 GiB)

    [after 2 chunk allocations]
    Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
    Device 2:  [23] [23] [________] (8 GiB)
    Device 3:  [23] [23] [________] (8 GiB)

    [after 3 chunk allocations]
    Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
    Device 2:  [23] [23] [23] [_______] (7 GiB)
    Device 3:  [23] [23] [23] [_______] (7 GiB)

    [...]

    [after 9 chunk allocations]
    Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
    Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
    Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

    [after 10 chunk allocations]
    Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
    Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
    Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

    [after 11 chunk allocations]
    Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
    Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
    Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)

No further allocations are possible, with 8 GiB wasted (4 GiB of data space). The sort in gather_device_info now considers free space in holes longer than 1 GiB to be distinct, so it will prefer devices 2 and 3 over device 1 until all but 1 GiB is allocated on devices 2 and 3. At that point, with only 1 GiB unallocated on every device, the largest hole length on each device is equal at 1 GiB, so the sort finally moves to ordering the devices with the most free space, but by this time it is too late to make use of the free space on device 1.

Note that it's possible to contrive a case where the pre-f6fca3917b4d allocator fails the same way, but these cases generally have extensive dev_extent fragmentation as a precondition (e.g. many holes of 768M in length on one device, and few holes 1 GiB in length on the others). With the regression in `f6fca3917b`, bad chunk allocation can occur even under optimal conditions, when all dev_extent holes are exact multiples of stripe_len in length, as in the example above.

Also note that post-f6fca3917b4d kernels do treat dev_extent holes larger than 10 GiB as equal, so the bad behavior won't show up on a freshly formatted filesystem; however, as the filesystem ages and fills up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem can show up as a failure to balance after adding or removing devices, or an unexpected shortfall in available space due to unequal allocation.

To fix the regression and make data chunk allocation work again, set ctl->max_stripe_len back to the original SZ_1G, or space_info->chunk_size if that's smaller (the latter can happen if the user set space_info->chunk_size to less than 1 GiB via sysfs, or it's a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).

While researching the background of the earlier commits, I found that an identical fix was already proposed at:

    https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/

The previous review missed one detail: ctl->max_stripe_len is used before decide_stripe_size_regular() is called, when it is too late for the changes in that function to have any effect. ctl->max_stripe_len is not used directly by decide_stripe_size_regular(), but the parameter does heavily influence the per-device free space data presented to the function.

Fixes: `f6fca3917b` ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2023-10-25 12:16:10 +02:00
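
The raid1 example in the commit message can be reproduced with a small simulation. What follows is only a rough sketch under simplifying assumptions, not the kernel code: sizes are whole GiB, every chunk uses exactly two devices, and the helper names `find_free_hole` and `allocate_raid1` are hypothetical. It models steps 4 to 9 above: pick the first hole of at least `max_stripe_len` (or the largest hole), sort devices by that hole size with total free space as the tie-breaker, then cap the stripe at 1 GiB.

```python
STRIPE_CAP_GIB = 1  # models the 1 GiB cap described in step 9


def find_free_hole(holes, want):
    """Index of the first hole >= want, else the largest hole, else None."""
    for i, size in enumerate(holes):
        if size >= want:
            return i
    if not holes:
        return None
    return max(range(len(holes)), key=lambda i: holes[i])


def allocate_raid1(devices, max_stripe_len):
    """Allocate 2-device chunks until fewer than two devices have space.

    devices: dict mapping device name -> list of hole sizes in GiB.
    Returns (number of chunks allocated, GiB left unallocated).
    """
    devices = {name: list(holes) for name, holes in devices.items()}
    chunks = 0
    while True:
        candidates = []
        for name, holes in devices.items():
            idx = find_free_hole(holes, max_stripe_len)
            if idx is None:
                continue  # device is full
            # The usable hole size is reported as at most the requested length.
            max_avail = min(holes[idx], max_stripe_len)
            candidates.append((max_avail, sum(holes), name, idx))
        if len(candidates) < 2:
            break
        # Sort by hole size, then by total unallocated space, both descending.
        candidates.sort(key=lambda c: (c[0], c[1]), reverse=True)
        chosen = candidates[:2]
        stripe = min(chosen[-1][0], STRIPE_CAP_GIB)
        for _, _, name, idx in chosen:
            devices[name][idx] -= stripe
            if devices[name][idx] == 0:
                del devices[name][idx]
        chunks += 1
    wasted = sum(sum(holes) for holes in devices.values())
    return chunks, wasted


layout = {"dev1": [1] * 10, "dev2": [10], "dev3": [10]}

# max_stripe_len of 1 GiB (old behavior, and the fix: min(chunk_size, 1 GiB))
print(allocate_raid1(layout, min(10, 1)))  # (15, 0)
# max_stripe_len copied from the 10 GiB default chunk_size (the regression)
print(allocate_raid1(layout, 10))          # (11, 8)
```

In this model, a 1 GiB `max_stripe_len` fills all three devices in 15 chunks, while a 10 GiB value stops after 11 chunks with 8 GiB stranded on device 1, matching the two diagrams in the commit message. The order in which individual chunks land may differ from the diagrams because the tie-breaking order among equal devices is not modeled.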
..
9p	fs/9p: Remove unused extern declaration	2023-07-20 19:21:48 +00:00
adfs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
affs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
afs	afs: Fix accidental truncation when storing data	2023-07-04 12:24:32 -07:00
autofs	autofs: fix memory leak of waitqueues in autofs_catatonic_mode	2023-09-23 11:14:17 +02:00
befs	befs: Replace all non-returning strlcpy with strscpy	2023-05-30 16:42:00 -07:00
bfs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
btrfs	btrfs: fix stripe length calculation for non-zoned data chunk allocation	2023-10-25 12:16:10 +02:00
cachefiles	v6.5/vfs.file	2023-06-26 10:14:36 -07:00
ceph	ceph: fix type promotion bug on 32bit systems	2023-10-19 23:11:05 +02:00
coda	vfs: get rid of old '->iterate' directory operation	2023-08-06 15:08:35 +02:00
configfs	fs: consolidate duplicate dt_type helpers	2023-04-03 09:23:54 +02:00
cramfs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
crypto	fscrypt: Replace 1-element array with flexible array	2023-05-23 19:46:09 -07:00
debugfs	debugfs: Correct the 'debugfs_create_str' docs	2023-05-31 19:02:14 +01:00
devpts	devpts: simplify two-level sysctl registration for pty_kern_table	2023-03-13 12:36:34 +01:00
dlm	dlm: fix plock lookup when using multiple lockspaces	2023-09-13 09:53:54 +02:00
ecryptfs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
efivarfs	efivarfs: fix statfs() on efivarfs	2023-09-23 11:14:32 +02:00
efs
erofs	erofs: allow empty device tags in flatdev mode	2023-10-10 22:02:59 +02:00
exfat	vfs: get rid of old '->iterate' directory operation	2023-08-06 15:08:35 +02:00
exportfs	vfs: get rid of old '->iterate' directory operation	2023-08-06 15:08:35 +02:00
ext2	ext2: fix datatype of block number in ext2_xattr_set2()	2023-09-23 11:14:26 +02:00
ext4	ext4: do not let fstrim block system suspend	2023-10-06 13:15:46 +02:00
f2fs	f2fs: avoid false alarm of circular locking	2023-09-19 12:30:23 +02:00
fat	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
freevxfs	There is no particular theme here - mainly quick hits all over the tree.	2023-02-23 17:55:40 -08:00
fscache	fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work()	2023-01-30 12:51:54 +00:00
fuse	fuse: nlookup missing decrement in fuse_direntplus_link	2023-09-19 12:30:23 +02:00
gfs2	gfs2: fix glock shrinker ref issues	2023-10-06 13:16:17 +02:00
hfs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
hfsplus	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
hostfs	Landlock updates for v6.5-rc1	2023-06-27 17:10:27 -07:00
hpfs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
hugetlbfs	hugetlb: revert use of page_cache_next_miss()	2023-06-23 16:59:32 -07:00
iomap	iomap: Fix possible overflow condition in iomap_write_delalloc_scan	2023-09-23 11:14:17 +02:00
isofs
jbd2	jbd2: correct the end of the journal recovery scan range	2023-09-19 12:30:22 +02:00
jffs2	for-6.5/splice-2023-06-23	2023-06-26 11:52:12 -07:00
jfs	jfs: fix invalid free of JFS_IP(ipimap)->i_imap in diUnmount	2023-09-23 11:14:26 +02:00
kernfs	kernfs: fix missing kernfs_iattr_rwsem locking	2023-09-19 12:30:09 +02:00
lockd	fs: lockd: avoid possible wrong NULL parameter	2023-09-13 09:53:33 +02:00
minix	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
netfs	netfs: Only call folio_start_fscache() one time for each folio	2023-10-06 13:15:45 +02:00
nfs	NFSv4: Fix a nfs4_state_manager() race	2023-10-10 22:03:01 +02:00
nfs_common	NFSv4.2: remove MODULE_LICENSE in non-modules	2023-04-13 13:13:52 -07:00
nfsd	NFSD: Fix zero NFSv4 READ results when RQ_SPLICE_OK is not set	2023-10-06 13:16:07 +02:00
nilfs2	nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()	2023-10-06 13:16:21 +02:00
nls	fs/nls: make load_nls() take a const parameter	2023-07-25 00:30:02 -05:00
notify	fanotify: disallow mount/sb marks on kernel internal pseudo fs	2023-07-04 13:29:29 +02:00
ntfs	vfs: get rid of old '->iterate' directory operation	2023-08-06 15:08:35 +02:00
ntfs3	driver ntfs3 for linux 6.5	2023-07-07 14:59:38 -07:00
ocfs2	fs: ocfs2: namei: check return value of ocfs2_add_entry()	2023-09-13 09:53:08 +02:00
omfs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
openpromfs
orangefs	orangefs: Provide a splice-read wrapper	2023-05-24 08:42:16 -06:00
overlayfs	ovl: fix regression in showing lowerdir mount option	2023-10-19 23:11:08 +02:00
proc	proc: nommu: fix empty /proc/<pid>/maps	2023-10-06 13:15:59 +02:00
pstore	pstore/ram: Check start of empty przs during init	2023-09-13 09:53:55 +02:00
qnx4	qnx4: credit contributors in CREDITS	2023-03-14 12:56:30 -06:00
qnx6	qnx6: credit contributor and mark filesystem orphan	2023-03-14 12:56:30 -06:00
quota	quota: Fix slow quotaoff	2023-10-19 23:10:56 +02:00
ramfs	- Yosry Ahmed brought back some cgroup v1 stats in OOM logs.	2023-06-28 10:28:11 -07:00
reiserfs	reiserfs: Check the return value from __getblk()	2023-09-13 09:52:57 +02:00
romfs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
smb	ksmbd: not allow to open file if delelete on close bit is set	2023-10-19 23:11:04 +02:00
squashfs	squashfs: fix cache race with migration	2023-07-08 09:29:30 -07:00
sysfs	sysfs: Skip empty folders creation	2023-06-15 13:37:53 +02:00
sysv	for-6.5/splice-2023-06-23	2023-06-26 11:52:12 -07:00
tracefs	tracefs: Add missing lockdown check to tracefs_create_dir()	2023-09-23 11:14:37 +02:00
ubifs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
udf	\n	2023-06-29 13:39:51 -07:00
ufs	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
unicode	unicode: remove MODULE_LICENSE in non-modules	2023-04-13 13:13:54 -07:00
vboxsf	hardening fixes for v6.5-rc6	2023-08-08 14:59:49 -07:00
verity	fsverity: skip PKCS#7 parser when keyring is empty	2023-09-13 09:53:55 +02:00
xfs	xfs: convert flex-array declarations in xfs attr shortform objects	2023-07-17 08:48:56 -07:00
zonefs	zonefs: fix synchronous direct writes to sequential files	2023-08-10 12:59:47 +09:00
aio.c	fs/aio: Stop allocating aio rings from HIGHMEM	2023-06-15 09:22:23 +02:00
anon_inodes.c
attr.c	attr: block mode changes of symlinks	2023-09-23 11:14:34 +02:00
bad_inode.c	fs: port ->permission() to pass mnt_idmap	2023-01-19 09:24:28 +01:00
binfmt_elf_fdpic.c	fs: binfmt_elf_efpic: fix personality for ELF-FDPIC	2023-10-06 13:16:29 +02:00
binfmt_elf_test.c
binfmt_elf.c	Merge branch 'expand-stack'	2023-06-28 20:35:21 -07:00
binfmt_flat.c
binfmt_misc.c	binfmt_misc: fix shift-out-of-bounds in check_special_flags	2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c	\n	2023-06-29 13:39:51 -07:00
char_dev.c	vfs: Replace all non-returning strlcpy with strscpy	2023-05-15 09:42:01 +02:00
compat_binfmt_elf.c
coredump.c	v6.5/vfs.misc	2023-06-26 09:50:21 -07:00
d_path.c	fs: d_path: include internal.h	2023-05-17 09:16:59 +02:00
dax.c	dax: enable dax fault handler to report VM_FAULT_HWPOISON	2023-06-26 07:54:23 -06:00
dcache.c
direct-io.c	- Yosry Ahmed brought back some cgroup v1 stats in OOM logs.	2023-06-28 10:28:11 -07:00
drop_caches.c
eventfd.c	eventfd: prevent underflow for eventfd semaphores	2023-09-13 09:52:58 +02:00
eventpoll.c	v6.5/vfs.misc	2023-06-26 09:50:21 -07:00
exec.c	\n	2023-06-29 13:31:44 -07:00
fcntl.c	fs.idmapped.v6.3	2023-02-20 11:53:11 -08:00
fhandle.c	fsnotify: move fsnotify_open() hook into do_dentry_open()	2023-06-12 10:43:45 +02:00
file_table.c	fs: move cleanup from init_file() into its callers	2023-07-02 13:15:49 +02:00
file.c	fs: Fix kernel-doc warnings	2023-10-19 23:11:08 +02:00
filesystems.c
fs_context.c	fs: factor out vfs_parse_monolithic_sep() helper	2023-10-19 23:11:08 +02:00
fs_parser.c	ext4: journal_path mount options should follow links	2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c	writeback: move wb_over_bg_thresh() call outside lock section	2023-06-09 16:25:14 -07:00
fsopen.c
init.c	fs: port ->permission() to pass mnt_idmap	2023-01-19 09:24:28 +01:00
inode.c	locking: remove spin_lock_prefetch	2023-08-12 09:18:47 -07:00
internal.h	v6.5/vfs.file	2023-06-26 10:14:36 -07:00
ioctl.c	fs: Fix kernel-doc warnings	2023-10-19 23:11:08 +02:00
Kconfig	smb: move client and server files to common directory fs/smb	2023-05-24 16:29:21 -05:00
Kconfig.binfmt
kernel_read_file.c	fs: Fix kernel-doc warnings	2023-10-19 23:11:08 +02:00
libfs.c	direct_write_fallback(): on error revert the ->ki_pos update from buffered write	2023-10-06 13:16:01 +02:00
locks.c	locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock	2023-09-23 11:14:17 +02:00
Makefile	for-6.5/block-2023-06-23	2023-06-26 12:47:20 -07:00
mbcache.c	ext4: fix deadlock due to mbcache entry corruption	2022-12-08 21:49:25 -05:00
mnt_idmapping.c	fs: move mnt_idmap	2023-01-19 09:24:30 +01:00
mount.h
mpage.c	mpage: use folios in bio end_io handler	2023-04-18 16:30:02 -07:00
namei.c	fs: Fix kernel-doc warnings	2023-10-19 23:11:08 +02:00
namespace.c	v6.5/vfs.mount	2023-06-26 10:27:04 -07:00
nsfs.c	kill the last remaining user of proc_ns_fget()	2023-04-20 22:55:35 -04:00
open.c	fs: Fix kernel-doc warnings	2023-10-19 23:11:08 +02:00
pipe.c	pipe: check for IOCB_NOWAIT alongside O_NONBLOCK	2023-05-12 17:17:27 +02:00
pnode.c	fs: allow to mount beneath top mount	2023-05-19 04:30:22 +02:00
pnode.h	fs: allow to mount beneath top mount	2023-05-19 04:30:22 +02:00
posix_acl.c	acl: don't depend on IOP_XATTR	2023-03-06 09:59:20 +01:00
proc_namespace.c	tty, proc, kernfs, random: Use copy_splice_read()	2023-05-24 08:42:16 -06:00
read_write.c	splice: Use filemap_splice_read() instead of generic_file_splice_read()	2023-05-24 08:42:17 -06:00
readdir.c	vfs: get rid of old '->iterate' directory operation	2023-08-06 15:08:35 +02:00
remap_range.c	fs: use UB-safe check for signed addition overflow in remap_verify_area	2023-05-24 11:03:59 +02:00
select.c
seq_file.c	use less confusing names for iov_iter direction initializers	2022-11-25 13:01:55 -05:00
signalfd.c
splice.c	splice: fsnotify_access(in), fsnotify_modify(out) on success in tee	2023-09-13 09:52:58 +02:00
stack.c
stat.c	fs.idmapped.v6.3	2023-02-20 11:53:11 -08:00
statfs.c	statfs: enforce statfs[64] structure initialization	2023-05-17 15:20:17 +02:00
super.c	\n	2023-06-29 13:39:51 -07:00
sync.c
sysctls.c	sysctl: Refactor base paths registrations	2023-05-23 21:43:26 -07:00
timerfd.c
userfaultfd.c	Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes.	2023-06-23 16:58:19 -07:00
utimes.c	fs.idmapped.v6.3	2023-02-20 11:53:11 -08:00
xattr.c	fs: don't call posix_acl_listxattr in generic_listxattr	2023-05-17 15:25:20 +02:00