linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-25 05:04:09 +08:00

Omar Sandoval ced8ecf026 btrfs: fix space cache corruption and potential double allocations

When testing space_cache v2 on a large set of machines, we encountered a few symptoms:

1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree and also marked as free in the free space tree, ranges that were marked as allocated twice in the extent tree, or ranges that were marked as free twice in the free space tree. If the latter made it onto disk, the next reboot would hit the BUG_ON() in add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the in-memory space cache (dumped with drgn) disagreed with the free space tree.

All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit `d0c2f4fa55` ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time.

Specifically, the race is as follows:

1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() -> add_to_free_space_tree() adds the deleted extent back to the free space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() -> btrfs_cache_block_group() queues up the block group to get cached. block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls switch_commit_roots(). It sets block_group->last_byte_to_unpin to block_group->progress, which is block_group->start because the block group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots were already switched, load_free_space_tree() sees the deleted extent as free and adds it to the space cache. It finishes caching and sets block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since transaction A is already in TRANS_STATE_SUPER_COMMITTED and the commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls switch_commit_roots(). This time, the block group has already been cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls btrfs_finish_extent_commit(), which calls unpin_extent_range() for the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by transaction B!), so it adds the deleted extent to the space cache again!

This explains all of our symptoms above:

* If the sequence of events is exactly as described above, when the free space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7 and 11, then step 11 will silently re-add that space to the space cache as free even though it is actually allocated. Then, if that space is allocated again, the free space tree will be corrupted (namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue to get worse as extents are deleted and reallocated.

The v1 space_cache is synchronously loaded when an extent is deleted (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group() with load_cache_only=1), so it is not normally affected by this bug. However, as noted above, if we fail to load the space cache, we will fall back to caching from the extent tree and may hit this bug.

The easiest fix for this race is to also make caching from the free space tree or extent tree synchronous. Josef tested this and found no performance regressions.

A few extra changes fall out of this change. Namely, this fix does the following, with step 2 being the crucial fix:

1. Factor btrfs_caching_ctl_wait_done() out of btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of btrfs_wait_space_cache_v1_finished() to btrfs_caching_ctl_wait_done(), which makes us wait regardless of the space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to `bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not used outside of block-group.c anymore.

Fixes: `d0c2f4fa55` ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2022-08-23 22:13:54 +02:00
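The ordering in the commit message above can be replayed in a small single-threaded toy model. This is only a sketch, not kernel code: every `toy_` name, the struct layout, and the byte offsets are hypothetical, and real btrfs uses locking, range lengths, and an rbtree rather than a flat array. The `sync_caching` branch reflects one reading of why synchronous caching helps (the block group is cached before the commit roots are switched, so the deleted extent is not yet visible as free); the commit message itself does not spell that out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_RANGES 8

/* Hypothetical, heavily simplified stand-in for struct btrfs_block_group. */
struct toy_block_group {
	uint64_t start;
	uint64_t progress;           /* how far caching has advanced */
	uint64_t last_byte_to_unpin; /* snapshot of progress taken at commit */
	bool commit_roots_switched;  /* deleted extent visible as free? */
	uint64_t free_start[TOY_MAX_RANGES];
	int nr_free;
};

/* Add a free range to the toy space cache; duplicates fail with -EEXIST. */
static int toy_add_free_space(struct toy_block_group *bg, uint64_t start)
{
	for (int i = 0; i < bg->nr_free; i++)
		if (bg->free_start[i] == start)
			return -EEXIST;
	assert(bg->nr_free < TOY_MAX_RANGES);
	bg->free_start[bg->nr_free++] = start;
	return 0;
}

/* Steps 6 and 10: record how much of the block group is cached so far. */
static void toy_switch_commit_roots(struct toy_block_group *bg)
{
	bg->last_byte_to_unpin = bg->progress;
	bg->commit_roots_switched = true;
}

/* Step 7: caching only sees the extent as free after the root switch. */
static int toy_cache_block_group(struct toy_block_group *bg, uint64_t extent)
{
	int ret = 0;

	if (bg->commit_roots_switched)
		ret = toy_add_free_space(bg, extent);
	bg->progress = UINT64_MAX;
	return ret;
}

/* Step 11: return the extent only if it lies in the already-cached part. */
static int toy_unpin_extent(struct toy_block_group *bg, uint64_t extent)
{
	if (extent < bg->last_byte_to_unpin)
		return toy_add_free_space(bg, extent);
	return 0;
}

/* Replay the sequence; returns the result of transaction A's unpin. */
int toy_replay(bool sync_caching, bool second_commit)
{
	struct toy_block_group bg = { .start = 4096, .progress = 4096 };
	const uint64_t extent = 8192; /* the deleted extent */

	if (sync_caching)
		toy_cache_block_group(&bg, extent); /* caching finishes in step 5 */
	toy_switch_commit_roots(&bg);               /* step 6, transaction A */
	if (!sync_caching)
		toy_cache_block_group(&bg, extent); /* step 7, caching thread */
	if (second_commit)
		toy_switch_commit_roots(&bg);       /* step 10, transaction B */
	return toy_unpin_extent(&bg, extent);       /* step 11, transaction A */
}
```

In this model, `toy_replay(false, true)` reproduces the first symptom and returns `-EEXIST`, while `toy_replay(false, false)` (no overlapping transaction B) and `toy_replay(true, true)` (caching completed synchronously) both return 0 with the extent added exactly once.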
..
9p	9p: fix EBADF errors in cached mode	2022-06-17 06:03:30 +09:00
adfs	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
affs	affs: Convert affs to read_folio	2022-05-09 16:21:44 -04:00
afs	netfs: do not unlock and put the folio twice	2022-07-14 10:10:12 +02:00
autofs
befs	befs: Convert befs to read_folio	2022-05-09 16:21:45 -04:00
bfs	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
btrfs	btrfs: fix space cache corruption and potential double allocations	2022-08-23 22:13:54 +02:00
cachefiles	cachefiles: narrow the scope of flushed requests when releasing fd	2022-07-05 16:12:21 +01:00
ceph	netfs: do not unlock and put the folio twice	2022-07-14 10:10:12 +02:00
cifs	smb3: workaround negprot bug in some Samba servers	2022-07-13 19:59:47 -05:00
coda	coda: Convert coda to read_folio	2022-05-09 16:21:45 -04:00
configfs	configfs: fix a race in configfs_{,un}register_subsystem()	2022-02-22 18:30:28 +01:00
cramfs	cramfs: Convert cramfs to read_folio	2022-05-09 16:21:45 -04:00
crypto	fscrypt: add new helper functions for test_dummy_encryption	2022-05-09 16:18:54 -07:00
debugfs	debugfs: Document that debugfs_create functions need not be error checked	2022-02-25 11:56:13 +01:00
devpts	fsnotify: fix fsnotify hooks in pseudo filesystems	2022-01-24 14:17:02 +01:00
dlm	dlm: use kref_put_lock in __put_lkb	2022-05-02 11:23:49 -05:00
ecryptfs	ecryptfs: Convert ecryptfs to read_folio	2022-05-09 16:21:45 -04:00
efivarfs
efs	efs: Convert efs symlinks to read_folio	2022-05-09 16:21:45 -04:00
erofs	Changes since last update:	2022-06-01 11:54:29 -07:00
exfat	exfat: use updated exfat_chain directly during renaming	2022-06-09 21:26:32 +09:00
exportfs	exportfs: support idmapped mounts	2022-04-28 16:31:10 +02:00
ext2	ext2: fix fs corruption when trying to remove a non-empty directory with IO error	2022-06-16 10:55:45 +02:00
ext4	ext4: fix a doubled word "need" in a comment	2022-06-18 19:36:20 -04:00
f2fs	f2fs: do not count ENOENT for error case	2022-06-21 08:29:56 -07:00
fat	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
freevxfs	SPDX changes for 5.19-rc1	2022-06-03 10:34:34 -07:00
fscache	fscache: Fix invalidation/lookup race	2022-07-05 16:12:55 +01:00
fuse	libnvdimm for 5.19	2022-05-27 15:49:30 -07:00
gfs2	Page cache changes for 5.19	2022-05-24 19:55:07 -07:00
hfs	fs: Change try_to_free_buffers() to take a folio	2022-05-09 23:12:34 -04:00
hfsplus	fs: Change try_to_free_buffers() to take a folio	2022-05-09 23:12:34 -04:00
hostfs	hostfs: Convert hostfs to read_folio	2022-05-09 16:21:45 -04:00
hpfs	hpfs: Convert symlinks to read_folio	2022-05-09 16:21:45 -04:00
hugetlbfs	hugetlbfs: zero partial pages during fallocate hole punch	2022-06-16 19:11:32 -07:00
iomap	Page cache changes for 5.19	2022-05-24 19:55:07 -07:00
isofs	isofs: Convert symlinks and zisofs to read_folio	2022-05-09 16:21:45 -04:00
jbd2	fs: fix jbd2_journal_try_to_free_buffers() kernel-doc comment	2022-06-16 10:36:09 -04:00
jffs2	This pull request contains fixes for JFFS2, UBI and UBIFS	2022-06-03 14:42:24 -07:00
jfs	JFS: One bug fix and some code cleanup	2022-05-27 15:59:21 -07:00
kernfs	kernfs: Separate kernfs_pr_cont_buf and rename_lock.	2022-05-19 19:37:06 +02:00
ksmbd	vfs: fix copy_file_range() regression in cross-fs copies	2022-06-30 15:16:38 -07:00
lockd	lockd: fix nlm_close_files	2022-07-11 15:49:56 -04:00
minix	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
netfs	netfs: do not unlock and put the folio twice	2022-07-14 10:10:12 +02:00
nfs	NFSv4: Add an fattr allocation to _nfs4_discover_trunking()	2022-06-30 16:13:00 -04:00
nfs_common
nfsd	Notable regression fixes:	2022-07-14 12:29:43 -07:00
nilfs2	nilfs2: fix incorrect masking of permission flags for symlinks	2022-07-03 15:42:33 -07:00
nls
notify	fanotify: refine the validation checks on non-dir inode mask	2022-06-28 11:18:13 +02:00
ntfs	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
ntfs3	Ntfs3 for 5.19	2022-06-03 16:57:16 -07:00
ocfs2	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
omfs	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
openpromfs	fs: allocate inode by using alloc_inode_sb()	2022-03-22 15:57:03 -07:00
orangefs	orangefs: Convert to free_folio	2022-05-09 23:12:53 -04:00
overlayfs	ovl: turn of SB_POSIXACL with idmapped layers temporarily	2022-07-08 15:48:31 +02:00
proc	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
pstore	pstore: Don't use semaphores in always-atomic-context code	2022-03-15 11:08:23 -07:00
qnx4	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
qnx6	fs: Convert mpage_readpage to mpage_read_folio	2022-05-09 16:21:44 -04:00
quota	quota: Prevent memory allocation recursion while holding dq_lock	2022-06-06 10:08:10 +02:00
ramfs	Merge branch 'akpm' (patches from Andrew)	2021-11-09 10:11:53 -08:00
reiserfs	fs: Change try_to_free_buffers() to take a folio	2022-05-09 23:12:34 -04:00
romfs	romfs: Convert romfs to read_folio	2022-05-09 16:21:46 -04:00
smbfs_common	Add various fsctl structs	2022-05-23 20:24:12 -05:00
squashfs	Page cache changes for 5.19	2022-05-24 19:55:07 -07:00
sysfs	kobject: kobj_type: remove default_attrs	2022-04-05 15:39:19 +02:00
sysv	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
tracefs	tracefs: Fix syntax errors in comments	2022-06-17 19:01:28 -04:00
ubifs	This pull request contains fixes for JFFS2, UBI and UBIFS	2022-06-03 14:42:24 -07:00
udf	Page cache changes for 5.19	2022-05-24 19:55:07 -07:00
ufs	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
unicode	kbuild: unify cmd_copy and cmd_shipped	2022-02-14 10:37:32 +09:00
vboxsf	vboxsf: Convert vboxsf to read_folio	2022-05-09 16:21:46 -04:00
verity	Page cache changes for 5.19	2022-05-24 19:55:07 -07:00
xfs	xfs: prevent a UAF when log IO errors race with unmount	2022-07-01 09:09:52 -07:00
zonefs	zonefs: fix zonefs_iomap_begin() for reads	2022-06-08 19:13:55 +09:00
aio.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2022-04-01 19:57:03 -07:00
anon_inodes.c
attr.c	fs: account for group membership	2022-06-14 12:18:47 +02:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c	coredump: Snapshot the vmas in do_coredump	2022-03-08 12:55:29 -06:00
binfmt_elf_test.c	binfmt_elf: Introduce KUnit test	2022-03-03 20:38:56 -08:00
binfmt_elf.c	revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"	2022-04-15 14:49:56 -07:00
binfmt_flat.c	binfmt_flat: Remove shared library support	2022-04-22 10:57:18 -07:00
binfmt_misc.c	Fix regression due to "fs: move binfmt_misc sysctl to its own file"	2022-02-09 09:50:02 -08:00
binfmt_script.c
buffer.c	fs: Convert drop_buffers() to use a folio	2022-05-09 23:12:34 -04:00
char_dev.c
compat_binfmt_elf.c	binfmt_elf: Introduce KUnit test	2022-03-03 20:38:56 -08:00
coredump.c	ptrace: Cleanups for v5.18	2022-03-28 17:29:53 -07:00
d_path.c	d_path: fix Kernel doc validator complaining	2021-11-06 13:30:32 -07:00
dax.c	libnvdimm for 5.19	2022-05-27 15:49:30 -07:00
dcache.c	mm: dcache: use kmem_cache_alloc_lru() to allocate dentry	2022-03-22 15:57:03 -07:00
direct-io.c	direct-io: remove random prefetches	2022-04-17 19:50:02 -06:00
drop_caches.c
eventfd.c
eventpoll.c	eventpoll: simplify sysctl declaration with register_sysctl()	2022-01-22 08:33:35 +02:00
exec.c	fix race between exit_itimers() and /proc/pid/timers	2022-07-11 09:52:59 -07:00
fcntl.c	VFS: add FMODE_CAN_ODIRECT file flag	2022-05-09 18:20:49 -07:00
fhandle.c
file_table.c	Descriptor handling cleanups	2022-06-04 18:52:00 -07:00
file.c	fix the breakage in close_fd_get_file() calling conventions change	2022-06-05 15:03:03 -04:00
filesystems.c
fs_context.c	vfs: fs_context: fix up param length parsing in legacy_parse_param	2022-01-18 09:23:19 +02:00
fs_parser.c	fs_parse: allow parameter value to be empty	2021-12-09 14:09:36 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c	writeback: Fix inode->i_io_list not be protected by inode->i_lock error	2022-06-06 09:54:30 +02:00
fsopen.c	uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)	2022-05-19 23:25:10 -04:00
init.c
inode.c	writeback: Fix inode->i_io_list not be protected by inode->i_lock error	2022-06-06 09:54:30 +02:00
internal.h	Cleanups (and one fix) around struct mount handling.	2022-06-04 19:00:05 -07:00
io_uring.c	io_uring: do not recycle buffer in READV	2022-07-21 08:31:31 -06:00
io-wq.c	io-wq: use __set_notify_signal() to wake workers	2022-04-30 08:39:54 -06:00
io-wq.h	io_uring: add support for IORING_ASYNC_CANCEL_ALL	2022-04-24 18:18:18 -06:00
ioctl.c	Fixes for 5.18-rc1:	2022-04-01 19:35:56 -07:00
Kconfig	mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*	2022-04-28 23:16:15 -07:00
Kconfig.binfmt	m68knommu: changes for linux 5.19	2022-05-30 10:56:18 -07:00
kernel_read_file.c
libfs.c	fs: Convert simple_readpage to simple_read_folio	2022-05-09 16:21:44 -04:00
locks.c	fs/lock: add 2 callbacks to lock_manager_operations to resolve conflict	2022-05-19 12:25:39 -04:00
Makefile	Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the	2022-02-01 11:13:24 -08:00
mbcache.c
mount.h
mpage.c	fs: Change try_to_free_buffers() to take a folio	2022-05-09 23:12:34 -04:00
namei.c	Several cleanups in fs/namei.c.	2022-06-04 19:07:15 -07:00
namespace.c	Cleanups (and one fix) around struct mount handling.	2022-06-04 19:00:05 -07:00
no-block.c
nsfs.c
open.c	RISC-V Patches for the 5.19 Merge Window, Part 1	2022-05-31 14:10:54 -07:00
pipe.c	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
pnode.c
pnode.h
posix_acl.c	fs: fix acl translation	2022-04-19 10:19:02 -07:00
proc_namespace.c	fs: add is_idmapped_mnt() helper	2021-12-03 18:44:06 +01:00
read_write.c	vfs: fix copy_file_range() regression in cross-fs copies	2022-06-30 15:16:38 -07:00
readdir.c
remap_range.c	Revert "vf/remap: return the amount of bytes actually deduplicated"	2022-07-14 15:35:24 -07:00
select.c	select: Fix indefinitely sleeping task in poll_schedule_timeout()	2022-01-11 09:03:05 -08:00
seq_file.c	rxrpc: Fix locking issue	2022-05-22 21:03:01 +01:00
signalfd.c	Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2022-01-17 05:49:30 +02:00
splice.c	mm: Convert remove_mapping() to take a folio	2022-03-21 12:59:01 -04:00
stack.c
stat.c	RISC-V Patches for the 5.19 Merge Window, Part 1	2022-05-31 14:10:54 -07:00
statfs.c
super.c	block: add a bdev_stable_writes helper	2022-04-17 19:49:59 -06:00
sync.c	riscv: compat: syscall: Add compat_sys_call_table implementation	2022-04-26 13:36:25 -07:00
sysctls.c	fs: move namespace sysctls and declare fs base directory	2022-01-22 08:33:36 +02:00
timerfd.c
userfaultfd.c	mm/uffd: enable write protection for shmem & hugetlbfs	2022-05-13 07:20:11 -07:00
utimes.c
xattr.c	fs: split off do_getxattr from getxattr	2022-04-24 18:18:37 -06:00