linux/fs
Dave Chinner c85007e2e3 xfs: don't use BMBT btree split workers for IO completion
When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.

A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:

1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.

2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.

3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run

4. The split work from the unwritten extent conversion must run
first.

5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.

6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.

At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....

This is a highly improbably series of events, but there it is.

There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.

This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.

The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:24 -08:00
..
9p 9p-for-6.2-rc1 2022-12-23 11:39:18 -08:00
adfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
affs affs: initialize fsdata in affs_truncate() 2023-01-10 14:55:20 +01:00
afs rxrpc: Move call state changes from recvmsg to I/O thread 2023-01-06 09:43:33 +00:00
autofs autofs: remove unused ino field inode 2022-07-17 17:31:42 -07:00
befs befs: Convert befs_symlink_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
bfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
btrfs for-6.2-rc4-tag 2023-01-20 11:59:01 -08:00
cachefiles fscache,cachefiles: add prepare_ondemand_read() callback 2022-12-07 10:56:29 +08:00
ceph ceph: avoid use-after-free in ceph_fl_release_lock() 2023-01-02 12:27:25 +01:00
cifs cifs: Fix oops due to uncleared server->smbd_conn in reconnect 2023-01-25 09:57:48 -06:00
coda coda: Convert coda_symlink_filler() to use a folio 2022-08-02 12:34:03 -04:00
configfs configfs: fix possible memory leak in configfs_create_dir() 2022-12-02 11:11:22 +01:00
cramfs cramfs: read_mapping_page() is synchronous 2022-08-02 12:34:02 -04:00
crypto for-6.2/block-2022-12-08 2022-12-13 10:43:59 -08:00
debugfs debugfs: fix error when writing negative value to atomic_t debugfs file 2022-11-30 16:13:16 -08:00
devpts
dlm Treewide: Stop corrupting socket's task_frag 2022-12-19 17:28:49 -08:00
ecryptfs ecryptfs: use stub posix acl handlers 2022-10-20 10:13:31 +02:00
efivarfs efi: vars: prohibit reading random seed variables 2022-12-01 09:51:21 +01:00
efs efs: Convert efs symlinks to read_folio 2022-05-09 16:21:45 -04:00
erofs erofs: clean up parsing of fscache related options 2023-01-16 22:39:47 +08:00
exfat Description for this pull request: 2022-12-15 18:14:21 -08:00
exportfs exportfs: use pr_debug for unreachable debug statements 2022-11-28 12:54:45 -05:00
ext2 \n 2022-12-12 20:32:50 -08:00
ext4 ext4: make xattr char unsignedness in hash explicit 2023-01-24 12:38:45 -08:00
f2fs f2fs: let's avoid panic if extent_tree is not created 2023-01-03 08:59:06 -08:00
fat MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
freevxfs freevxfs: Convert vxfs_immed_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
fscache iov_iter work; most of that is about getting rid of 2022-12-12 18:29:54 -08:00
fuse fuse: fixes after adapting to new posix acl api 2023-01-24 16:33:37 +01:00
gfs2 Revert "gfs2: stop using generic_writepages in gfs2_ail1_start_one" 2023-01-22 09:46:14 +01:00
hfs hfs/hfsplus: avoid WARN_ON() for sanity check, use proper error handling 2023-01-06 14:09:13 -08:00
hfsplus MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
hostfs hostfs: move from strlcpy with unused retval to strscpy 2022-09-19 22:46:25 +02:00
hpfs hpfs: remove ->writepage 2022-12-11 18:12:18 -08:00
hugetlbfs hugetlbfs: inode: remove unnecessary (void*) conversions 2022-11-30 15:58:56 -08:00
iomap New XFS code for 6.2: 2022-12-14 10:11:51 -08:00
isofs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
jbd2 jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeout 2022-12-08 21:49:25 -05:00
jffs2 fs: rename current get acl method 2022-10-20 10:13:27 +02:00
jfs MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
kernfs kernfs: fix all kernel-doc warnings and multiple typos 2022-11-23 19:28:26 +01:00
ksmbd ksmbd: downgrade ndr version error message to debug 2023-01-25 18:31:18 -06:00
lockd NFSD 6.2 Release Notes 2022-12-12 20:54:39 -08:00
minix vfs: open inside ->tmpfile() 2022-09-24 07:00:00 +02:00
netfs use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
nfs NFS client fixes for Linux 6.2 2023-01-07 10:38:11 -08:00
nfs_common
nfsd nfsd-6.2 fixes: 2023-01-24 12:58:47 -08:00
nilfs2 nilfs2: fix general protection fault in nilfs_btree_insert() 2023-01-11 16:14:21 -08:00
nls
notify Merge tag 'fsnotify-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2022-10-07 08:28:50 -07:00
ntfs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
ntfs3 fs/ntfs3: don't hold ni_lock when calling truncate_setsize() 2023-01-02 10:31:09 -08:00
ocfs2 Treewide: Stop corrupting socket's task_frag 2022-12-19 17:28:49 -08:00
omfs omfs: remove ->writepage 2022-12-11 18:12:18 -08:00
openpromfs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
orangefs orangefs: four fixes from Zhang Xiaoxu and two from Colin Ian King 2022-12-14 11:16:33 -08:00
overlayfs ovl: fail on invalid uid/gid mapping at copy up 2023-01-27 16:17:19 +01:00
proc ARM64: 2022-12-15 11:12:21 -08:00
pstore pstore updates for v6.2-rc1-fixes 2022-12-23 11:55:54 -08:00
qnx4 fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
qnx6 fs/qnx6: delete unnecessary checks before brelse() 2022-09-11 21:55:07 -07:00
quota ext4: fix bug_on in __es_tree_search caused by bad quota inode 2022-12-08 21:49:23 -05:00
ramfs tmpfile API change 2022-10-10 19:45:17 -07:00
reiserfs lsm/stable-6.2 PR 20221212 2022-12-13 09:47:48 -08:00
romfs romfs: Convert romfs to read_folio 2022-05-09 16:21:46 -04:00
smbfs_common smb3: define missing create contexts 2022-10-05 01:55:27 -05:00
squashfs fs.idmapped.squashfs.v6.2 2022-12-12 20:24:51 -08:00
sysfs kobject: kobj_type: remove default_attrs 2022-04-05 15:39:19 +02:00
sysv fs: sysv: Fix sysv_nblocks() returns wrong value 2022-12-10 14:13:37 -05:00
tracefs tracefs: Only clobber mode/uid/gid on remount if asked 2022-09-08 17:10:54 -04:00
ubifs treewide: use get_random_u32_below() instead of deprecated function 2022-11-18 02:15:15 +01:00
udf udf: initialize newblock to 0 2023-01-06 15:44:32 +01:00
ufs ufs: replace ll_rw_block() 2022-09-11 20:26:07 -07:00
unicode
vboxsf vboxsf: Convert vboxsf to read_folio 2022-05-09 16:21:46 -04:00
verity fsverity: simplify fsverity_get_digest() 2022-11-29 21:07:41 -08:00
xfs xfs: don't use BMBT btree split workers for IO completion 2023-02-05 08:48:24 -08:00
zonefs zonefs: Detect append writes at invalid locations 2023-01-16 08:42:12 +09:00
aio.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
anon_inodes.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
attr.c attr: use consistent sgid stripping checks 2022-10-18 10:09:47 +02:00
bad_inode.c fs: rename current get acl method 2022-10-20 10:13:27 +02:00
binfmt_elf_fdpic.c elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size} 2023-01-05 15:12:12 +00:00
binfmt_elf_test.c
binfmt_elf.c elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size} 2023-01-05 15:12:12 +00:00
binfmt_flat.c binfmt_flat: Remove shared library support 2022-04-22 10:57:18 -07:00
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
char_dev.c chardev: fix error handling in cdev_device_add() 2022-12-02 17:48:59 +01:00
compat_binfmt_elf.c
coredump.c hardening updates for v6.2-rc1 2022-12-14 12:20:00 -08:00
d_path.c d_path.c: typo fix... 2022-08-20 11:34:33 -04:00
dax.c fsdax,xfs: port unshare to fsdax 2022-12-11 18:12:17 -08:00
dcache.c tmpfile API change 2022-10-10 19:45:17 -07:00
direct-io.c block: remove PSI accounting from the bio layer 2022-09-20 08:24:38 -06:00
drop_caches.c
eventfd.c eventfd: provide a eventfd_signal_mask() helper 2022-11-22 06:07:55 -07:00
eventpoll.c eventpoll: add EPOLL_URING_WAKE poll wakeup flag 2022-11-21 07:45:29 -07:00
exec.c fs.vfsuid.conversion.v6.2 2022-12-12 19:20:05 -08:00
fcntl.c keep iocb_flags() result cached in struct file 2022-06-10 16:10:23 -04:00
fhandle.c do_sys_name_to_handle(): constify path 2022-09-01 17:36:39 -04:00
file_table.c locks: fix TOCTOU race when granting write lease 2022-08-16 10:59:54 -04:00
file.c fs: use acquire ordering in __fget_light() 2022-10-31 15:30:11 -04:00
filesystems.c
fs_context.c
fs_parser.c ext4: journal_path mount options should follow links 2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c for-6.2/writeback-2022-12-12 2022-12-15 18:09:48 -08:00
fsopen.c uninline may_mount() and don't opencode it in fspick(2)/fsopen(2) 2022-05-19 23:25:10 -04:00
init.c
inode.c fs.vfsuid.conversion.v6.2 2022-12-12 19:20:05 -08:00
internal.h fs.ovl.setgid.v6.2 2022-12-12 19:03:10 -08:00
ioctl.c Fixes for 5.18-rc1: 2022-04-01 19:35:56 -07:00
Kconfig hugetlb: make hugetlb depends on SYSFS or SYSCTL 2022-09-11 20:26:10 -07:00
Kconfig.binfmt Xtensa updates for v6.1 2022-10-10 14:21:11 -07:00
kernel_read_file.c fs/kernel_read_file: allow to read files up-to ssize_t 2022-06-16 19:58:21 -07:00
libfs.c libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value 2022-11-30 16:13:16 -08:00
locks.c Add process name and pid to locks warning 2022-11-30 05:08:10 -05:00
Makefile a.out: Remove the a.out implementation 2022-09-27 07:11:02 -07:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2022-12-08 21:49:25 -05:00
mount.h switch try_to_unlazy_next() to __legitimize_mnt() 2022-07-05 16:18:21 -04:00
mpage.c Folio changes for 6.0 2022-08-03 10:35:43 -07:00
namei.c Landlock updates for v6.2-rc1 2022-12-13 09:14:50 -08:00
namespace.c fs.idmapped.mnt_idmap.v6.2 2022-12-12 19:30:18 -08:00
no-block.c
nsfs.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
open.c Landlock updates for v6.2-rc1 2022-12-13 09:14:50 -08:00
pipe.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
pnode.c pnode: terminate at peers of source 2022-12-21 14:45:25 +01:00
pnode.h
posix_acl.c fs.idmapped.mnt_idmap.v6.2 2022-12-12 19:30:18 -08:00
proc_namespace.c vfs: escape hash as well 2022-06-28 13:58:05 -04:00
read_write.c iov_iter work; most of that is about getting rid of 2022-12-12 18:29:54 -08:00
readdir.c Change calling conventions for filldir_t 2022-08-17 17:25:04 -04:00
remap_range.c New VFS code for 6.2: 2022-12-13 10:26:38 -08:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
signalfd.c
splice.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
stack.c
stat.c fs: use type safe idmapping helpers 2022-10-26 10:02:34 +02:00
statfs.c
super.c misc pile 2022-12-12 18:38:47 -08:00
sync.c riscv: compat: syscall: Add compat_sys_call_table implementation 2022-04-26 13:36:25 -07:00
sysctls.c
timerfd.c
userfaultfd.c mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA 2023-01-11 16:14:20 -08:00
utimes.c
xattr.c fs.xattr.simple.rework.rbtree.rwlock.v6.2 2022-12-13 10:08:36 -08:00