Go to file
Dave Chinner c85007e2e3 xfs: don't use BMBT btree split workers for IO completion
When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.

A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:

1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.

2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.

3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run

4. The split work from the unwritten extent conversion must run
first.

5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.

6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.

At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....

This is a highly improbably series of events, but there it is.

There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.

This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.

The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:24 -08:00
arch - Start checking for -mindirect-branch-cs-prefix clang support too now that LLVM 2023-01-29 11:17:34 -08:00
block block-6.2-2023-01-20 2023-01-20 12:44:41 -08:00
certs certs: make system keyring depend on built-in x509 parser 2022-09-24 04:31:18 +09:00
crypto This update includes the following changes: 2022-12-14 12:31:09 -08:00
Documentation - Start checking for -mindirect-branch-cs-prefix clang support too now that LLVM 2023-01-29 11:17:34 -08:00
drivers - Start checking for -mindirect-branch-cs-prefix clang support too now that LLVM 2023-01-29 11:17:34 -08:00
fs xfs: don't use BMBT btree split workers for IO completion 2023-02-05 08:48:24 -08:00
include drm fixes for 6.2-rc6 2023-01-27 13:18:14 -08:00
init Kbuild fixes for v6.2 (3rd) 2023-01-21 10:56:37 -08:00
io_uring io_uring: always prep_async for drain requests 2023-01-27 06:29:29 -07:00
ipc Non-MM patches for 6.2-rc1. 2022-12-12 17:28:58 -08:00
kernel - Cleanup the firmware node for the new IRQ MSI domain properly, to 2023-01-29 11:26:49 -08:00
lib hardening fixes for v6.2-rc6 2023-01-27 16:09:12 -08:00
LICENSES LICENSES: Add the copyleft-next-0.3.1 license 2022-11-08 15:44:01 +01:00
mm Revert "mm/compaction: fix set skip in fast_find_migrateblock" 2023-01-29 10:38:43 -08:00
net net: mctp: mark socks as dead on unhash, prevent re-add 2023-01-25 13:07:37 +00:00
rust rust: print: avoid evaluating arguments in pr_* macros in unsafe blocks 2023-01-16 00:54:35 +01:00
samples ftrace: Export ftrace_free_filter() to modules 2023-01-24 11:20:58 -05:00
scripts Fix up more non-executable files marked executable 2023-01-28 11:17:57 -08:00
security tomoyo: Update website link 2023-01-13 23:11:38 +09:00
sound treewide: fix up files incorrectly marked executable 2023-01-26 10:05:39 -08:00
tools gpio fixes for v6.2-rc6 2023-01-27 13:47:40 -08:00
usr usr/gen_init_cpio.c: remove unnecessary -1 values from int file 2022-10-03 14:21:44 -07:00
virt VFIO fixes for v6.2-rc6 2023-01-23 11:56:07 -08:00
.clang-format iommufd for 6.2 2022-12-14 09:15:43 -08:00
.cocciconfig
.get_maintainer.ignore get_maintainer: add Alan to .get_maintainer.ignore 2022-08-20 15:17:44 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore *.rpm 2022-12-30 17:22:14 +09:00
.mailmap 21 hotfixes. Thirteen of these address pre-6.1 issues and hence have 2023-01-16 16:36:39 -08:00
.rustfmt.toml rust: add .rustfmt.toml 2022-09-28 09:02:20 +02:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: Update MPTCP maintainer list and CREDITS 2023-01-23 21:42:13 -08:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS Tracing updates for 6.2: 2023-01-27 16:03:32 -08:00
Makefile Linux 6.2-rc6 2023-01-29 13:59:43 -08:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.