linux/fs/xfs
Dave Chinner 348a1983cf xfs: fix unlink vs cluster buffer instantiation race
Luis has been reporting an assert failure when freeing an inode
cluster during inode inactivation for a while. The assert looks
like:

 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
 ------------[ cut here ]------------
 kernel BUG at fs/xfs/xfs_message.c:102!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
 CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
 RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
 RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
 RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
 RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
 R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
 R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
 FS:  0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
 xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
 xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
 xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
 xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
 __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
 xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
 xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
 xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
 process_one_work (kernel/workqueue.c:3231)
 worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
 kthread (kernel/kthread.c:389)
 ret_from_fork (arch/x86/kernel/process.c:147)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
  </TASK>

It occurs when the inode precommit handler attempts to look
up the inode cluster buffer to attach the inode for writeback.

The trail of logic that I can reconstruct is as follows.

	1. the inode is clean when inodegc runs, so it is not
	   attached to a cluster buffer when precommit runs.

	2. #1 implies the inode cluster buffer may be clean and not
	   pinned by dirty inodes when inodegc runs.

	3. #2 implies that the inode cluster buffer can be reclaimed
	   by memory pressure at any time.

	4. The assert failure implies that the cluster buffer was
	   attached to the transaction, having been accessed earlier
	   in that transaction, but was never marked done.

	5. #4 implies the cluster buffer has been invalidated (i.e.
	   marked stale).

	6. #5 implies that the inode cluster buffer was instantiated
	   uninitialised in the transaction in xfs_ifree_cluster(),
	   which only instantiates the buffers to invalidate them
	   and never marks them as done.
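
The trail above can be replayed with a toy userspace model. This is
only a sketch: the flag names come from the text, but the bit values,
struct model_buf and every model_*() function are hypothetical
stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names are from the text above; the bit values are made up. */
#define XBF_DONE	(1u << 0)
#define XBF_STALE	(1u << 1)

/* Hypothetical stand-in for a cluster buffer, not struct xfs_buf. */
struct model_buf {
	unsigned int	b_flags;
	bool		in_trans;	/* attached to the running transaction */
};

/*
 * Step 6: the buffer is instantiated uninitialised in the transaction
 * purely so that it can be invalidated. Nothing reads it from disk, so
 * nothing ever sets XBF_DONE.
 */
void model_ifree_cluster(struct model_buf *bp)
{
	bp->b_flags = 0;
	bp->in_trans = true;
	bp->b_flags |= XBF_STALE;	/* step 5: invalidated */
}

/*
 * Step 4: the precommit lookup finds the buffer already attached to the
 * transaction and expects XBF_DONE. Returns true if that holds.
 */
bool model_precommit_lookup_ok(const struct model_buf *bp)
{
	assert(bp->in_trans);	/* only the transaction match path is modelled */
	return (bp->b_flags & XBF_DONE) != 0;
}

/* Replays the whole sequence and reports whether the expectation held. */
bool model_replay_failure(void)
{
	struct model_buf bp = { 0 };

	model_ifree_cluster(&bp);
	return model_precommit_lookup_ok(&bp);
}
```

In this model model_replay_failure() returns false, which corresponds
to the condition the assert in the trace above complains about.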

Given factors 1-3, this issue is highly dependent on timing and
environmental factors. Hence the issue can be very difficult to
reproduce in some situations, but highly reliable in others. Luis
has an environment where it can be reproduced easily by g/531 but,
OTOH, I've reproduced it only once in ~2000 cycles of g/531.

I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
on the cluster buffers, even though they may not be initialised. The
reasons why I think this is safe are:

	1. A buffer cache lookup hit on an XBF_STALE buffer will
	   clear the XBF_DONE flag. Hence all future users of the
	   buffer know they have to re-initialise the contents
	   before use and mark it done themselves.

	2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
	   means the buffer remains locked until the journal commit
	   completes and the buffer is unpinned. Hence once marked
	   XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
	   context that can access the freed buffer is the currently
	   running transaction.

	3. #2 implies that future buffer lookups in the currently
	   running transaction will hit the transaction match code
	   and not the buffer cache. Hence XBF_STALE and
	   XFS_BLI_STALE will not be cleared unless the transaction
	   initialises and logs the buffer with valid contents
	   again. At which point, the buffer will be marked
	   XBF_DONE again, so having XBF_DONE already set on the
	   stale buffer is a moot point.

	4. #2 also implies that any concurrent access to that
	   cluster buffer will block waiting on the buffer lock
	   until the inode cluster has been fully freed and is no
	   longer an active inode cluster buffer.

	5. #4 + #1 means that any future user of the disk range of
	   that buffer will always see the range of disk blocks
	   covered by the cluster buffer as not done, and hence must
	   initialise the contents themselves.

	6. Setting XBF_DONE in xfs_ifree_cluster() then means the
	   unlinked inode precommit code will see an XBF_DONE buffer
	   from the transaction match as it expects. It can then
	   attach the stale but newly dirtied inode to the stale
	   but newly dirtied cluster buffer without unexpected
	   failures. The stale buffer will then sail through the
	   journal and do the right thing with the attached stale
	   inode during unpin.
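
Reasons 1, 3 and 6 can be checked against a similar toy model. Again
this is a sketch under assumptions: struct toy_buf and the toy_*()
functions are hypothetical, the bit values are made up, and setting
XBF_DONE ahead of the invalidation is a choice of the model rather
than a statement about the ordering in the real function.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names are from the text above; the bit values are made up. */
#define XBF_DONE	(1u << 0)
#define XBF_STALE	(1u << 1)

/* Hypothetical stand-in for a cluster buffer, not struct xfs_buf. */
struct toy_buf {
	unsigned int	b_flags;
};

/* The proposed fix: mark the buffer done as well as invalidating it. */
void toy_ifree_cluster_fixed(struct toy_buf *bp)
{
	bp->b_flags = XBF_DONE;		/* the one extra line */
	bp->b_flags |= XBF_STALE;	/* invalidation, as before */
}

/*
 * Reason 3: a lookup from inside the running transaction hits the
 * transaction match code, so the flags come back untouched.
 */
unsigned int toy_trans_match_lookup(const struct toy_buf *bp)
{
	return bp->b_flags;
}

/*
 * Reason 1: a buffer cache lookup hit on a stale buffer clears
 * XBF_DONE. Per reasons 2 and 4 this can only happen once the freeing
 * transaction has finished with the buffer.
 */
unsigned int toy_cache_lookup(struct toy_buf *bp)
{
	if (bp->b_flags & XBF_STALE)
		bp->b_flags &= ~XBF_DONE;
	return bp->b_flags;
}

/* Walks the fixed sequence and asserts the properties argued above. */
void toy_check_fix(void)
{
	struct toy_buf bp;

	toy_ifree_cluster_fixed(&bp);
	/* Reason 6: precommit sees a done buffer from the transaction match. */
	assert(toy_trans_match_lookup(&bp) & XBF_DONE);
	/* Reasons 1 and 5: later users see it as not done. */
	assert(!(toy_cache_lookup(&bp) & XBF_DONE));
}
```

Calling toy_check_fix() from a main() exercises both lookup paths; it
says nothing about locking, which reasons 2 and 4 rely on and which
this model leaves out.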

Hence the fix is just one line of extra code. The explanation of
why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
complex....

Fixes: 82842fee6e ("xfs: fix AGF vs inode cluster buffer deadlock")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-06-17 11:17:09 +05:30
..
libxfs xfs: make sure sb_fdblocks is non-negative 2024-06-10 11:38:12 +05:30
scrub xfs: don't open-code u64_to_user_ptr 2024-05-27 15:55:52 +05:30
Kconfig xfs: support in-memory btrees 2024-02-22 12:43:35 -08:00
Makefile New code for 6.10: 2024-05-20 12:55:12 -07:00
xfs_acl.c xfs: make attr removal an explicit operation 2024-04-23 07:46:51 -07:00
xfs_acl.h
xfs_aops.c xfs: xfs_quota_unreserve_blkres can't fail 2024-05-03 11:15:03 +05:30
xfs_aops.h
xfs_attr_inactive.c xfs: report dir/attr block corruption errors to the health system 2024-02-22 12:32:18 -08:00
xfs_attr_item.c xfs: fix xfs_init_attr_trans not handling explicit operation codes 2024-05-27 15:55:52 +05:30
xfs_attr_item.h xfs: create attr log item opcodes and formats for parent pointers 2024-04-23 07:46:57 -07:00
xfs_attr_list.c xfs: pass the attr value to put_listent when possible 2024-04-23 07:47:00 -07:00
xfs_bio_io.c
xfs_bmap_item.c xfs: simplify iext overflow checking and upgrade 2024-05-03 11:20:06 +05:30
xfs_bmap_item.h xfs: move xfs_bmap_defer_add to xfs_bmap_item.c 2024-02-22 12:44:21 -08:00
xfs_bmap_util.c xfs: simplify iext overflow checking and upgrade 2024-05-03 11:20:06 +05:30
xfs_bmap_util.h xfs: xfs_quota_unreserve_blkres can't fail 2024-05-03 11:15:03 +05:30
xfs_buf_item_recover.c xfs: convert remaining kmem_free() to kfree() 2024-02-13 18:07:34 +05:30
xfs_buf_item.c xfs: convert remaining kmem_free() to kfree() 2024-02-13 18:07:34 +05:30
xfs_buf_item.h
xfs_buf_mem.c xfs: fix dev_t usage in xmbuf tracepoints 2024-03-15 10:30:23 +05:30
xfs_buf_mem.h xfs: launder in-memory btree buffers before transaction commit 2024-02-22 12:43:36 -08:00
xfs_buf.c getting rid of bogus set_blocksize() uses, switching it 2024-05-21 08:34:51 -07:00
xfs_buf.h New code for 6.9: 2024-03-13 13:52:24 -07:00
xfs_dahash_test.c xfs: test the ascii case-insensitive hash 2023-04-11 19:05:05 -07:00
xfs_dahash_test.h xfs: test dir/attr hash when loading module 2023-03-19 09:55:49 -07:00
xfs_dir2_readdir.c xfs: refactor dir format helpers 2024-04-26 11:21:46 +05:30
xfs_discard.c xfs: fix performance problems when fstrimming a subset of a fragmented AG 2024-04-15 14:59:00 -07:00
xfs_discard.h xfs: move log discard work to xfs_discard.c 2023-10-04 09:24:02 +11:00
xfs_dquot_item_recover.c xfs: dquot recovery does not validate the recovered dquot 2023-11-22 23:39:36 +05:30
xfs_dquot_item.c
xfs_dquot_item.h
xfs_dquot.c xfs: simplify iext overflow checking and upgrade 2024-05-03 11:20:06 +05:30
xfs_dquot.h xfs: Increase XFS_QM_TRANS_MAXDQS to 5 2024-04-15 14:59:01 -07:00
xfs_drain.c xfs: minimize overhead of drain wakeups by using jump labels 2023-04-11 18:59:59 -07:00
xfs_drain.h xfs: minimize overhead of drain wakeups by using jump labels 2023-04-11 18:59:59 -07:00
xfs_error.c xfs: add error injection to test file mapping exchange recovery 2024-04-15 14:54:19 -07:00
xfs_error.h
xfs_exchmaps_item.c xfs: capture inode generation numbers in the ondisk exchmaps log item 2024-04-15 14:54:24 -07:00
xfs_exchmaps_item.h xfs: create deferred log items for file mapping exchanges 2024-04-15 14:54:17 -07:00
xfs_exchrange.c xfs: support non-power-of-two rtextsize with exchange-range 2024-04-15 14:54:23 -07:00
xfs_exchrange.h xfs: create deferred log items for file mapping exchanges 2024-04-15 14:54:17 -07:00
xfs_export.c xfs: add parent pointer ioctls 2024-04-23 07:47:00 -07:00
xfs_export.h xfs: add parent pointer ioctls 2024-04-23 07:47:00 -07:00
xfs_extent_busy.c xfs: unwind xfs_extent_busy_clear 2024-04-22 12:53:34 +05:30
xfs_extent_busy.h xfs: repair free space btrees 2023-12-15 10:03:32 -08:00
xfs_extfree_item.c xfs: convert remaining kmem_free() to kfree() 2024-02-13 18:07:34 +05:30
xfs_extfree_item.h
xfs_file.c New code for 6.10: 2024-05-20 12:55:12 -07:00
xfs_file.h xfs: create a new helper to return a file's allocation unit 2024-04-15 14:54:10 -07:00
xfs_filestream.c xfs: convert remaining kmem_free() to kfree() 2024-02-13 18:07:34 +05:30
xfs_filestream.h xfs: pass perag to filestreams tracing 2023-02-13 09:14:56 +11:00
xfs_fsmap.c xfs: refactor realtime inode locking 2024-04-22 18:00:47 +05:30
xfs_fsmap.h
xfs_fsops.c xfs: split xfs_mod_freecounter 2024-04-22 18:00:47 +05:30
xfs_fsops.h xfs: split xfs_mod_freecounter 2024-04-22 18:00:47 +05:30
xfs_globals.c xfs: add debug knobs to control btree bulk load slack factors 2023-12-15 10:03:28 -08:00
xfs_handle.c xfs: don't open-code u64_to_user_ptr 2024-05-27 15:55:52 +05:30
xfs_handle.h xfs: add parent pointer ioctls 2024-04-23 07:47:00 -07:00
xfs_health.c xfs: report directory tree corruption in the health information 2024-04-23 16:55:17 -07:00
xfs_hooks.c xfs: allow scrub to hook metadata updates in other writers 2024-02-22 12:30:45 -08:00
xfs_hooks.h xfs: allow scrub to hook metadata updates in other writers 2024-02-22 12:30:45 -08:00
xfs_icache.c xfs: widen flags argument to the xfs_iflags_* helpers 2024-05-02 07:48:37 -07:00
xfs_icache.h xfs: use per-mount cpumask to track nonempty percpu inodegc lists 2023-09-11 08:39:03 -07:00
xfs_icreate_item.c xfs: convert kmem_free() for kvmalloc users to kvfree() 2024-02-13 18:07:34 +05:30
xfs_icreate_item.h
xfs_inode_item_recover.c xfs: convert remaining kmem_free() to kfree() 2024-02-13 18:07:34 +05:30
xfs_inode_item.c xfs: Replace xfs_isilocked with xfs_assert_ilocked 2024-02-19 21:19:33 +05:30
xfs_inode_item.h xfs: fix AGF vs inode cluster buffer deadlock 2023-06-05 04:08:27 +10:00
xfs_inode.c xfs: fix unlink vs cluster buffer instantiation race 2024-06-17 11:17:09 +05:30
xfs_inode.h xfs: widen flags argument to the xfs_iflags_* helpers 2024-05-02 07:48:37 -07:00
xfs_ioctl32.c xfs: move handle ioctl code to xfs_handle.c 2024-04-23 07:47:00 -07:00
xfs_ioctl32.h arch: Remove Itanium (IA-64) architecture 2023-09-11 08:13:17 +00:00
xfs_ioctl.c xfs: introduce vectored scrub mode 2024-04-23 16:55:18 -07:00
xfs_ioctl.h xfs: move handle ioctl code to xfs_handle.c 2024-04-23 07:47:00 -07:00
xfs_iomap.c xfs: simplify iext overflow checking and upgrade 2024-05-03 11:20:06 +05:30
xfs_iomap.h
xfs_iops.c xfs: parent pointer attribute creation 2024-04-23 07:46:58 -07:00
xfs_iops.h xfs: declare xfs_file.c symbols in xfs_file.h 2024-04-15 14:54:09 -07:00
xfs_itable.c xfs: hide private inodes from bulkstat and handle functions 2024-04-15 14:58:48 -07:00
xfs_itable.h
xfs_iunlink_item.c xfs: create traced helper to get extra perag references 2023-04-11 18:59:55 -07:00
xfs_iunlink_item.h
xfs_iwalk.c xfs: Clear W=1 warning in xfs_iwalk_run_callbacks() 2024-05-27 15:54:24 +05:30
xfs_iwalk.h
xfs_linux.h xfs: refactor non-power-of-two alignment checks 2024-04-15 14:54:12 -07:00
xfs_log_cil.c xfs: fix CIL sparse lock context warnings 2024-04-20 20:23:59 +05:30
xfs_log_priv.h xfs: Fix typo in comment 2024-04-22 12:51:43 +05:30
xfs_log_recover.c xfs: clean up buffer allocation in xlog_do_recovery_pass 2024-05-03 11:10:17 +05:30
xfs_log.c xfs: only clear log incompat flags at clean unmount 2024-04-15 14:54:06 -07:00
xfs_log.h xfs: only clear log incompat flags at clean unmount 2024-04-15 14:54:06 -07:00
xfs_message.c
xfs_message.h
xfs_mount.c xfs: use an XFS_OPSTATE_ flag for detecting if logged xattrs are available 2024-04-23 07:46:51 -07:00
xfs_mount.h xfs: use an XFS_OPSTATE_ flag for detecting if logged xattrs are available 2024-04-23 07:46:51 -07:00
xfs_mru_cache.c xfs: use GFP_KERNEL in pure transaction contexts 2024-02-13 18:07:35 +05:30
xfs_mru_cache.h
xfs_notify_failure.c mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind 2023-12-07 14:34:26 +05:30
xfs_pnfs.c
xfs_pnfs.h
xfs_pwork.c
xfs_pwork.h
xfs_qm_bhv.c xfs: track quota updates during live quotacheck 2024-02-22 12:30:55 -08:00
xfs_qm_syscalls.c
xfs_qm.c xfs: Hold inode locks in xfs_ialloc 2024-04-15 14:59:02 -07:00
xfs_qm.h xfs: Increase XFS_QM_TRANS_MAXDQS to 5 2024-04-15 14:59:01 -07:00
xfs_quota.h xfs: xfs_quota_unreserve_blkres can't fail 2024-05-03 11:15:03 +05:30
xfs_quotaops.c
xfs_refcount_item.c xfs: place intent recovery under NOFS allocation context 2024-02-13 18:07:35 +05:30
xfs_refcount_item.h
xfs_reflink.c xfs: Add cond_resched to block unmap range and reflink remap path 2024-05-27 20:50:35 +05:30
xfs_reflink.h
xfs_rmap_item.c xfs: place intent recovery under NOFS allocation context 2024-02-13 18:07:35 +05:30
xfs_rmap_item.h
xfs_rtalloc.c xfs: simplify iext overflow checking and upgrade 2024-05-03 11:20:06 +05:30
xfs_rtalloc.h xfs: move xfs_bmap_rtalloc to xfs_rtalloc.c 2023-12-22 11:18:11 +05:30
xfs_stats.c xfs: define an in-memory btree for storing refcount bag info during repairs 2024-02-22 12:43:40 -08:00
xfs_stats.h xfs: define an in-memory btree for storing refcount bag info during repairs 2024-02-22 12:43:40 -08:00
xfs_super.c xfs: add a incompat feature bit for parent pointers 2024-04-23 07:47:01 -07:00
xfs_super.h xfs: create scaffolding for creating debugfs entries 2023-08-10 07:48:07 -07:00
xfs_symlink.c xfs: add parent attributes to symlink 2024-04-23 07:46:58 -07:00
xfs_symlink.h xfs: move remote symlink target read function to libxfs 2024-02-22 12:45:17 -08:00
xfs_sysctl.c fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
xfs_sysctl.h xfs: add debug knobs to control btree bulk load slack factors 2023-12-15 10:03:28 -08:00
xfs_sysfs.c xfs: remove duplicate ifdefs 2024-02-17 09:32:32 +05:30
xfs_sysfs.h
xfs_trace.c xfs: add parent pointer ioctls 2024-04-23 07:47:00 -07:00
xfs_trace.h tracing/treewide: Remove second parameter of __assign_str() 2024-05-22 20:14:47 -04:00
xfs_trans_ail.c xfs: convert remaining kmem_free() to kfree() 2024-02-13 18:07:34 +05:30
xfs_trans_buf.c xfs: launder in-memory btree buffers before transaction commit 2024-02-22 12:43:36 -08:00
xfs_trans_dquot.c xfs: Increase XFS_QM_TRANS_MAXDQS to 5 2024-04-15 14:59:01 -07:00
xfs_trans_priv.h
xfs_trans.c xfs: split xfs_mod_freecounter 2024-04-22 18:00:47 +05:30
xfs_trans.h xfs: don't use current->journal_info 2024-03-25 10:21:01 +05:30
xfs_xattr.c xfs: make the reserved block permission flag explicit in xfs_attr_set 2024-04-23 07:47:03 -07:00
xfs_xattr.h xfs: remove xfs_da_args.attr_flags 2024-04-23 07:46:50 -07:00
xfs.h