linux/include
Zhihao Cheng b9bda5f601 vfs: Don't evict inode under the inode lru traversing context
commit 2a0629834c upstream.

The inode reclaiming process(See function prune_icache_sb) collects all
reclaimable inodes and mark them with I_FREEING flag at first, at that
time, other processes will be stuck if they try getting these inodes
(See function find_inode_fast), then the reclaiming process destroy the
inodes by function dispose_list(). Some filesystems(eg. ext4 with
ea_inode feature, ubifs with xattr) may do inode lookup in the inode
evicting callback function, if the inode lookup is operated under the
inode lru traversing context, deadlock problems may happen.

Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
        if ea_inode feature is enabled, the lookup process will be stuck
	under the evicting context like this:

 1. File A has inode i_reg and an ea inode i_ea
 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
 3. Then, following three processes running like this:

    PA                              PB
 echo 2 > /proc/sys/vm/drop_caches
  shrink_slab
   prune_dcache_sb
   // i_reg is added into lru, lru->i_ea->i_reg
   prune_icache_sb
    list_lru_walk_one
     inode_lru_isolate
      i_ea->i_state |= I_FREEING // set inode state
     inode_lru_isolate
      __iget(i_reg)
      spin_unlock(&i_reg->i_lock)
      spin_unlock(lru_lock)
                                     rm file A
                                      i_reg->nlink = 0
      iput(i_reg) // i_reg->nlink is 0, do evict
       ext4_evict_inode
        ext4_xattr_delete_inode
         ext4_xattr_inode_dec_ref_all
          ext4_xattr_inode_iget
           ext4_iget(i_ea->i_ino)
            iget_locked
             find_inode_fast
              __wait_on_freeing_inode(i_ea) ----→ AA deadlock
    dispose_list // cannot be executed by prune_icache_sb
     wake_up_bit(&i_ea->i_state)

Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
        deleting process holds BASEHD's wbuf->io_mutex while getting the
	xattr inode, which could race with inode reclaiming process(The
        reclaiming process could try locking BASEHD's wbuf->io_mutex in
	inode evicting function), then an ABBA deadlock problem would
	happen as following:

 1. File A has inode ia and a xattr(with inode ixa), regular file B has
    inode ib and a xattr.
 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
 3. Then, following three processes running like this:

        PA                PB                        PC
                echo 2 > /proc/sys/vm/drop_caches
                 shrink_slab
                  prune_dcache_sb
                  // ib and ia are added into lru, lru->ixa->ib->ia
                  prune_icache_sb
                   list_lru_walk_one
                    inode_lru_isolate
                     ixa->i_state |= I_FREEING // set inode state
                    inode_lru_isolate
                     __iget(ib)
                     spin_unlock(&ib->i_lock)
                     spin_unlock(lru_lock)
                                                   rm file B
                                                    ib->nlink = 0
 rm file A
  iput(ia)
   ubifs_evict_inode(ia)
    ubifs_jnl_delete_inode(ia)
     ubifs_jnl_write_inode(ia)
      make_reservation(BASEHD) // Lock wbuf->io_mutex
      ubifs_iget(ixa->i_ino)
       iget_locked
        find_inode_fast
         __wait_on_freeing_inode(ixa)
          |          iput(ib) // ib->nlink is 0, do evict
          |           ubifs_evict_inode
          |            ubifs_jnl_delete_inode(ib)
          ↓             ubifs_jnl_write_inode
     ABBA deadlock ←-----make_reservation(BASEHD)
                   dispose_list // cannot be executed by prune_icache_sb
                    wake_up_bit(&ixa->i_state)

Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING
to pin the inode in memory while inode_lru_isolate() reclaims its pages
instead of using ordinary inode reference. This way inode deletion
cannot be triggered from inode_lru_isolate() thus avoiding the deadlock.
evict() is made to wait for I_LRU_ISOLATING to be cleared before
proceeding with inode cleanup.

Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Fixes: 7959cf3a75 ("ubifs: journal: Handle xattrs like files")
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29 17:33:13 +02:00
..
acpi ACPICA: Add a depth argument to acpi_execute_reg_methods() 2024-08-29 17:33:13 +02:00
asm-generic vmlinux.lds.h: catch .bss..L* sections into BSS") 2024-08-03 08:53:35 +02:00
clocksource
crypto crypto: af_alg - Disallow multiple in-flight AIO requests 2024-01-25 15:35:16 -08:00
drm drm/mipi-dsi: Fix theoretical int overflow in mipi_dsi_generic_write_seq() 2024-08-03 08:53:45 +02:00
dt-bindings clk: renesas: r8a779g0: Correct PFC/GPIO parent clocks 2024-03-26 18:19:47 -04:00
keys
kunit
kvm KVM: arm64: Fix host-programmed guest events in nVHE 2024-04-10 16:35:48 +02:00
linux vfs: Don't evict inode under the inode lru traversing context 2024-08-29 17:33:13 +02:00
math-emu
media media: cec: core: avoid recursive cec_claim_log_addrs 2024-06-12 11:12:43 +02:00
memory
misc
net ipv6: fix source address selection with route leak 2024-08-14 13:59:03 +02:00
pcmcia
ras
rdma RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz 2023-12-13 18:45:16 +01:00
rv rv: Set variable 'da_mon_##name' to static 2023-09-01 21:00:00 -04:00
scsi scsi: mpi3mr: Fix ATA NCQ priority support 2024-06-21 14:38:25 +02:00
soc soc: qcom: rpmh-rsc: Enhance check for VRM in-flight request 2024-06-16 13:47:33 +02:00
sound ASoc: tas2781: Enable RCA-based playback without DSP firmware download 2024-08-03 08:53:57 +02:00
target
trace platform/x86/intel/ifs: Gen2 Scan test support 2024-08-14 13:58:37 +02:00
uapi Revert "misc: fastrpc: Restrict untrusted app to attach to privileged PD" 2024-08-29 17:33:10 +02:00
ufs scsi: ufs: mcq: Fix missing argument 'hba' in MCQ_OPR_OFFSET_n 2024-08-03 08:53:56 +02:00
vdso
video fbdev: stifb: Make the STI next font pointer a 32-bit signed offset 2023-11-28 17:19:58 +00:00
xen xen/events: reduce externally visible helper functions 2024-03-01 13:34:57 +01:00