2
0
mirror of https://github.com/edk2-porting/linux-next.git synced 2025-01-15 17:14:00 +08:00
linux-next/fs
Josef Bacik 8c99516a8c btrfs: exclude mmaps while doing remap
Darrick reported a potential issue to me where we could allow mmap
writes after validating a page range matched in the case of dedupe.
Generally we rely on lock page -> lock extent with the ordered flush to
protect us, but this is done after we check the pages because we use the
generic helpers, so we could modify the page in between doing the check
and locking the range.

There also exists a deadlock, as described by Filipe

"""
When cloning a file range, we lock the inodes, flush any delalloc within
the respective file ranges, wait for any ordered extents and then lock the
file ranges in both inodes. This means that right after we flush delalloc
and before we lock the file ranges, memory mapped writes can come in and
dirty pages in the file ranges of the clone operation.

Most of the time this is harmless and causes no problems. However, if we
are low on available metadata space, we can later end up in a deadlock
when starting a transaction to replace file extent items. This happens if
when allocating metadata space for the transaction, we need to wait for
the async reclaim thread to release space and the reclaim thread needs to
flush delalloc for the inode that got the memory mapped write and has its
range locked by the clone task.

Basically what happens is the following:

1) A clone operation locks inodes A and B, flushes delalloc for both
   inodes in the respective file ranges and waits for any ordered extents
   in those ranges to complete;

2) Before the clone task locks the file ranges, another task does a
   memory mapped write (which does not lock the inode) for one of the
   inodes of the clone operation. So now we have a dirty page in one of
   the ranges used by the clone operation;

3) The clone operation locks the file ranges for inodes A and B;

4) Later, when iterating over the file extents of inode A, the clone
   task attempts to start a transaction. There's not enough available
   free metadata space, so the async reclaim task is started (if not
   running already) and we wait for someone to wake us up on our
   reservation ticket;

5) The async reclaim task is not able to release space by any other
   means and decides to flush delalloc for the inode of the clone
   operation;

6) The workqueue job used to flush the inode blocks when starting
   delalloc for the inode, since the file range is currently locked by
   the clone task;

7) But the clone task is waiting on its reservation ticket and the async
   reclaim task is waiting on the flush job to complete, which can't
   progress since the clone task has the file range locked. So unless
   some other task is able to release space, for example an ordered
   extent for some other inode completes, we have a deadlock between all
   these tasks;

When this happens stack traces like the following show up in dmesg/syslog:

 INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  schedule+0x45/0xe0
  lock_extent_bits+0x1e6/0x2d0 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_invalidatepage+0x32c/0x390 [btrfs]
  ? __mod_memcg_state+0x8e/0x160
  __extent_writepage+0x2d4/0x400 [btrfs]
  extent_write_cache_pages+0x2b2/0x500 [btrfs]
  ? lock_release+0x20e/0x4c0
  ? trace_hardirqs_on+0x1b/0xf0
  extent_writepages+0x43/0x90 [btrfs]
  ? lock_acquire+0x1a3/0x490
  do_writepages+0x43/0xe0
  ? __filemap_fdatawrite_range+0xa4/0x100
  __filemap_fdatawrite_range+0xc5/0x100
  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  btrfs_work_helper+0xf1/0x600 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x50/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? kvm_clock_read+0x14/0x30
  ? wait_for_completion+0x81/0x110
  schedule+0x45/0xe0
  schedule_timeout+0x30c/0x580
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? lock_acquire+0x1a3/0x490
  ? try_to_wake_up+0x7a/0xa20
  ? lock_release+0x20e/0x4c0
  ? lock_acquired+0x199/0x490
  ? wait_for_completion+0x81/0x110
  wait_for_completion+0xab/0x110
  start_delalloc_inodes+0x2af/0x390 [btrfs]
  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
  flush_space+0x24f/0x660 [btrfs]
  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x20f/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
(...)
several other tasks blocked on inode locks held by the clone task below
(...)
 RIP: 0033:0x7f61efe73fff
 Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
 RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
 RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
 RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
 RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
 R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
 R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
 task: fdm-stress        state:D stack:    0 pid:2508234 ppid:2508153 flags:0x00004000
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  schedule+0x45/0xe0
  __reserve_bytes+0x4a4/0xb10 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
  start_transaction+0x2d1/0x760 [btrfs]
  btrfs_replace_file_extents+0x120/0x930 [btrfs]
  ? lock_release+0x20e/0x4c0
  btrfs_clone+0x3e4/0x7e0 [btrfs]
  ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
  btrfs_clone_files+0xf6/0x150 [btrfs]
  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
  do_clone_file_range+0xd4/0x1f0
  vfs_clone_file_range+0x4d/0x230
  ? lock_release+0x20e/0x4c0
  ioctl_file_clone+0x8f/0xc0
  do_vfs_ioctl+0x342/0x750
  __x64_sys_ioctl+0x62/0xb0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""

Fix both of these issues by excluding mmaps from happening we are doing
any sort of remap, which prevents this race completely.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
..
9p Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-27 08:07:12 -08:00
adfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
affs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
afs afs: Use wait_on_page_writeback_killable 2021-03-23 20:54:37 +00:00
autofs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
befs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
bfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
btrfs btrfs: exclude mmaps while doing remap 2021-04-19 17:25:15 +02:00
cachefiles cachefiles, afs: mm wait fixes 2021-03-24 10:22:00 -07:00
ceph idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
cifs cifs: escape spaces in share names 2021-04-07 21:30:27 -05:00
coda fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
configfs configfs: fix a use-after-free in __configfs_open_file 2021-03-11 12:13:48 +01:00
cramfs cramfs: use %pD instead of messing with file_dentry()->d_name 2021-01-05 23:02:47 -05:00
crypto block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
debugfs Driver core / debugfs update for 5.12-rc1 2021-02-24 10:13:55 -08:00
devpts
dlm fs: dlm: check on existing node address 2020-11-10 12:14:20 -06:00
ecryptfs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
efivarfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
efs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
erofs Change since last update: 2021-03-13 12:26:22 -08:00
exfat idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
exportfs exportfs: Add a function to return the raw output from fh_to_dentry() 2020-12-09 09:39:38 -05:00
ext2 fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
ext4 Miscellaneous ext4 bug fixes for v5.12. 2021-03-21 14:06:10 -07:00
f2fs block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
fat idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
freevxfs
fscache
fuse fuse: 32-bit user space ioctl compat for fuse device 2021-03-16 15:20:16 +01:00
gfs2 Two more gfs2 fixes 2021-04-03 12:15:01 -07:00
hfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
hfsplus idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
hostfs hostfs: fix memory handling in follow_link() 2021-03-25 18:57:42 -04:00
hpfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
hugetlbfs hugetlbfs: remove unneeded return value of hugetlb_vmtruncate() 2021-02-24 13:38:35 -08:00
iomap Merge branch 'iomap-5.12-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux 2021-03-18 10:37:30 -07:00
isofs isofs: handle large user and group ID 2021-02-03 19:05:52 +01:00
jbd2 block: use an on-stack bio in blkdev_issue_flush 2021-01-27 09:51:48 -07:00
jffs2 idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
jfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-27 08:07:12 -08:00
kernfs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
lockd SUNRPC: Make trace_svc_process() display the RPC procedure symbolically 2021-01-25 09:36:23 -05:00
minix fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
nfs nfs: we don't support removing system.nfs4_acl 2021-03-11 13:17:42 -05:00
nfs_common NFSv4_2: SSC helper should use its own config. 2021-01-28 10:55:37 -05:00
nfsd NFSD: fix error handling in NFSv4.0 callbacks 2021-03-11 10:58:49 -05:00
nilfs2 block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
nls
notify idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
ntfs ntfs: check for valid standard information attribute 2021-02-24 13:38:26 -08:00
ocfs2 ocfs2: fix deadlock between setattr and dio_end_io_write 2021-04-09 14:54:23 -07:00
omfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
openpromfs
orangefs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
overlayfs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
proc mm: use is_cow_mapping() across tree where proper 2021-03-13 11:27:30 -08:00
pstore pstore fixes for v5.12-rc2 2021-03-05 17:21:25 -08:00
qnx4 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
qnx6 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
quota quota: Fix memory leak when handling corrupted quota file 2021-01-05 14:42:18 +01:00
ramfs ramfs: support O_TMPFILE 2021-02-24 13:38:26 -08:00
reiserfs reiserfs: update reiserfs_xattrs_initialized() condition 2021-03-30 14:27:32 -07:00
romfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
squashfs squashfs: fix xattr id and id lookup sanity checks 2021-03-25 09:22:55 -07:00
sysfs sysfs: Support zapping of binary attr mmaps 2021-01-12 14:26:31 +01:00
sysv fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
tracefs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
ubifs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
udf idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
ufs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
unicode unicode: Add utf8_casefold_hash 2020-09-10 14:03:31 -07:00
vboxsf fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
verity idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
xfs xfs: also reject BULKSTAT_SINGLE in a mount user namespace 2021-03-15 08:50:41 -07:00
zonefs zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone() 2021-03-17 08:56:50 +09:00
aio.c Merge branch 'akpm' (patches from Andrew) 2020-12-15 12:53:37 -08:00
anon_inodes.c fs: anon_inodes: rephrase to appropriate kernel-doc 2021-01-15 12:17:25 -05:00
attr.c ima: handle idmapped mounts 2021-01-24 14:27:20 +01:00
bad_inode.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
binfmt_aout.c
binfmt_elf_fdpic.c Merge branch 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2021-02-21 13:20:41 -08:00
binfmt_elf.c Merge branch 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2021-02-21 13:20:41 -08:00
binfmt_em86.c
binfmt_flat.c binfmt_flat: revert "binfmt_flat: don't offset the data start" 2020-08-24 08:49:13 +10:00
binfmt_misc.c binfmt_misc: fix possible deadlock in bm_register_write 2021-03-13 11:27:30 -08:00
binfmt_script.c
block_dev.c block: don't ignore REQ_NOWAIT for direct IO 2021-04-02 08:34:30 -06:00
buffer.c fs: buffer: use raw page_memcg() on locked page 2021-02-24 13:38:30 -08:00
char_dev.c
compat_binfmt_elf.c get rid of COMPAT_ELF_EXEC_PAGESIZE 2021-01-06 08:42:51 -05:00
coredump.c fs/coredump: use kmap_local_page() 2021-02-26 09:41:05 -08:00
d_path.c fs: fix NULL dereference due to data race in prepend_path() 2020-10-14 14:54:45 -07:00
dax.c mm: provide a saner PTE walking API for modules 2021-02-09 07:05:44 -05:00
dcache.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-09 14:54:23 -07:00
drop_caches.c
eventfd.c eventfd: Export eventfd_ctx_do_read() 2020-11-15 09:49:10 -05:00
eventpoll.c kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE 2021-02-16 09:59:41 +01:00
exec.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
fcntl.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
fhandle.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
file_table.c epoll: take epitem list out of struct file 2020-10-25 20:02:08 -04:00
file.c file: fix close_range() for unshare+cloexec 2021-04-02 14:11:10 +02:00
filesystems.c
fs_context.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
fs_parser.c fs_parse: mark fs_param_bad_value() as static 2020-10-13 18:38:27 -07:00
fs_pin.c
fs_struct.c vfs: Use sequence counter with associated spinlock 2020-07-29 16:14:27 +02:00
fs_types.c
fs-writeback.c fs: improve comments for writeback_single_inode() 2021-01-13 17:26:50 +01:00
fsopen.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
init.c init: handle idmapped mounts 2021-01-24 14:27:19 +01:00
inode.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-27 08:07:12 -08:00
internal.h idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
io_uring.c io_uring: fix early sqd_list removal sqpoll hangs 2021-04-14 13:07:27 -06:00
io-wq.c io-wq: cancel unbounded works on io-wq destroy 2021-04-08 13:33:17 -06:00
io-wq.h io_uring: remove structures from include/linux/io_uring.h 2021-03-18 09:44:35 -06:00
ioctl.c fs: remove ksys_ioctl 2020-07-31 08:16:01 +02:00
Kconfig s390,alpha: make TMPFS_INODE64 available again 2021-03-08 10:46:30 +01:00
Kconfig.binfmt Merge branch 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-21 09:29:23 -08:00
kernel_read_file.c fs/kernel_file_read: Add "offset" arg for partial reads 2020-10-05 13:37:04 +02:00
libfs.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
locks.c Revert "nfsd4: a client's own opens needn't prevent delegations" 2021-03-09 10:37:34 -05:00
Makefile fs: Remove dcookies support 2021-01-29 10:06:46 +05:30
mbcache.c
mount.h mount: make {lock,unlock}_mount_hash() static 2021-01-24 14:29:34 +01:00
mpage.c block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
namei.c LOOKUP_MOUNTPOINT: we are cleaning "jumped" flag too late 2021-04-06 20:33:00 -04:00
namespace.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-27 08:07:12 -08:00
no-block.c
nsfs.c
open.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
pipe.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
pnode.c
pnode.h mount: fix mounting of detached mounts onto targets that reside on shared mounts 2021-03-08 15:18:43 +01:00
posix_acl.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
proc_namespace.c fs: introduce MOUNT_ATTR_IDMAP 2021-01-24 14:43:45 +01:00
read_write.c teach sendfile(2) to handle send-to-pipe directly 2021-01-25 23:29:36 -05:00
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-17 11:39:49 -07:00
remap_range.c ioctl: handle idmapped mounts 2021-01-24 14:27:19 +01:00
select.c kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data() 2021-03-16 22:13:10 +01:00
seq_file.c fs: fix kernel-doc markups 2021-01-21 14:06:00 -07:00
signalfd.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
splice.c for-5.12/block-2021-02-17 2021-02-21 11:02:48 -08:00
stack.c
stat.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
statfs.c s390,alpha: switch to 64-bit ino_t 2021-02-13 17:17:53 +01:00
super.c It has been a relatively quiet cycle in docsland. 2021-02-22 10:57:46 -08:00
sync.c
timerfd.c
userfaultfd.c userfaultfd: use secure anon inodes for userfaultfd 2021-01-14 17:40:57 -05:00
utimes.c utimes: handle idmapped mounts 2021-01-24 14:27:18 +01:00
xattr.c namei: handle idmapped mounts in may_*() helpers 2021-01-24 14:27:17 +01:00