linux/fs
Boris Burkov b73a6fd1b1 btrfs: split partial dio bios before submit
If an application is doing direct io to a btrfs file and experiences a
page fault reading from the write buffer, iomap will issue a partial
bio, and allow the fs to keep going. However, there was a subtle bug in
this code path in the btrfs dio iomap implementation that led to the
partial write ending up as a gap in the file's extents and to be read
back as zeros.

The sequence of events in a partial write, lightly summarized and
trimmed down for brevity is as follows:

==== WRITING TASK ====
 btrfs_direct_write
 __iomap_dio_write
 iomap_iter
 btrfs_dio_iomap_begin # create full ordered extent
 iomap_dio_bio_iter
 bio_iov_iter_get_pages # page fault; partial read
 submit_bio # partial bio
 iomap_iter
 btrfs_dio_iomap_end
 btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
				# submit to finish_ordered_fn wq
 fault_in_iov_iter_readable # btrfs_direct_write detects partial write
 __iomap_dio_write
 iomap_iter
 btrfs_dio_iomap_begin # create second partial ordered extent
 iomap_dio_bio_iter
 bio_iov_iter_get_pages # read all of remainder
 submit_bio # partial bio with all of remainder
 iomap_iter
 btrfs_dio_iomap_end # nothing exciting to do with ordered io

==== DIO ENDIO ====
== FIRST PARTIAL BIO ==
 btrfs_dio_end_io
 btrfs_mark_ordered_io_finished # bytes_left > 0
			        # don't submit to finish_ordered_fn wq
== SECOND PARTIAL BIO ==
 btrfs_dio_end_io
 btrfs_mark_ordered_io_finished # bytes_left == 0
			        # submit to finish_ordered_fn wq

==== BTRFS FINISH ORDERED WQ ====
== FIRST PARTIAL BIO ==
 btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
		         # BTRFS_ORDERED_IOERR, just drops the
		         # ordered_extent
==SECOND PARTIAL BIO==
 btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
		         # extents, csums, etc...

The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is completing an ordered_extent that results in crucial
functionality like writing out a file extent for the range.

More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
Thus, the finish io work doesn't result in file extents, csums, etc.
In the aftermath, such a file behaves as though it has a hole in it,
instead of the purportedly written data.

We considered a few options for fixing the bug:

  1. treat the partial bio as if we had truncated the file, which would
     result in properly finishing it.
  2. split the ordered extent when submitting a partial bio.
  3. cache the ordered extent across calls to __iomap_dio_rw in
     iter->private, so that we could reuse it and correctly apply
     several bios to it.

I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the benefit of also not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, that turned out to re-introduce a
deadlock which this code discarding the ordered_extent between faults
was meant to fix in the first place. (Link to an explanation of the
deadlock below.)

Therefore, go with fix 2, which requires a bit more setup work but fixes
the corruption without introducing the deadlock, which is fundamentally
caused by the ordered extent existing when we attempt to fault in a
range that overlaps with it.

Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if it
is, extract the ordered_extent that matches the bio exactly out of the
larger ordered_extent. Keep the remaining ordered_extent around in dio_data
for cancellation in iomap_end.

Thanks to Josef, Christoph, and Filipe with their help figuring out the
bug and the fix.

Fixes: 51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
Link: https://pastebin.com/3SDaH8C6
Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: refactored the ordered_extent extraction ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
..
9p 9P FS: Fix wild-memory-access write in v9fs_get_acl 2023-03-27 00:34:16 +00:00
adfs fs: port ->setattr() to pass mnt_idmap 2023-01-19 09:24:02 +01:00
affs for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
afs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
autofs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
befs befs: Convert befs_symlink_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
bfs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
btrfs btrfs: split partial dio bios before submit 2023-04-17 18:01:21 +02:00
cachefiles fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
ceph Two small fixes from Xiubo and myself, marked for stable. 2023-03-02 10:48:30 -08:00
cifs cifs: fix negotiate context parsing 2023-04-15 18:26:56 -05:00
coda hardening updates for v6.3-rc1 2023-02-21 11:07:23 -08:00
configfs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
cramfs fs/cramfs/inode.c: initialize file_ra_state 2023-03-02 21:54:23 -08:00
crypto fscrypt: check for NULL keyring in fscrypt_put_master_key_activeref() 2023-03-18 21:08:03 -07:00
debugfs ARM: SoC drivers for 6.3 2023-02-27 10:04:49 -08:00
devpts
dlm Driver core changes for 6.3-rc1 2023-02-24 12:58:55 -08:00
ecryptfs This update includes the following changes: 2023-02-21 18:10:50 -08:00
efivarfs A healthy mix of EFI contributions this time: 2023-02-23 14:41:48 -08:00
efs
erofs erofs: use wrapper i_blocksize() in erofs_file_read_iter() 2023-03-09 23:36:04 +08:00
exfat Description for this pull request: 2023-03-01 08:42:27 -08:00
exportfs fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
ext2 for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
ext4 ext4: fix possible double unlock when moving a directory 2023-03-17 21:53:52 -04:00
f2fs f2fs-for-6.3-rc1 2023-02-27 16:18:51 -08:00
fat There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
freevxfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
fscache fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() 2023-01-30 12:51:54 +00:00
fuse fuse update for 6.3 2023-02-27 09:53:58 -08:00
gfs2 Reinstate "GFS2: free disk inode which is deleted by remote node -V2" 2023-03-23 19:37:56 +01:00
hfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
hfsplus fs: hfsplus: fix UAF issue in hfsplus_put_super 2023-03-02 21:54:23 -08:00
hostfs This pull request contains the following changes for UML: 2023-03-01 09:13:00 -08:00
hpfs fs: port ->rename() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
hugetlbfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
iomap - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
isofs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
jbd2 Bug fixes and regressions for ext4, the most serious of which is a 2023-03-12 08:55:55 -07:00
jffs2 This pull request contains updates for JFFS2, UBI and UBIFS 2023-03-01 09:06:51 -08:00
jfs Just one simple sanity check 2023-03-01 08:47:19 -08:00
kernfs Driver core changes for 6.3-rc1 2023-02-24 12:58:55 -08:00
ksmbd ksmbd: avoid out of bounds access in decode_preauth_ctxt() 2023-04-13 14:17:32 -05:00
lockd lockd: set file_lock start and end when decoding nlm4 testargs 2023-03-14 14:00:55 -04:00
minix Merge branch 'work.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:01:15 -08:00
netfs netfs: Fix netfs_extract_iter_to_sg() for ITER_UBUF/IOVEC 2023-04-12 09:26:36 -07:00
nfs nfsd-6.3 fixes: 2023-04-04 11:20:55 -07:00
nfs_common filelock: move file locking definitions to separate header file 2023-01-11 06:52:32 -05:00
nfsd nfsd-6.3 fixes: 2023-04-04 11:20:55 -07:00
nilfs2 nilfs2: fix sysfs interface lifetime 2023-04-05 18:06:24 -07:00
nls
notify RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
ntfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
ntfs3 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
ocfs2 ocfs2: fix data corruption after failed write 2023-03-07 17:04:55 -08:00
omfs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
openpromfs
orangefs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
overlayfs fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
proc capability: just use a 'u64' instead of a 'u32[2]' array 2023-03-01 10:01:22 -08:00
pstore pstore updates for v6.2-rc1-fixes 2022-12-23 11:55:54 -08:00
qnx4
qnx6 fs/qnx6: delete unnecessary checks before brelse() 2022-09-11 21:55:07 -07:00
quota RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
ramfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
reiserfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
romfs mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping() 2023-01-18 17:12:56 -08:00
smbfs_common smb3: Replace smb2pdu 1-element arrays with flex-arrays 2023-02-20 17:25:43 -06:00
squashfs revert "squashfs: harden sanity check in squashfs_read_xattr_id_table" 2023-02-03 17:52:25 -08:00
sysfs
sysv Merge branch 'work.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:03:26 -08:00
tracefs fs: port ->mkdir() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
ubifs This pull request contains updates for JFFS2, UBI and UBIFS 2023-03-01 09:06:51 -08:00
udf udf: Warn if block mapping is done for in-ICB files 2023-03-06 16:38:25 +01:00
ufs fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
unicode
vboxsf fs: port ->rename() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
verity fsverity: don't drop pagecache at end of FS_IOC_ENABLE_VERITY 2023-03-15 22:50:41 -07:00
xfs xfs: fix mismerged tracepoints 2023-03-24 13:16:01 -07:00
zonefs zonefs: Do not propagate iomap_dio_rw() ENOTBLK error to user space 2023-03-30 20:56:02 +09:00
aio.c Merge branch 'mm-hotfixes-stable' into mm-stable 2023-02-10 15:34:48 -08:00
anon_inodes.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
attr.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
bad_inode.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
binfmt_elf_fdpic.c elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size} 2023-01-05 15:12:12 +00:00
binfmt_elf_test.c
binfmt_elf.c Linux 6.2-rc6 2023-01-31 15:01:20 +01:00
binfmt_flat.c
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
char_dev.c chardev: fix error handling in cdev_device_add() 2022-12-02 17:48:59 +01:00
compat_binfmt_elf.c
coredump.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
d_path.c d_path.c: typo fix... 2022-08-20 11:34:33 -04:00
dax.c fsdax: force clear dirty mark if CoW 2023-04-05 18:06:23 -07:00
dcache.c tmpfile API change 2022-10-10 19:45:17 -07:00
direct-io.c fs: move sb_init_dio_done_wq out of direct-io.c 2023-01-26 10:30:56 -07:00
drop_caches.c
eventfd.c eventfd: provide a eventfd_signal_mask() helper 2022-11-22 06:07:55 -07:00
eventpoll.c eventpoll: add EPOLL_URING_WAKE poll wakeup flag 2022-11-21 07:45:29 -07:00
exec.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
fcntl.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
fhandle.c do_sys_name_to_handle(): constify path 2022-09-01 17:36:39 -04:00
file_table.c filelock: move file locking definitions to separate header file 2023-01-11 06:52:32 -05:00
file.c fs: prevent out-of-bounds array speculation when closing a file descriptor 2023-03-09 22:46:21 -05:00
filesystems.c
fs_context.c
fs_parser.c ext4: journal_path mount options should follow links 2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio() 2023-02-02 22:33:19 -08:00
fsopen.c
init.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
inode.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
internal.h for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
ioctl.c fs: port inode_owner_or_capable() to mnt_idmap 2023-01-19 09:24:29 +01:00
Kconfig fs: build the legacy direct I/O code conditionally 2023-01-26 10:30:56 -07:00
Kconfig.binfmt Xtensa updates for v6.1 2022-10-10 14:21:11 -07:00
kernel_read_file.c
libfs.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
locks.c filelocks: use mount idmapping for setlease permission check 2023-03-09 22:36:12 +01:00
Makefile for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2022-12-08 21:49:25 -05:00
mnt_idmapping.c fs: move mnt_idmap 2023-01-19 09:24:30 +01:00
mount.h
mpage.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
namei.c NFSD 6.3 Release Notes 2023-02-22 14:21:40 -08:00
namespace.c fs: drop peer group ids under namespace lock 2023-03-31 12:13:37 +02:00
no-block.c
nsfs.c nsfs: repair kernel-doc for ns_match() 2023-01-11 15:47:40 -05:00
open.c vfs: avoid duplicating creds in faccessat if possible 2023-02-27 16:39:19 -08:00
pipe.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
pnode.c pnode: terminate at peers of source 2022-12-21 14:45:25 +01:00
pnode.h
posix_acl.c fs.acl.v6.3 2023-02-20 12:14:33 -08:00
proc_namespace.c
read_write.c iov_iter work; most of that is about getting rid of 2022-12-12 18:29:54 -08:00
readdir.c Change calling conventions for filldir_t 2022-08-17 17:25:04 -04:00
remap_range.c fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap 2023-01-19 09:24:29 +01:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
signalfd.c
splice.c splice: Remove redundant assignment to ret 2023-03-09 10:10:31 +01:00
stack.c
stat.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
statfs.c
super.c fscrypt: destroy keyring after security_sb_delete() 2023-03-14 10:30:30 -07:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c mm: replace vma->vm_flags direct modifications with modifier calls 2023-02-09 16:51:39 -08:00
utimes.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
xattr.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00