[ Upstream commit f2f26b4a84 ]
Patch series "nilfs2: fix kernel bug at submit_bh_wbc()".
This resolves a kernel BUG reported by syzbot. Since there are two
flaws involved, I've made each one a separate patch.
The first patch alone resolves the syzbot-reported bug, but I think
both fixes should be sent to stable, so I've tagged them as such.
This patch (of 2):
Syzbot has reported a kernel bug in submit_bh_wbc() when writing file data
to a nilfs2 file system whose metadata is corrupted.
There are two flaws involved in this issue.
The first flaw is that when nilfs_get_block() locates a data block using
btree or direct mapping, if the disk address translation routine
nilfs_dat_translate() fails with internal code -ENOENT due to DAT metadata
corruption, it can be passed back to nilfs_get_block(). This causes
nilfs_get_block() to misidentify an existing block as non-existent,
causing both data block lookup and insertion to fail inconsistently.
The second flaw is that nilfs_get_block() returns a successful status in
this inconsistent state. This causes the caller __block_write_begin_int()
or others to request a read even though the buffer is not mapped,
resulting in a BUG_ON check for the BH_Mapped flag in submit_bh_wbc()
failing.
This fixes the first issue by changing the return value to code -EINVAL
when a conversion using DAT fails with code -ENOENT, avoiding the
conflicting condition that leads to the kernel bug described above. Here,
code -EINVAL indicates that metadata corruption was detected during the
block lookup, which will be properly handled as a file system error and
converted to -EIO when passing through the nilfs2 bmap layer.
Link: https://lkml.kernel.org/r/20240313105827.5296-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20240313105827.5296-2-konishi.ryusuke@gmail.com
Fixes: c3a7abf06c ("nilfs2: support contiguous lookup of blocks")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+cfed5b56649bddf80d6e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cfed5b56649bddf80d6e
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 74b0ebcbdd ]
In f2fs_do_write_data_page, FI_ATOMIC_FILE flag selects the target inode
between the original inode and COW inode. When aborting atomic write and
writeback occur simultaneously, invalid data can be written to original
inode if the FI_ATOMIC_FILE flag is cleared meanwhile.
To prevent the problem, let's truncate all pages before clearing the flag
Atomic write thread Writeback thread
f2fs_abort_atomic_write
clear_inode_flag(inode, FI_ATOMIC_FILE)
__writeback_single_inode
do_writepages
f2fs_do_write_data_page
- use dn of original inode
truncate_inode_pages_final
Fixes: 3db1de0e58 ("f2fs: change the current atomic write way")
Cc: stable@vger.kernel.org #v5.19+
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Yeongjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Sunmin Jeong <s_min.jeong@samsung.com>
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4bf7832234 ]
In f2fs_update_inode, i_size of the atomic file isn't updated until
FI_ATOMIC_COMMITTED flag is set. When committing atomic write right
after the writeback of the inode, i_size of the raw inode will not be
updated. It can cause the atomicity corruption due to a mismatch between
old file size and new data.
To prevent the problem, let's mark inode dirty for FI_ATOMIC_COMMITTED
Atomic write thread Writeback thread
__writeback_single_inode
write_inode
f2fs_update_inode
- skip i_size update
f2fs_ioc_commit_atomic_write
f2fs_commit_atomic_write
set_inode_flag(inode, FI_ATOMIC_COMMITTED)
f2fs_do_sync_file
f2fs_fsync_node_pages
- skip f2fs_update_inode since the inode is clean
Fixes: 3db1de0e58 ("f2fs: change the current atomic write way")
Cc: stable@vger.kernel.org #v5.19+
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Yeongjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Sunmin Jeong <s_min.jeong@samsung.com>
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 34cd86b663 ]
Use vfs_getattr() to retrieve stat information, rather than make
assumptions about how a filesystem fills inode structs.
Cc: stable@vger.kernel.org
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 17f46b803d ]
In production we have been hitting the following warning consistently
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0
Workqueue: nfsiod nfs_direct_write_schedule_work [nfs]
RIP: 0010:refcount_warn_saturate+0x9c/0xe0
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0x9f/0x130
? refcount_warn_saturate+0x9c/0xe0
? report_bug+0xcc/0x150
? handle_bug+0x3d/0x70
? exc_invalid_op+0x16/0x40
? asm_exc_invalid_op+0x16/0x20
? refcount_warn_saturate+0x9c/0xe0
nfs_direct_write_schedule_work+0x237/0x250 [nfs]
process_one_work+0x12f/0x4a0
worker_thread+0x14e/0x3b0
? ZSTD_getCParams_internal+0x220/0x220
kthread+0xdc/0x120
? __btf_name_valid+0xa0/0xa0
ret_from_fork+0x1f/0x30
This is because we're completing the nfs_direct_request twice in a row.
The source of this is when we have our commit requests to submit, we
process them and send them off, and then in the completion path for the
commit requests we have
if (nfs_commit_end(cinfo.mds))
nfs_direct_write_complete(dreq);
However since we're submitting asynchronous requests we sometimes have
one that completes before we submit the next one, so we end up calling
complete on the nfs_direct_request twice.
The only other place we use nfs_generic_commit_list() is in
__nfs_commit_inode, which wraps this call in a
nfs_commit_begin();
nfs_commit_end();
Which is a common pattern for this style of completion handling, one
that is also repeated in the direct code with get_dreq()/put_dreq()
calls around where we process events as well as in the completion paths.
Fix this by using the same pattern for the commit requests.
Before with my 200 node rocksdb stress running this warning would pop
every 10ish minutes. With my patch the stress test has been running for
several hours without popping.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a6b3bfe176 ]
We observed a corruption during on-line resize of a file system that is
larger than 16 TiB with 4k block size. With having more then 2^32 blocks
resize_inode is turned off by default by mke2fs. The issue can be
reproduced on a smaller file system for convenience by explicitly
turning off resize_inode. An on-line resize across an 8 GiB boundary (the
size of a meta block group in this setup) then leads to a corruption:
dev=/dev/<some_dev> # should be >= 16 GiB
mkdir -p /corruption
/sbin/mke2fs -t ext4 -b 4096 -O ^resize_inode $dev $((2 * 2**21 - 2**15))
mount -t ext4 $dev /corruption
dd if=/dev/zero bs=4096 of=/corruption/test count=$((2*2**21 - 4*2**15))
sha1sum /corruption/test
# 79d2658b39dcfd77274e435b0934028adafaab11 /corruption/test
/sbin/resize2fs $dev $((2*2**21))
# drop page cache to force reload the block from disk
echo 1 > /proc/sys/vm/drop_caches
sha1sum /corruption/test
# 3c2abc63cbf1a94c9e6977e0fbd72cd832c4d5c3 /corruption/test
2^21 = 2^15*2^6 equals 8 GiB whereof 2^15 is the number of blocks per
block group and 2^6 are the number of block groups that make a meta
block group.
The last checksum might be different depending on how the file is laid
out across the physical blocks. The actual corruption occurs at physical
block 63*2^15 = 2064384 which would be the location of the backup of the
meta block group's block descriptor. During the on-line resize the file
system will be converted to meta_bg starting at s_first_meta_bg which is
2 in the example - meaning all block groups after 16 GiB. However, in
ext4_flex_group_add we might add block groups that are not part of the
first meta block group yet. In the reproducer we achieved this by
substracting the size of a whole block group from the point where the
meta block group would start. This must be considered when updating the
backup block group descriptors to follow the non-meta_bg layout. The fix
is to add a test whether the group to add is already part of the meta
block group or not.
Fixes: 01f795f9e0 ("ext4: add online resizing support for meta_bg and 64-bit file systems")
Cc: <stable@vger.kernel.org>
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Tested-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
Reviewed-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
Link: https://lore.kernel.org/r/20240215155009.94493-1-mheyne@amazon.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ae6bd7f9b4 ]
At contains_pending_extent() the value of the end offset of a chunk we
found in the device's allocation state io tree is inclusive, so when
we calculate the length we pass to the in_range() macro, we must sum
1 to the expression "physical_end - physical_offset".
In practice the wrong calculation should be harmless as chunks sizes
are never 1 byte and we should never have 1 byte ranges of unallocated
space. Nevertheless fix the wrong calculation.
Reported-by: Alex Lyakas <alex.lyakas@zadara.com>
Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/
Fixes: 1c11b63eff ("btrfs: replace pending/pinned chunks lists with io tree")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d139ded8b9 ]
[BUG]
If qgroup is marked inconsistent (e.g. caused by operations needing full
subtree rescan, like creating a snapshot and assign to a higher level
qgroup), btrfs would immediately start leaking its data reserved space.
The following script can easily reproduce it:
mkfs.btrfs -O quota -f $dev
mount $dev $mnt
btrfs subvolume create $mnt/subv1
btrfs qgroup create 1/0 $mnt
# This snapshot creation would mark qgroup inconsistent,
# as the ownership involves different higher level qgroup, thus
# we have to rescan both source and snapshot, which can be very
# time consuming, thus here btrfs just choose to mark qgroup
# inconsistent, and let users to determine when to do the rescan.
btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1
# Now this write would lead to qgroup rsv leak.
xfs_io -f -c "pwrite 0 64k" $mnt/file1
# And at unmount time, btrfs would report 64K DATA rsv space leaked.
umount $mnt
And we would have the following dmesg output for the unmount:
BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3
BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536
[CAUSE]
Since commit e15e9f43c7 ("btrfs: introduce
BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"),
we introduce a mode for btrfs qgroup to skip the timing consuming
backref walk, if the qgroup is already inconsistent.
But this skip also covered the data reserved freeing, thus the qgroup
reserved space for each newly created data extent would not be freed,
thus cause the leakage.
[FIX]
Make the data extent reserved space freeing mandatory.
The qgroup reserved space handling is way cheaper compared to the
backref walking part, and we always have the super sensitive leak
detector, thus it's definitely worth to always free the qgroup
reserved data space.
Reported-by: Fabian Vogt <fvogt@suse.com>
Fixes: e15e9f43c7 ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting")
CC: stable@vger.kernel.org # 6.1+
Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b1fe686a76 ]
The root inode is assumed to be always hashed. Do not unhash the root
inode even if it is marked BAD.
Fixes: 5d069dbe8a ("fuse: fix bad inode")
Cc: <stable@vger.kernel.org> # v5.11
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 68ca1b49e4 ]
The root inode has a fixed nodeid and generation (1, 0).
Prior to the commit 15db16837a ("fuse: fix illegal access to inode with
reused nodeid") generation number on lookup was ignored. After this commit
lookup with the wrong generation number resulted in the inode being
unhashed. This is correct for non-root inodes, but replacing the root
inode is wrong and results in weird behavior.
Fix by reverting to the old behavior if ignoring the generation for the
root inode, but issuing a warning in dmesg.
Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Closes: https://lore.kernel.org/all/CAOQ4uxhek5ytdN8Yz2tNEOg5ea4NkBb4nk0FGPjPk_9nz-VG3g@mail.gmail.com/
Fixes: 15db16837a ("fuse: fix illegal access to inode with reused nodeid")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 723012cab7 ]
Page cache reads are lockless, so setting the freshly allocated page
uptodate before we've overwritten it with the data it's supposed to have
in it will allow a simultaneous reader to see old data. Move the call
to SetPageUptodate into ubifs_write_end(), which is after we copied the
new data into the page.
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fde2497d2b ]
When fat_encode_fh_nostale() encodes file handle without a parent it
stores only first 10 bytes of the file handle. However the length of the
file handle must be a multiple of 4 so the file handle is actually 12
bytes long and the last two bytes remain uninitialized. This is not
great at we potentially leak uninitialized information with the handle
to userspace. Properly initialize the full handle length.
Link: https://lkml.kernel.org/r/20240205122626.13701-1-jack@suse.cz
Reported-by: syzbot+3ce5dea5b1539ff36769@syzkaller.appspotmail.com
Fixes: ea3983ace6 ("fat: restructure export_operations")
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4fbf8bc733 ]
When yangerkun review commit 93cdf49f6e ("ext4: Fix best extent lstart
adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best
extent did not completely cover the original request after adjusting the
best extent lstart in ext4_mb_new_inode_pa() as follows:
original request: 2/10(8)
normalized request: 0/64(64)
best extent: 0/9(9)
When we check if best ex can be kept at start of goal, ac_o_ex.fe_logical
is 2 less than the adjusted best extent logical end 9, so we think the
adjustment is done. But obviously 0/9(9) doesn't cover 2/10(8), so we
should determine here if the original request logical end is less than or
equal to the adjusted best extent logical end.
In addition, add a comment stating when adjusted best_ex will not cover
the original request, and remove the duplicate assertion because adjusting
lstart makes no change to b_ex.fe_len.
Link: https://lore.kernel.org/r/3630fa7f-b432-7afd-5f79-781bc3b2c5ea@huawei.com
Fixes: 93cdf49f6e ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Cc: <stable@kernel.org>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20240201141845.1879253-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 719fcafe07 ]
We've been seeing the following panic in production
BUG: kernel NULL pointer dereference, address: 0000000000000065
PGD 2f485f067 P4D 2f485f067 PUD 2cc5d8067 PMD 0
RIP: 0010:ff_layout_cancel_io+0x3a/0x90 [nfs_layout_flexfiles]
Call Trace:
<TASK>
? __die+0x78/0xc0
? page_fault_oops+0x286/0x380
? __rpc_execute+0x2c3/0x470 [sunrpc]
? rpc_new_task+0x42/0x1c0 [sunrpc]
? exc_page_fault+0x5d/0x110
? asm_exc_page_fault+0x22/0x30
? ff_layout_free_layoutreturn+0x110/0x110 [nfs_layout_flexfiles]
? ff_layout_cancel_io+0x3a/0x90 [nfs_layout_flexfiles]
? ff_layout_cancel_io+0x6f/0x90 [nfs_layout_flexfiles]
pnfs_mark_matching_lsegs_return+0x1b0/0x360 [nfsv4]
pnfs_error_mark_layout_for_return+0x9e/0x110 [nfsv4]
? ff_layout_send_layouterror+0x50/0x160 [nfs_layout_flexfiles]
nfs4_ff_layout_prepare_ds+0x11f/0x290 [nfs_layout_flexfiles]
ff_layout_pg_init_write+0xf0/0x1f0 [nfs_layout_flexfiles]
__nfs_pageio_add_request+0x154/0x6c0 [nfs]
nfs_pageio_add_request+0x26b/0x380 [nfs]
nfs_do_writepage+0x111/0x1e0 [nfs]
nfs_writepages_callback+0xf/0x30 [nfs]
write_cache_pages+0x17f/0x380
? nfs_pageio_init_write+0x50/0x50 [nfs]
? nfs_writepages+0x6d/0x210 [nfs]
? nfs_writepages+0x6d/0x210 [nfs]
nfs_writepages+0x125/0x210 [nfs]
do_writepages+0x67/0x220
? generic_perform_write+0x14b/0x210
filemap_fdatawrite_wbc+0x5b/0x80
file_write_and_wait_range+0x6d/0xc0
nfs_file_fsync+0x81/0x170 [nfs]
? nfs_file_mmap+0x60/0x60 [nfs]
__x64_sys_fsync+0x53/0x90
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Inspecting the core with drgn I was able to pull this
>>> prog.crashed_thread().stack_trace()[0]
#0 at 0xffffffffa079657a (ff_layout_cancel_io+0x3a/0x84) in ff_layout_cancel_io at fs/nfs/flexfilelayout/flexfilelayout.c:2021:27
>>> prog.crashed_thread().stack_trace()[0]['idx']
(u32)1
>>> prog.crashed_thread().stack_trace()[0]['flseg'].mirror_array[1].mirror_ds
(struct nfs4_ff_layout_ds *)0xffffffffffffffed
This is clear from the stack trace, we call nfs4_ff_layout_prepare_ds()
which could error out initializing the mirror_ds, and then we go to
clean it all up and our check is only for if (!mirror->mirror_ds). This
is inconsistent with the rest of the users of mirror_ds, which have
if (IS_ERR_OR_NULL(mirror_ds))
to keep from tripping over this exact scenario. Fix this up in
ff_layout_cancel_io() to make sure we don't panic when we get an error.
I also spot checked all the other instances of checking mirror_ds and we
appear to be doing the correct checks everywhere, only unconditionally
dereferencing mirror_ds when we know it would be valid.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Fixes: b739a5bd9d ("NFSv4/flexfiles: Cancel I/O if the layout is recalled or revoked")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0aec3847d0 ]
This reverts commit 57e9d49c54.
This undoes the hiding of .__afsXXXX silly-rename files. The problem with
hiding them is that rm can't then manually delete them.
This also reverts commit 5f7a076466 ("afs: Fix
endless loop in directory parsing") as that's a bugfix for the above.
Fixes: 57e9d49c54 ("afs: Hide silly-rename files from userspace")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Link: https://lists.infradead.org/pipermail/linux-afs/2024-February/008102.html
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3085695.1710328121@warthog.procyon.org.uk
Reviewed-by: Jeffrey E Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2f6d721e14 ]
When a file only needs one direct_node, performing the following
operations will cause the file to be unrepairable:
unisoc # ./f2fs_io compress test.apk
unisoc #df -h | grep dm-48
/dev/block/dm-48 112G 112G 1.2M 100% /data
unisoc # ./f2fs_io release_cblocks test.apk
924
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 4.8M 100% /data
unisoc # dd if=/dev/random of=file4 bs=1M count=3
3145728 bytes (3.0 M) copied, 0.025 s, 120 M/s
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 1.8M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
F2FS_IOC_RESERVE_COMPRESS_BLOCKS failed: No space left on device
adb reboot
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 11M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
0
This is because the file has only one direct_node. After returning
to -ENOSPC, reserved_blocks += ret will not be executed. As a result,
the reserved_blocks at this time is still 0, which is not the real
number of reserved blocks. Therefore, fsck cannot be set to repair
the file.
After this patch, the fsck flag will be set to fix this problem.
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 1.8M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
F2FS_IOC_RESERVE_COMPRESS_BLOCKS failed: No space left on device
adb reboot then fsck will be executed
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 11M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
924
Fixes: c75488fb4d ("f2fs: introduce F2FS_IOC_RESERVE_COMPRESS_BLOCKS")
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b7d797d241 ]
The following f2fs_io test will get a "0" result instead of -EINVAL,
unisoc # ./f2fs_io compress file
unisoc # ./f2fs_io reserve_cblocks file
0
it's not reasonable, so the judgement of
atomic_read(&F2FS_I(inode)->i_compr_blocks) should be placed after
the judgement of is_inode_flag_set(inode, FI_COMPRESS_RELEASED).
Fixes: c75488fb4d ("f2fs: introduce F2FS_IOC_RESERVE_COMPRESS_BLOCKS")
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 698ad1a538 ]
The intent is to check if 'dest' is truncated or not. So, >= should be
used instead of >, because strlcat() returns the length of 'dest' and 'src'
excluding the trailing NULL.
Fixes: 56463e50d1 ("NFS: Use super.c for NFSROOT mount option parsing")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7af2df0f67 ]
In cfd66bb715 ("f2fs: fix deadloop in foreground GC"), we needed to check
the number of blocks in a section instead of the segment.
Fixes: cfd66bb715 ("f2fs: fix deadloop in foreground GC")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ea59b12ac6 ]
It needs to check compress flag w/ .i_sem lock, otherwise, compressed
inode may be disabled after the check condition, it's not needed to
set compress option on non-compress inode.
Fixes: e1e8debec6 ("f2fs: add F2FS_IOC_SET_COMPRESS_OPTION ioctl")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bcac8bff90 ]
Switch order of operations to avoid creating a short XDR buffer:
e.g., buflen = 12, old xdrlen = 12, new xdrlen = 20.
Having a short XDR buffer leads to lxa_maxcount be a few bytes
less than what is needed to retrieve the whole list when using
a buflen as returned by a call with size = 0:
buflen = listxattr(path, NULL, 0);
buf = malloc(buflen);
buflen = listxattr(path, buf, buflen);
For a file with one attribute (name = '123456'), the first call
with size = 0 will return buflen = 12 ('user.123456\x00').
The second call with size = 12, sends LISTXATTRS with
lxa_maxcount = 12 + 8 (cookie) + 4 (array count) = 24. The
XDR buffer needs 8 (cookie) + 4 (array count) + 4 (name count)
+ 6 (name len) + 2 (padding) + 4 (eof) = 28 which is 4 bytes
shorter than the lxa_maxcount provided in the call.
Fixes: 04a5da690e ("NFSv4.2: define limits and sizes for user xattr handling")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 251a658bbf ]
A call to listxattr() with a buffer size = 0 returns the actual
size of the buffer needed for a subsequent call. When size > 0,
nfs4_listxattr() does not return an error because either
generic_listxattr() or nfs4_listxattr_nfs4_label() consumes
exactly all the bytes then size is 0 when calling
nfs4_listxattr_nfs4_user() which then triggers the following
kernel BUG:
[ 99.403778] kernel BUG at mm/usercopy.c:102!
[ 99.404063] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 99.408463] CPU: 0 PID: 3310 Comm: python3 Not tainted 6.6.0-61.fc40.aarch64 #1
[ 99.415827] Call trace:
[ 99.415985] usercopy_abort+0x70/0xa0
[ 99.416227] __check_heap_object+0x134/0x158
[ 99.416505] check_heap_object+0x150/0x188
[ 99.416696] __check_object_size.part.0+0x78/0x168
[ 99.416886] __check_object_size+0x28/0x40
[ 99.417078] listxattr+0x8c/0x120
[ 99.417252] path_listxattr+0x78/0xe0
[ 99.417476] __arm64_sys_listxattr+0x28/0x40
[ 99.417723] invoke_syscall+0x78/0x100
[ 99.417929] el0_svc_common.constprop.0+0x48/0xf0
[ 99.418186] do_el0_svc+0x24/0x38
[ 99.418376] el0_svc+0x3c/0x110
[ 99.418554] el0t_64_sync_handler+0x120/0x130
[ 99.418788] el0t_64_sync+0x194/0x198
[ 99.418994] Code: aa0003e3 d000a3e0 91310000 97f49bdb (d4210000)
Issue is reproduced when generic_listxattr() returns 'system.nfs4_acl',
thus calling lisxattr() with size = 16 will trigger the bug.
Add check on nfs4_listxattr() to return ERANGE error when it is
called with size > 0 and the return value is greater than size.
Fixes: 012a211abd ("NFSv4.2: hook in the user extended attribute handlers")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e39602da75 ]
f2fs only support to config zstd compress level w/ a positive number due
to layout design, but since commit e0c1b49f5b ("lib: zstd: Upgrade to
latest upstream zstd version 1.4.10"), zstd supports negative compress
level, so that zstd_min_clevel() may return a negative number, then w/
below mount option, .compress_level can be configed w/ a negative number,
which is not allowed to f2fs, let's add check condition to avoid it.
mount -o compress_algorithm=zstd:4294967295 /dev/sdx /mnt/f2fs
Fixes: 00e120b5e4 ("f2fs: assign default compression level")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 21ec682348 ]
During recovery, if FAULT_BLOCK is on, it is possible that
f2fs_reserve_new_block() will return -ENOSPC during recovery,
then it may trigger panic.
Also, if fault injection rate is 1 and only FAULT_BLOCK fault
type is on, it may encounter deadloop in loop of block reservation.
Let's change as below to fix these issues:
- remove bug_on() to avoid panic.
- limit the loop count of block reservation to avoid potential
deadloop.
Fixes: 956fa1ddc1 ("f2fs: fix to check return value of f2fs_reserve_new_block()")
Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5460749487 ]
In reserve_compress_blocks(), we update blkaddrs of dnode in prior to
inc_valid_block_count(), it may cause inconsistent status bewteen
i_blocks and blkaddrs once inc_valid_block_count() fails.
To fix this issue, it needs to reverse their invoking order.
Fixes: c75488fb4d ("f2fs: introduce F2FS_IOC_RESERVE_COMPRESS_BLOCKS")
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 59d0d4c3ea ]
This patch allows caller to pass blkaddr to f2fs_set_data_blkaddr()
and let __set_data_blkaddr() inside f2fs_set_data_blkaddr() to update
dn->data_blkaddr w/ last value of blkaddr.
Just cleanup, no logic changes.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: 5460749487 ("f2fs: compress: fix to avoid inconsistence bewteen i_blocks and dnode")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a539363613 ]
Commit 3c6c2bebef ("f2fs: avoid punch_hole overhead when releasing
volatile data") introduced FI_FIRST_BLOCK_WRITTEN as below reason:
This patch is to avoid some punch_hole overhead when releasing volatile
data. If volatile data was not written yet, we just can make the first
page as zero.
After commit 7bc155fec5 ("f2fs: kill volatile write support"), we
won't support volatile write, but it missed to remove obsolete
FI_FIRST_BLOCK_WRITTEN, delete it in this patch.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: 5460749487 ("f2fs: compress: fix to avoid inconsistence bewteen i_blocks and dnode")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3cf684f2f8 ]
Just use a simple if block for the conditional call to
inc_valid_block_count.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: 5460749487 ("f2fs: compress: fix to avoid inconsistence bewteen i_blocks and dnode")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit eb8fbaa533 ]
Compressed cluster may not be released due to we can fail in
release_compress_blocks(), fix to handle reserved compressed
cluster correctly in reserve_compress_blocks().
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fd244524c2 ]
When we overwrite compressed cluster w/ normal cluster, we should
not unlock cp_rwsem during f2fs_write_raw_pages(), otherwise data
will be corrupted if partial blocks were persisted before CP & SPOR,
due to cluster metadata wasn't updated atomically.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8a430dd49e ]
If data block in compressed cluster is not persisted with metadata
during checkpoint, after SPOR, the data may be corrupted, let's
guarantee to write compressed page by checkpoint.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2eae077e6e ]
This patch tries to use bitfield in struct f2fs_io_info to improve
memory usage.
struct f2fs_io_info {
...
unsigned int need_lock:8; /* indicate we need to lock cp_rwsem */
unsigned int version:8; /* version of the node */
unsigned int submitted:1; /* indicate IO submission */
unsigned int in_list:1; /* indicate fio is in io_list */
unsigned int is_por:1; /* indicate IO is from recovery or not */
unsigned int retry:1; /* need to reallocate block address */
unsigned int encrypted:1; /* indicate file is encrypted */
unsigned int post_read:1; /* require post read */
...
};
After this patch, size of struct f2fs_io_info reduces from 136 to 120.
[Nathan: fix a compile warning (single-bit-bitfield-constant-conversion)]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: 8a430dd49e ("f2fs: compress: fix to guarantee persisting compressed blocks by CP")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 179b8c97eb ]
Dquot pointers in i_dquot array in the inode are protected by
dquot_srcu. Annotate the array pointers with __rcu, perform the locked
dereferences with srcu_dereference_check() instead of plain reads, and
set the array elements with rcu_assign_pointer().
Fixes: b9ba6f94b2 ("quota: remove dqptr_sem")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202402061900.rTuYDlo6-lkp@intel.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d0aa72604f ]
Below race may cause NULL pointer dereference
P1 P2
dquot_free_inode quota_off
drop_dquot_ref
remove_dquot_ref
dquots = i_dquot(inode)
dquots = i_dquot(inode)
srcu_read_lock
dquots[cnt]) != NULL (1)
dquots[type] = NULL (2)
spin_lock(&dquots[cnt]->dq_dqb_lock) (3)
....
If dquot_free_inode(or other routines) checks inode's quota pointers (1)
before quota_off sets it to NULL(2) and use it (3) after that, NULL pointer
dereference will be triggered.
So let's fix it by using a temporary pointer to avoid this issue.
Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20240202081852.2514092-1-wangjianjian3@huawei.com>
Stable-dep-of: 179b8c97eb ("quota: Fix rcu annotations of inode dquot pointers")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7bce48f0fe ]
As Honza said, remove_inode_dquot_ref() currently does not release the
last dquot reference but instead adds the dquot to tofree_head list. This
is because dqput() can sleep while dropping of the last dquot reference
(writing back the dquot and calling ->release_dquot()) and that must not
happen under dq_list_lock. Now that dqput() queues the final dquot cleanup
into a workqueue, remove_inode_dquot_ref() can call dqput() unconditionally
and we can significantly simplify it.
Here we open code the simplified code of remove_inode_dquot_ref() into
remove_dquot_ref() and remove the function put_dquot_list() which is no
longer used.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230630110822.3881712-6-libaokun1@huawei.com>
Stable-dep-of: 179b8c97eb ("quota: Fix rcu annotations of inode dquot pointers")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a43e0fc5e9 ]
Unloading a modular pstore backend with records in pstorefs would
trigger the dput() double-drop warning:
WARNING: CPU: 0 PID: 2569 at fs/dcache.c:762 dput.part.0+0x3f3/0x410
Using the combo of d_drop()/dput() (as mentioned in
Documentation/filesystems/vfs.rst) isn't the right approach here, and
leads to the reference counting problem seen above. Use d_invalidate()
and update the code to not bother checking for error codes that can
never happen.
Suggested-by: Alexander Viro <viro@zeniv.linux.org.uk>
Fixes: 609e28bb13 ("pstore: Remove filesystem records when backend is unregistered")
Signed-off-by: Kees Cook <keescook@chromium.org>
[ Upstream commit ddb9fd7a54 ]
A while ago, we changed the way that select() and poll() preallocate
a temporary buffer just under the size of the static warning limit of
1024 bytes, as clang was frequently going slightly above that limit.
The warnings have recently returned and I took another look. As it turns
out, clang is not actually inherently worse at reserving stack space,
it just happens to inline do_select() into core_sys_select(), while gcc
never inlines it.
Annotate do_select() to never be inlined and in turn remove the special
case for the allocation size. This should give the same behavior for
both clang and gcc all the time and once more avoids those warnings.
Fixes: ad312f95d4 ("fs/select: avoid clang stack usage warning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240216202352.2492798-1-arnd@kernel.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ec16b147a5 ]
Reject values that are valid rw_hints after truncation but not before
truncation by passing an untruncated value to rw_hint_valid().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 5657cb0797 ("fs/fcntl: use copy_to/from_user() for u64 types")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-2-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c7bb26b847 ]
At btrfs_use_block_rsv() we read the size of a block reserve without
locking its spinlock, which makes KCSAN complain because the size of a
block reserve is always updated while holding its spinlock. The report
from KCSAN is the following:
[653.313148] BUG: KCSAN: data-race in btrfs_update_delayed_refs_rsv [btrfs] / btrfs_use_block_rsv [btrfs]
[653.314755] read to 0x000000017f5871b8 of 8 bytes by task 7519 on cpu 0:
[653.314779] btrfs_use_block_rsv+0xe4/0x2f8 [btrfs]
[653.315606] btrfs_alloc_tree_block+0xdc/0x998 [btrfs]
[653.316421] btrfs_force_cow_block+0x220/0xe38 [btrfs]
[653.317242] btrfs_cow_block+0x1ac/0x568 [btrfs]
[653.318060] btrfs_search_slot+0xda2/0x19b8 [btrfs]
[653.318879] btrfs_del_csums+0x1dc/0x798 [btrfs]
[653.319702] __btrfs_free_extent.isra.0+0xc24/0x2028 [btrfs]
[653.320538] __btrfs_run_delayed_refs+0xd3c/0x2390 [btrfs]
[653.321340] btrfs_run_delayed_refs+0xae/0x290 [btrfs]
[653.322140] flush_space+0x5e4/0x718 [btrfs]
[653.322958] btrfs_preempt_reclaim_metadata_space+0x102/0x2f8 [btrfs]
[653.323781] process_one_work+0x3b6/0x838
[653.323800] worker_thread+0x75e/0xb10
[653.323817] kthread+0x21a/0x230
[653.323836] __ret_from_fork+0x6c/0xb8
[653.323855] ret_from_fork+0xa/0x30
[653.323887] write to 0x000000017f5871b8 of 8 bytes by task 576 on cpu 3:
[653.323906] btrfs_update_delayed_refs_rsv+0x1a4/0x250 [btrfs]
[653.324699] btrfs_add_delayed_data_ref+0x468/0x6d8 [btrfs]
[653.325494] btrfs_free_extent+0x76/0x120 [btrfs]
[653.326280] __btrfs_mod_ref+0x6a8/0x6b8 [btrfs]
[653.327064] btrfs_dec_ref+0x50/0x70 [btrfs]
[653.327849] walk_up_proc+0x236/0xa50 [btrfs]
[653.328633] walk_up_tree+0x21c/0x448 [btrfs]
[653.329418] btrfs_drop_snapshot+0x802/0x1328 [btrfs]
[653.330205] btrfs_clean_one_deleted_snapshot+0x184/0x238 [btrfs]
[653.330995] cleaner_kthread+0x2b0/0x2f0 [btrfs]
[653.331781] kthread+0x21a/0x230
[653.331800] __ret_from_fork+0x6c/0xb8
[653.331818] ret_from_fork+0xa/0x30
So add a helper to get the size of a block reserve while holding the lock.
Reading the field while holding the lock instead of using the data_race()
annotation is used in order to prevent load tearing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e06cc89475 ]
At space_info.c we have several places where we access the ->reserved
field of a block reserve without taking the block reserve's spinlock
first, which makes KCSAN warn about a data race since that field is
always updated while holding the spinlock.
The reports from KCSAN are like the following:
[117.193526] BUG: KCSAN: data-race in btrfs_block_rsv_release [btrfs] / need_preemptive_reclaim [btrfs]
[117.195148] read to 0x000000017f587190 of 8 bytes by task 6303 on cpu 3:
[117.195172] need_preemptive_reclaim+0x222/0x2f0 [btrfs]
[117.195992] __reserve_bytes+0xbb0/0xdc8 [btrfs]
[117.196807] btrfs_reserve_metadata_bytes+0x4c/0x120 [btrfs]
[117.197620] btrfs_block_rsv_add+0x78/0xa8 [btrfs]
[117.198434] btrfs_delayed_update_inode+0x154/0x368 [btrfs]
[117.199300] btrfs_update_inode+0x108/0x1c8 [btrfs]
[117.200122] btrfs_dirty_inode+0xb4/0x140 [btrfs]
[117.200937] btrfs_update_time+0x8c/0xb0 [btrfs]
[117.201754] touch_atime+0x16c/0x1e0
[117.201789] filemap_read+0x674/0x728
[117.201823] btrfs_file_read_iter+0xf8/0x410 [btrfs]
[117.202653] vfs_read+0x2b6/0x498
[117.203454] ksys_read+0xa2/0x150
[117.203473] __s390x_sys_read+0x68/0x88
[117.203495] do_syscall+0x1c6/0x210
[117.203517] __do_syscall+0xc8/0xf0
[117.203539] system_call+0x70/0x98
[117.203579] write to 0x000000017f587190 of 8 bytes by task 11 on cpu 0:
[117.203604] btrfs_block_rsv_release+0x2e8/0x578 [btrfs]
[117.204432] btrfs_delayed_inode_release_metadata+0x7c/0x1d0 [btrfs]
[117.205259] __btrfs_update_delayed_inode+0x37c/0x5e0 [btrfs]
[117.206093] btrfs_async_run_delayed_root+0x356/0x498 [btrfs]
[117.206917] btrfs_work_helper+0x160/0x7a0 [btrfs]
[117.207738] process_one_work+0x3b6/0x838
[117.207768] worker_thread+0x75e/0xb10
[117.207797] kthread+0x21a/0x230
[117.207830] __ret_from_fork+0x6c/0xb8
[117.207861] ret_from_fork+0xa/0x30
So add a helper to get the reserved amount of a block reserve while
holding the lock. The value may be not be up to date anymore when used by
need_preemptive_reclaim() and btrfs_preempt_reclaim_metadata_space(), but
that's ok since the worst it can do is cause more reclaim work do be done
sooner rather than later. Reading the field while holding the lock instead
of using the data_race() annotation is used in order to prevent load
tearing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
In nfsd4_encode_fattr(), TIME_CREATE was being written out after all
other times. However, they should be written out in an order that
matches the bit flags in bmval1, which in this case are
#define FATTR4_WORD1_TIME_ACCESS (1UL << 15)
#define FATTR4_WORD1_TIME_CREATE (1UL << 18)
#define FATTR4_WORD1_TIME_DELTA (1UL << 19)
#define FATTR4_WORD1_TIME_METADATA (1UL << 20)
#define FATTR4_WORD1_TIME_MODIFY (1UL << 21)
so TIME_CREATE should come second.
I noticed this on a FreeBSD NFSv4.2 client, which supports creation
times. On this client, file times were weirdly permuted. With this
patch applied on the server, times looked normal on the client.
Fixes: e377a3e698 ("nfsd: Add support for the birth time attribute")
Link: https://unix.stackexchange.com/q/749605/56202
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 147abcacee ]
The following request sequence to the same file causes the NFS client and
server getting into an infinite loop with COMMIT and NFS4ERR_DELAY:
OPEN
REMOVE
WRITE
COMMIT
Problem reported by recall11, recall12, recall14, recall20, recall22,
recall40, recall42, recall48, recall50 of nfstest suite.
This patch restores the handling of race condition in nfsd_file_do_acquire
with unlink to that prior of the regression.
Fixes: ac3a2585f0 ("nfsd: rework refcounting in filecache")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>