[ Upstream commit a57c2d4e46 ]
When removing a delayed item, or releasing it (which will remove it as
well), we will modify one of the delayed node's rbtrees and its item
counter if the delayed item is in one of the rbtrees. This requires
having the delayed node's mutex locked, otherwise we will race with
other tasks modifying the rbtrees and the counter.
This is motivated by a previous version of another patch actually calling
btrfs_release_delayed_item() after unlocking the delayed node's mutex and
against a delayed item that is in a rbtree.
So assert at __btrfs_remove_delayed_item() that the delayed node's mutex
is locked.
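A minimal sketch of the assertion being added (the back-pointer field name on the delayed item is assumed here for illustration):
  static void __btrfs_remove_delayed_item(struct btrfs_delayed_item *item)
  {
          /* field name assumed for illustration */
          struct btrfs_delayed_node *node = item->delayed_node;

          /* The rbtree erase and counter update below race without this. */
          lockdep_assert_held(&node->mutex);

          /* ... erase the item from the node's rbtree, drop the counter ... */
  }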
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e3dfcab208 ]
When unmounting, all the dirty buffers will be flushed and, after the
last osd request has finished, the last reference of the i_count will
be released. Then the dirty caps/snaps will be flushed to the MDSs, but
the unmounting won't wait for the possible acks, which would ihold the
inodes when updating the metadata locally but no longer make any sense.
This will make evict_inodes() skip these inodes.
If encryption is enabled, the kernel generates a warning when removing
the encryption keys while the skipped inodes still hold the keyring:
WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b
Later the kernel will crash when calling iput() on these inodes and
dereferencing "sb->s_master_keys", which has already been released by
generic_shutdown_super().
Link: https://tracker.ceph.com/issues/59162
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d32a6bbb8 ]
nfsd4_encode_readv() uses xdr->buf->page_len as a starting point for
the nfsd_iter_read() sink buffer -- page_len is going to be offset
by the parts of the COMPOUND that have already been encoded into
xdr->buf->pages.
However, that value must be captured /before/
xdr_reserve_space_vec() advances page_len by the expected size of
the read payload. Otherwise, the whole front part of the first
page of the payload in the reply will be uninitialized.
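A rough sketch of that ordering (the two helpers below are illustrative stand-ins, not the actual nfsd functions):
  /* Capture the sink offset BEFORE reserving space for the payload,
   * since reserving advances xdr->buf->page_len past the payload. */
  unsigned int base = xdr->buf->page_len;

  reserve_read_payload(xdr, maxcount);            /* hypothetical helper */

  /* Sink the file data at the captured offset, not the advanced one,
   * otherwise the front of the first payload page stays uninitialized. */
  nfsd_read_into_pages(xdr->buf->pages, base, maxcount);  /* hypothetical */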
Mantas hit this because sec=krb5i forces RQ_SPLICE_OK off, which
invokes the readv part of the nfsd4_encode_read() path. Also,
older Linux NFS clients appear to send shorter READ requests
for files smaller than a page, whereas newer clients just send
page-sized requests and let the server send as many bytes as
are in the file.
Reported-by: Mantas Mikulėnas <grawity@gmail.com>
Closes: https://lore.kernel.org/linux-nfs/f1d0b234-e650-0f6e-0f5d-126b3d51d1eb@gmail.com/
Fixes: 703d752155 ("NFSD: Hoist rq_vec preparation into nfsd_read() [step two]")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 74ee79142c ]
Commit f98b6215d7 ("btrfs: extent_io: do extra check for extent buffer
read write functions") changed how we handle invalid extent buffer range
for read_extent_buffer().
Previously, if the range was invalid, we just set the destination to
zero, but after that patch we do nothing and error out.
This can lead to smatch static checker errors like:
fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'.
fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'.
fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'.
fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'.
fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'.
Fix those warnings by reverting back to the old memset() behavior.
This keeps the static checker happy while still making a lot of noise
when such invalid ranges are passed in.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: f98b6215d7 ("btrfs: extent_io: do extra check for extent buffer read write functions")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8287474aa5 ]
If we fail filemap_write_and_wait_range() on the range the buffered write went
into, we only report the "number of bytes which we direct-written", to quote
the comment in there. Which is fine, but buffered write has already advanced
iocb->ki_pos, so we need to roll that back. Otherwise we end up with e.g.
write(2) advancing position by more than the amount it reports having written.
Fixes: 182c25e9c1 ("filemap: update ki_pos in generic_perform_write")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Message-Id: <20230827214518.GU3390869@ZenIV>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d527f51331 ]
There is a UAF when xfstests on cifs:
BUG: KASAN: use-after-free in smb2_is_network_name_deleted+0x27/0x160
Read of size 4 at addr ffff88810103fc08 by task cifsd/923
CPU: 1 PID: 923 Comm: cifsd Not tainted 6.1.0-rc4+ #45
...
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_report+0x171/0x472
kasan_report+0xad/0x130
kasan_check_range+0x145/0x1a0
smb2_is_network_name_deleted+0x27/0x160
cifs_demultiplex_thread.cold+0x172/0x5a4
kthread+0x165/0x1a0
ret_from_fork+0x1f/0x30
</TASK>
Allocated by task 923:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_slab_alloc+0x54/0x60
kmem_cache_alloc+0x147/0x320
mempool_alloc+0xe1/0x260
cifs_small_buf_get+0x24/0x60
allocate_buffers+0xa1/0x1c0
cifs_demultiplex_thread+0x199/0x10d0
kthread+0x165/0x1a0
ret_from_fork+0x1f/0x30
Freed by task 921:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x40
____kasan_slab_free+0x143/0x1b0
kmem_cache_free+0xe3/0x4d0
cifs_small_buf_release+0x29/0x90
SMB2_negotiate+0x8b7/0x1c60
smb2_negotiate+0x51/0x70
cifs_negotiate_protocol+0xf0/0x160
cifs_get_smb_ses+0x5fa/0x13c0
mount_get_conns+0x7a/0x750
cifs_mount+0x103/0xd00
cifs_smb3_do_mount+0x1dd/0xcb0
smb3_get_tree+0x1d5/0x300
vfs_get_tree+0x41/0xf0
path_mount+0x9b3/0xdd0
__x64_sys_mount+0x190/0x1d0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
The UAF is because:
mount(pid: 921)                 | cifsd(pid: 923)
--------------------------------|--------------------------------
                                | cifs_demultiplex_thread
SMB2_negotiate                  |
 cifs_send_recv                 |
  compound_send_recv            |
   smb_send_rqst                |
   wait_for_response            |
    wait_event_state [1]        |
                                | standard_receive3
                                |  cifs_handle_standard
                                |   handle_mid
                                |    mid->resp_buf = buf; [2]
                                |    dequeue_mid [3]
 KILL the process [4]           |
  resp_iov[i].iov_base = buf    |
 free_rsp_buf [5]               |
                                | is_network_name_deleted [6]
                                | callback
1. After send request to server, wait the response until
mid->mid_state != SUBMITTED;
2. Receive response from server, and set it to mid;
3. Set the mid state to RECEIVED;
4. Kill the process, the mid state already RECEIVED, get 0;
5. Handle and release the negotiate response;
6. UAF.
It can be easily reproduced by adding some delay between [3] and [6].
Only the sync call has the problem, since an async call's callback is
executed in the cifsd process.
Add an extra state to mark the mid as READY before waking up the
waiter, so that it can get the response safely.
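A simplified sketch of that ordering (illustrative, not the exact cifs code):
  /* receiver (cifsd) side */
  mid->resp_buf = buf;                        /* [2] hand over the response  */
  mid->mid_state = MID_RESPONSE_RECEIVED;     /* [3]                         */
  /* ... receiver finishes touching the buffer, and only then ... */
  mid->mid_state = MID_RESPONSE_READY;        /* the new extra state         */
  wake_up(&server->response_q);

  /* sync waiter side: do not touch resp_buf before READY */
  wait_event(server->response_q, mid->mid_state == MID_RESPONSE_READY);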
Fixes: ec637e3ffb ("[CIFS] Avoid extra large buffer allocation (and memcpy) in cifs_readpages")
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fe44198016 ]
On no-MMU, /proc/<pid>/maps reads as an empty file. This happens because
find_vma(mm, 0) always returns NULL (assuming no vma actually contains the
zero address, which is normally the case).
To fix this bug and improve the maintainability in the future, this patch
makes the no-MMU implementation as similar as possible to the MMU
implementation.
The only remaining differences are the lack of hold/release_task_mempolicy
and the extra code to shoehorn the gate vma into the iterator.
This has been tested on top of 6.5.3 on an STM32F746.
Link: https://lkml.kernel.org/r/20230915160055.971059-2-ben.wolsieffer@hefring.com
Fixes: 0c563f1480 ("proc: remove VMA rbtree use from nommu")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Giulio Benetti <giulio.benetti@benettiengineering.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 578d7699e5 ]
The no-MMU implementation of /proc/<pid>/maps doesn't normally release
the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine
whether to release the lock. Since _vml is NULL when the end of the
mappings is reached, the lock is not released.
Reading /proc/1/maps twice doesn't cause a hang because it only
takes the read lock, which can be taken multiple times and therefore
doesn't show any problem if the lock isn't released. Instead, you need
to perform some operation that attempts to take the write lock after
reading /proc/<pid>/maps. To actually reproduce the bug, compile the
following code as 'proc_maps_bug':
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[]) {
        void *buf;

        sleep(1);
        buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        puts("mmap returned");
        return 0;
}
Then, run:
./proc_maps_bug &; cat /proc/$!/maps; fg
Without this patch, mmap() will hang and the command will never
complete.
This code was incorrectly adapted from the MMU implementation, which at
the time released the lock in m_next() before returning the last entry.
The MMU implementation has diverged further from the no-MMU version since
then, so this patch brings their locking and error handling into sync,
fixing the bug and hopefully avoiding similar issues in the future.
Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.com
Fixes: 47fecca15c ("fs/proc/task_nommu.c: don't use priv->task->mm")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Giulio Benetti <giulio.benetti@benettiengineering.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: fe44198016 ("proc: nommu: fix empty /proc/<pid>/maps")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5229a658f6 ]
Len Brown has reported that system suspend sometimes fails due to an
inability to freeze a task working in ext4_trim_fs() for one minute.
Trimming a large filesystem on a disk that slowly processes discard
requests can indeed take a long time. Since discard is just an advisory
call, it is perfectly fine to interrupt it at any time and return the
number of blocks discarded until that moment. Do that when we detect
that the task is being frozen.
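A minimal sketch of such an interruption point, placed inside the loop that issues the discard requests (the surrounding code is elided; freezing() and fatal_signal_pending() are the existing kernel helpers):
  if (freezing(current) || fatal_signal_pending(current)) {
          /*
           * Discard is only advisory: stop here so the task can be
           * frozen, and let the caller report the number of blocks
           * discarded so far as a successful, partial trim.
           */
          break;
  }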
Cc: stable@kernel.org
Reported-by: Len Brown <lenb@kernel.org>
Suggested-by: Dave Chinner <david@fromorbit.com>
References: https://bugzilla.kernel.org/show_bug.cgi?id=216322
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 45e4ab320c ]
Currently we set the group's trimmed bit in ext4_trim_all_free() based
on the return value of ext4_try_to_trim_range(). However, when we want
to abort trimming because of a suspend attempt, we want to return success
from ext4_try_to_trim_range() but not set the trimmed bit. Instead of
implementing awkward propagation of this information, just move the
setting of the trimmed bit into ext4_try_to_trim_range() when the whole
group is trimmed.
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230913150504.9054-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2c58c3931e ]
Instead of calling BUG() when we fail to insert a delayed dir index item
into the delayed node's tree, we can just release all the resources we
have allocated/acquired before and return the error to the caller. This is
fine because all existing call chains undo anything they have done before
calling btrfs_insert_delayed_dir_index() or BUG_ON (when creating pending
snapshots in the transaction commit path).
So remove the BUG() call and do proper error handling.
This relates to a syzbot report linked below, but does not fix it, because
it only prevents hitting a BUG(); it does not fix the issue where somehow
we attempt to use the same index number twice for different index items.
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 91bfe3104b ]
If we fail to add a delayed dir index item because there's already another
item with the same index number, we print an error message (and then BUG).
However that message isn't very helpful to debug anything because we don't
know what's the index number and what are the values of index counters in
the inode and its delayed inode (index_cnt fields of struct btrfs_inode
and struct btrfs_delayed_node).
So update the error message to include the index number and counters.
We actually had a recent case where this issue was hit by a syzbot report
(see the link below).
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 2c58c3931e ("btrfs: remove BUG() after failure to insert delayed dir index item")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit df1c357f25 ]
If a network filesystem using netfs implements a clamp_length()
function, it can set subrequest lengths smaller than a page size.
When we loop through the folios in netfs_rreq_unlock_folios() to
set any folios to be written back, we need to make sure we only
call folio_start_fscache() once for each folio.
Otherwise, this simple testcase:
mount -o fsc,rsize=1024,wsize=1024 127.0.0.1:/export /mnt/nfs
dd if=/dev/zero of=/mnt/nfs/file.bin bs=4096 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.0126359 s, 324 kB/s
echo 3 > /proc/sys/vm/drop_caches
cat /mnt/nfs/file.bin > /dev/null
will trigger an oops similar to the following:
page dumped because: VM_BUG_ON_FOLIO(folio_test_private_2(folio))
------------[ cut here ]------------
kernel BUG at include/linux/netfs.h:44!
...
CPU: 5 PID: 134 Comm: kworker/u16:5 Kdump: loaded Not tainted 6.4.0-rc5
...
RIP: 0010:netfs_rreq_unlock_folios+0x68e/0x730 [netfs]
...
Call Trace:
netfs_rreq_assess+0x497/0x660 [netfs]
netfs_subreq_terminated+0x32b/0x610 [netfs]
nfs_netfs_read_completion+0x14e/0x1a0 [nfs]
nfs_read_completion+0x2f9/0x330 [nfs]
rpc_free_task+0x72/0xa0 [sunrpc]
rpc_async_release+0x46/0x70 [sunrpc]
process_one_work+0x3bd/0x710
worker_thread+0x89/0x610
kthread+0x181/0x1c0
ret_from_fork+0x29/0x50
Fixes: 3d3c950467 ("netfs: Provide readahead and readpage netfs helpers")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2210612
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230608214137.856006-1-dwysocha@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230915185704.1082982-1-dwysocha@redhat.com/ # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 52954b7509 ]
On a thawed filesystem, the freeze glock is held in shared mode. In
order to initiate a cluster-wide freeze, the node initiating the freeze
drops the freeze glock and grabs it in exclusive mode. The other nodes
recognize this as contention on the freeze glock; function
freeze_go_callback is invoked. This indicates to them that they must
freeze the filesystem locally, drop the freeze glock, and then
re-acquire it in shared mode before being able to unfreeze the
filesystem locally.
While a node is trying to re-acquire the freeze glock in shared mode,
additional contention can occur. In that case, the node must behave in
the same way as above.
Unfortunately, freeze_go_callback() contains a check that causes it to
bail out when the freeze glock isn't held in shared mode. Fix that to
allow the glock to be unlocked or held in shared mode.
In addition, update a reference to trylock_super() which has been
renamed to super_trylock_shared() in the meantime.
Fixes: b77b4a4815 ("gfs2: Rework freeze / thaw logic")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 806a3bc421 ]
Currently, when GETDEVICEINFO returns multiple locations where each
is a different IP but the server's identity is same as MDS, then
nfs4_set_ds_client() finds the existing nfs_client structure which
has the MDS's max_connect value (and if it's 1), then the 1st IP
on the DS's list will get dropped due to MDS trunking rules. Other
IPs would be added as they fall under the pnfs trunking rules.
For the list of IPs the 1st goes thru calling nfs4_set_ds_client()
which will eventually call nfs4_add_trunk() and call into
rpc_clnt_test_and_add_xprt() which has the check for MDS trunking.
The other IPs (after the 1st one), would call rpc_clnt_add_xprt()
which doesn't go thru that check.
nfs4_add_trunk() is called when MDS trunking is happening and it
needs to enforce the usage of max_connect mount option of the
1st mount. However, this shouldn't be applied to pnfs flow.
Instead, this patch proposes to treat MDS=DS as DS trunking and
make sure that the MDS's max_connect limit does not apply to the
1st IP returned in the GETDEVICEINFO list. It does so by
marking the newly created client with a new flag, NFS_CS_PNFS,
which is then used to pass the max_connect value to use into
rpc_clnt_test_and_add_xprt() instead of the existing rpc
client's max_connect value set by the MDS connection.
For example, say the mount was done without a max_connect value set,
so the MDS's rpc client has cl_max_connect=1. Upon calling into
rpc_clnt_test_and_add_xprt() and using the rpc client's value,
the caller passes in the max_connect value which has previously
been set in the pnfs path (as a part of handling the
GETDEVICEINFO list of IPs) in nfs4_set_ds_client().
However, when the NFS_CS_PNFS flag is not set and we know we
are doing MDS trunking (comparing a new IP of the same
server), we then set the max_connect value to the
existing MDS's value and pass that into
rpc_clnt_test_and_add_xprt().
Fixes: dc48e0abee ("SUNRPC enforce creation of no more than max_connect xprts")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 51d674a5e4 ]
After receiving the location(s) of the DS server(s) in the
GETDEVICEINFO, create the request for the clientid to such a
server and indicate that the client is connecting to a DS.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Stable-dep-of: 806a3bc421 ("NFSv4.1: fix pnfs MDS=DS session trunking")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dd7d7ee3ba ]
With IPv6, connect() can occasionally return EINVAL if a route is
unavailable. If this happens during I/O to a data server, we want to
report it using LAYOUTERROR as an inability to connect.
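A sketch of the resulting classification (the errno values are real, but the switch below is illustrative rather than the exact flexfiles error handler):
  switch (task->tk_status) {
  case -ECONNREFUSED:
  case -EHOSTDOWN:
  case -EHOSTUNREACH:
  case -ENETUNREACH:
  case -EINVAL:  /* IPv6 connect() can return this when no route exists */
          /* Treat as an inability to connect and report it via LAYOUTERROR. */
          break;
  default:
          break;
  }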
Fixes: dd52128afd ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b11243f720 ]
Ensure that all requests are put back onto the commit list so that they
can be rescheduled.
Fixes: 4daaeba938 ("NFS: Fix nfs_direct_write_reschedule_io()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b193a78ddb ]
Ensure that nfs_clear_request_commit() updates the correct counters when
it removes them from the commit list.
Fixes: ed5d588fe4 ("NFS: Try to join page groups before an O_DIRECT retransmission")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8982f7aff3 ]
If we hit a fatal error when retransmitting, we do need to record the
removal of the request from the count of written bytes.
Fixes: 031d73ed76 ("NFS: Fix O_DIRECT accounting of number of bytes read/written")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 954998b60c ]
If we fail to schedule a request for transmission, there are 2
possibilities:
1) Either we hit a fatal error, and we just want to drop the remaining
requests on the floor.
2) We were asked to try again, in which case we should allow the
outstanding RPC calls to complete, so that we can recoalesce requests
and try again.
Fixes: d600ad1f2b ("NFS41: pop some layoutget errors to application")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7fda67e8c3 upstream.
With the configuration PAGE_SIZE 64k and filesystem blocksize 64k,
a problem occurred when more than 13 million files were directly created
under a directory:
EXT4-fs error (device xx): ext4_dx_csum_set:492: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D.
EXT4-fs error (device xx): ext4_dx_csum_verify:463: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D.
EXT4-fs error (device xx): dx_probe:856: inode #xxxx: block 8188: comm xxxxx: Directory index failed checksum
When enough files are created, the fake_dirent->reclen will be 0xffff,
which does not equal the blocksize 65536, i.e. 0x10000.
It is a different story when the blocksize is 4k: when enough files are
created, the fake_dirent->reclen will be 0x1000, which does equal the
blocksize 4k, i.e. 0x1000.
The problem seems to be related to the limitation of the 16-bit field
when the blocksize is set to 64k.
To address this, helpers like ext4_rec_len_{from,to}_disk have already
been introduced to handle the conversion between the encoded and the
plain form of rec_len.
So fix this one by using the helpers, and convert all the other places
in this file too.
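An illustrative before/after of the kind of check being converted (simplified, not the exact lines in fs/ext4/namei.c); with a 64k block, a rec_len covering the whole block is stored in the 16-bit on-disk field in encoded form, so it must be decoded before comparing against the blocksize:
  /* before: compares the raw on-disk __le16 value directly */
  if (le16_to_cpu(de->rec_len) == blocksize)   /* never true for 64k blocks */
          ...

  /* after: decode the on-disk value with the helper first */
  if (ext4_rec_len_from_disk(de->rec_len, blocksize) == blocksize)
          ...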
Cc: stable@kernel.org
Fixes: dbe8944404 ("ext4: Calculate and verify checksums for htree nodes")
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20230803060938.1929759-1-zhangshida@kylinos.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 51aab5ffce upstream.
The function tracefs_create_dir() was missing a lockdown check and was
called by the RV code. This gave inconsistent behavior, with this function
returning success while other tracefs functions failed. This caused the
inode to be freed by the wrong kmem_cache.
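The missing check mirrors what the other tracefs creation paths already do; a minimal sketch (the rest of the function is elided):
  struct dentry *tracefs_create_dir(const char *name, struct dentry *parent)
  {
          if (security_locked_down(LOCKDOWN_TRACEFS))
                  return NULL;

          /* ... create the directory as before ... */
  }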
Link: https://lkml.kernel.org/r/20230905182711.692687042@goodmis.org
Link: https://lore.kernel.org/all/202309050916.58201dc6-oliver.sang@intel.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Cc: Ching-lin Yu <chinglinyu@google.com>
Fixes: bf8e602186 ("tracing: Do not create tracefs files if tracefs lockdown is in effect")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fdd2630a73 upstream.
nfsd sends the transposed directory change info in the RENAME reply. The
source directory is in save_fh and the target is in current_fh.
Reported-by: Zhi Li <yieli@redhat.com>
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2218844
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4ca8e03cf2 upstream.
If we do fast tree logging we increment a counter on the current
transaction for every ordered extent we need to wait for. This means we
expect the transaction to still be there when we clear pending on the
ordered extent. However if we happen to abort the transaction and clean
it up, there could be no running transaction, and thus we'll trip the
"ASSERT(trans)" check. This is obviously incorrect, and the code
properly deals with the case that the transaction doesn't exist. Fix
this ASSERT() to only fire if there's no trans and we don't have
BTRFS_FS_ERROR() set on the file system.
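A minimal sketch of the relaxed assertion (the counter name is assumed for illustration):
  /* Only insist on a running transaction when the fs has not aborted. */
  ASSERT(trans || BTRFS_FS_ERROR(fs_info));
  if (trans)
          atomic_dec(&trans->pending_ordered);    /* counter name assumed */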
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ee34a82e89 upstream.
During the ino lookup ioctl we can end up calling btrfs_iget() to get an
inode reference while we are holding a path on a root's btree. If
btrfs_iget() needs to look up the inode from the root's btree, because it's not
currently loaded in memory, then it will need to lock another or the
same path in the same root btree. This may result in a deadlock and
trigger the following lockdep splat:
WARNING: possible circular locking dependency detected
6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Not tainted
------------------------------------------------------
syz-executor277/5012 is trying to acquire lock:
ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
but task is already holding lock:
ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-tree-00){++++}-{3:3}:
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302
btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955
btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline]
btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338
btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline]
open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494
btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154
btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519
legacy_get_tree+0xef/0x190 fs/fs_context.c:611
vfs_get_tree+0x8c/0x270 fs/super.c:1519
fc_mount fs/namespace.c:1112 [inline]
vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142
btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579
legacy_get_tree+0xef/0x190 fs/fs_context.c:611
vfs_get_tree+0x8c/0x270 fs/super.c:1519
do_new_mount+0x28f/0xae0 fs/namespace.c:3335
do_mount fs/namespace.c:3675 [inline]
__do_sys_mount fs/namespace.c:3884 [inline]
__se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-tree-01){++++}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3142 [inline]
check_prevs_add kernel/locking/lockdep.c:3261 [inline]
validate_chain kernel/locking/lockdep.c:3876 [inline]
__lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
rlock(btrfs-tree-00);
lock(btrfs-tree-01);
lock(btrfs-tree-00);
rlock(btrfs-tree-01);
*** DEADLOCK ***
1 lock held by syz-executor277/5012:
#0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
stack backtrace:
CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195
check_prev_add kernel/locking/lockdep.c:3142 [inline]
check_prevs_add kernel/locking/lockdep.c:3261 [inline]
validate_chain kernel/locking/lockdep.c:3876 [inline]
__lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144
lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761
down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645
__btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136
btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline]
btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281
btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154
btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412
btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline]
btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716
btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline]
btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105
btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f0bec94ea39
Fix this simply by releasing the path before calling btrfs_iget(), as at
that point we don't need the path anymore.
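A simplified sketch of the reordering (variable and label names are assumed):
  /* Done with the root's btree: drop the path and its locks before
   * btrfs_iget() possibly searches the same tree again. */
  btrfs_release_path(path);

  temp_inode = btrfs_iget(sb, key2.objectid, root);
  if (IS_ERR(temp_inode)) {
          ret = PTR_ERR(temp_inode);
          goto out_put;                           /* label assumed */
  }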
Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/
Fixes: 23d0b79dfa ("btrfs: Add unprivileged version of ino_lookup ioctl")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2d6cd791e6 upstream.
Commit 675dfe1223 ("btrfs: fix block group item corruption after
inserting new block group") fixed one race that resulted in not persisting
a block group's item when its "used" bytes field decreases to zero.
However there's another race that can happen in a much shorter time window
that results in the same problem. The following sequence of steps explains
how it can happen:
1) Task A creates a metadata block group X, its "used" and "commit_used"
fields are initialized to 0;
2) Two extents are allocated from block group X, so its "used" field is
updated to 32K, and its "commit_used" field remains as 0;
3) Transaction commit starts, by some task B, and it enters
btrfs_start_dirty_block_groups(). There it tries to update the block
group item for block group X, which currently has its "used" field with
a value of 32K and its "commit_used" field with a value of 0. However
that fails since the block group item was not yet inserted, so at
update_block_group_item(), the btrfs_search_slot() call returns 1, and
then we set 'ret' to -ENOENT. Before jumping to the label 'fail'...
4) The block group item is inserted by task A, when for example
btrfs_create_pending_block_groups() is called when releasing its
transaction handle. This results in insert_block_group_item() inserting
the block group item in the extent tree (or block group tree), with a
"used" field having a value of 32K and setting "commit_used", in struct
btrfs_block_group, to the same value (32K);
5) Task B jumps to the 'fail' label and then resets the "commit_used"
field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was
returned from update_block_group_item(), we add the block group again
to the list of dirty block groups, so that we will try again in the
critical section of the transaction commit when calling
btrfs_write_dirty_block_groups();
6) Later the two extents from block group X are freed, so its "used" field
becomes 0;
7) If no more extents are allocated from block group X before we get into
btrfs_write_dirty_block_groups(), then when we call
update_block_group_item() again for block group X, we will not update
the block group item to reflect that it has 0 bytes used, because the
"used" and "commit_used" fields in struct btrfs_block_group have the
same value, a value of 0.
As a result after committing the transaction we have an empty block
group with its block group item having a 32K value for its "used" field.
This will trigger errors from fsck ("btrfs check" command) and after
mounting again the fs, the cleaner kthread will not automatically delete
the empty block group, since its "used" field is not 0. Possibly there
are other issues due to this inconsistency.
When this issue happens, the error reported by fsck is like this:
[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)
So fix this by not resetting the "commit_used" field of a block group when
we don't find the block group item at update_block_group_item().
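A sketch of the guarded reset at the 'fail' label described above (variable names are taken from the text and should be treated as assumed):
  fail:
          btrfs_release_path(path);
          /*
           * Only revert commit_used if we failed after actually finding and
           * updating the item. On -ENOENT the item was not inserted yet, and
           * another task may insert it (and set commit_used) concurrently.
           */
          if (ret < 0 && ret != -ENOENT) {
                  spin_lock(&cache->lock);
                  cache->commit_used = old_commit_used;
                  spin_unlock(&cache->lock);
          }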
Fixes: 7248e0cebb ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5e0e879926 upstream.
[BUG]
After commit 72a69cd030 ("btrfs: subpage: pack all subpage bitmaps
into a larger bitmap"), the DEBUG section of btree_dirty_folio() would
no longer compile.
[CAUSE]
If DEBUG is defined, we would do extra checks for btree_dirty_folio(),
mostly to make sure the range we marked dirty has an extent buffer and
that extent buffer is dirty.
For subpage, we need to iterate through all the extent buffers covered
by that page range, and make sure they all match the criteria.
However commit 72a69cd030 ("btrfs: subpage: pack all subpage bitmaps
into a larger bitmap") changes how we store the bitmap, we pack all the
16 bits bitmaps into a larger bitmap, which would save some space.
This means we no longer have btrfs_subpage::dirty_bitmap, instead the
dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a
length of btrfs_subpage_info::bitmap_nr_bits.
[FIX]
Although I'm not sure if it still makes sense to maintain such code, at
least let it compile.
This patch would let us test the bits one by one through the bitmaps.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 724768a393 upstream.
ovl_{read,write}_iter() always call fdput(real) to put one or zero
refcounts of the real file, but for aio, whether it was submitted or not,
ovl_aio_put() also calls fdput(), which is not balanced. This is only a
problem in the less common case when FDPUT_FPUT flag is set.
To fix the problem use get_file() to take file refcount and use fput()
instead of fdput() in ovl_aio_put().
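A sketch of the balanced referencing this describes (the ovl_aio_req field names are assumed):
  /* submission: pin the real file for the lifetime of the aio request */
  get_file(real.file);
  aio_req->iocb.ki_filp = real.file;
  /* ... ovl_{read,write}_iter still does its own fdput(real) as before */

  /* completion: */
  static void ovl_aio_put(struct ovl_aio_req *aio_req)
  {
          if (refcount_dec_and_test(&aio_req->ref)) {
                  fput(aio_req->iocb.ki_filp);    /* was the unbalanced fdput() */
                  kmem_cache_free(ovl_aio_request_cachep, aio_req);
          }
  }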
Fixes: 2406a307ac ("ovl: implement async IO routines")
Cc: <stable@vger.kernel.org> # v5.6
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ab04830202 upstream.
Some local filesystems support setting persistent fileattr flags
(e.g. FS_NOATIME_FL) on directories and regular files via ioctl.
Some of those persistent fileattr flags are reflected to vfs as
in-memory inode flags (e.g. S_NOATIME).
Overlayfs uses the in-memory inode flags (e.g. S_NOATIME) on a lower file
as an indication that the lower file may have persistent inode fileattr
flags (e.g. FS_NOATIME_FL) that need to be copied to the upper file.
However, in some cases, the S_NOATIME in-memory flag could be a false
indication for persistent FS_NOATIME_FL fileattr. For example, with NFS
and FUSE lower fs, as was the case in the two bug reports, the S_NOATIME
flag is set unconditionally for all inodes.
Users cannot set persistent fileattr flags on symlinks and special files,
but in some local filesystems, such as ext4/btrfs/tmpfs, the FS_NOATIME_FL
fileattr flag is inherited by symlinks and special files from the parent
directory.
In both cases described above, when the lower symlink has the S_NOATIME
flag, overlayfs will try to copy the symlink's fileattrs and fail with
ENXIO, because it could not open the symlink for the ioctl security hook.
To solve this failure, do not attempt to copyup fileattrs for anything
other than directories and regular files.
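The resulting guard amounts to a file-type check before the fileattr copy-up; a minimal sketch (function and argument names are assumed):
  /* Only directories and regular files can carry persistent fileattr
   * flags, and only they can be opened for the fileattr ioctls. */
  if (S_ISREG(stat->mode) || S_ISDIR(stat->mode)) {
          err = ovl_copy_fileattr(inode, &oldpath, &upperpath);
          if (err)
                  return err;
  }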
Reported-by: Ruiwen Zhao <ruiwen@google.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217850
Fixes: 72db82115d ("ovl: copy up sync/noatime fileattr flags")
Cc: <stable@vger.kernel.org> # v5.15
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5d1f903f75 upstream.
Changing the mode of symlinks is meaningless as the vfs doesn't take the
mode of a symlink into account during path lookup permission checking.
However, the vfs doesn't block mode changes on symlinks. This however,
has lead to an untenable mess roughly classifiable into the following
two categories:
(1) Filesystems that don't implement an i_op->setattr() for symlinks.
Such filesystems may or may not know that without i_op->setattr()
defined, notify_change() falls back to simple_setattr() causing the
inode's mode in the inode cache to be changed.
That's a generic issue as this will affect all non-size changing
inode attributes including ownership changes.
Example: afs
(2) Filesystems that fail with EOPNOTSUPP but change the mode of the
symlink nonetheless.
Some filesystems will happily update the mode of a symlink but still
return EOPNOTSUPP. This is the biggest source of confusion for
userspace.
The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
comes from filesystems that call posix_acl_chmod(), e.g., btrfs via
if (!err && attr->ia_valid & ATTR_MODE)
err = posix_acl_chmod(idmap, dentry, inode->i_mode);
Filesystems including btrfs don't implement i_op->set_acl() so
posix_acl_chmod() will report EOPNOTSUPP.
When posix_acl_chmod() is called, most filesystems will have
finished updating the inode.
Perversely, this has the consequences that this behavior may depend
on two kconfig options and mount options:
* CONFIG_POSIX_ACL={y,n}
* CONFIG_${FSTYPE}_POSIX_ACL={y,n}
* Opt_acl, Opt_noacl
Example: btrfs, ext4, xfs
The only way to change the mode on a symlink currently involves abusing
an O_PATH file descriptor in the following manner:
fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
chmod(path, 0000);
But for most major filesystems with POSIX ACL support such as btrfs,
ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
the mode still updated due to the aforementioned posix_acl_chmod()
nonsense.
So, given that for all major filesystems this would fail with EOPNOTSUPP
and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
changes on symlinks we should just try and block mode changes on
symlinks directly in the vfs and have a clean break with this nonsense.
If this causes any regressions, we do the next best thing and fix up all
filesystems that do return EOPNOTSUPP with the mode updated to not call
posix_acl_chmod() on symlinks.
But as usual, let's try the clean cut solution first. It's a simple
patch that can be easily reverted. Not marking this for backport as I'll
do that manually if we're reasonably sure that this works and there are
no strong objections.
We could block this in chmod_common() but it's more appropriate to do it
in notify_change() as that will also mean that we catch filesystems that
change symlink permissions explicitly or accidentally.
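The check itself is small; a sketch of what blocking it in notify_change() looks like:
  /* In notify_change(): refuse mode changes on symlinks outright. */
  if (S_ISLNK(inode->i_mode) && (attr->ia_valid & ATTR_MODE))
          return -EOPNOTSUPP;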
Similar proposals were floated in the past as in [3] and [4] and again
recently in [5]. There's also a couple of bugs about this inconsistency
as in [6] and [7].
Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7]
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 79b83606ab ]
Some firmware (notably U-Boot) provides GetVariable() and
GetNextVariableName() but not QueryVariableInfo().
With commit d86ff3333c ("efivarfs: expose used and total size") the
statfs syscall was broken for such firmware.
If QueryVariableInfo() does not exist or returns EFI_UNSUPPORTED, just
report the file system size as 0 as statfs_simple() previously did.
Fixes: d86ff3333c ("efivarfs: expose used and total size")
Link: https://lore.kernel.org/all/20230910045445.41632-1-heinrich.schuchardt@canonical.com/
Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
[ardb: log warning on QueryVariableInfo() failure]
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6bfe3959b0 ]
The function btrfs_validate_super() should verify the metadata_uuid in
the provided superblock argument. Because, all its callers expect it to
do that.
Such as in the following stacks:
write_all_supers()
   sb = fs_info->super_for_commit;
   btrfs_validate_write_super(.., sb)
     btrfs_validate_super(.., sb, ..)
scrub_one_super()
   btrfs_validate_super(.., sb, ..)
And
check_dev_super()
   btrfs_validate_super(.., sb, ..)
However, it currently verifies the fs_info::super_copy::metadata_uuid
instead. Fix this using the correct metadata_uuid in the superblock
argument.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4844c3664a ]
In some cases, we need to read the FSID from the superblock when the
metadata_uuid is not set, and otherwise, read the metadata_uuid. So,
add a helper.
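A sketch of such a helper (the incompat-flag test is the assumed way of detecting whether metadata_uuid is in use):
  /* Return the superblock's effective fsid: metadata_uuid when the
   * METADATA_UUID incompat bit is set, otherwise the plain fsid. */
  static const u8 *btrfs_sb_fsid_ptr(struct btrfs_super_block *sb)
  {
          u64 incompat = btrfs_super_incompat_flags(sb);

          return (incompat & BTRFS_FEATURE_INCOMPAT_METADATA_UUID) ?
                  sb->metadata_uuid : sb->fsid;
  }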
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 6bfe3959b0 ("btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 13bb483d32 ]
In the current implementation, block groups are activated at reservation
time to ensure that all reserved bytes can be written to an active metadata
block group. However, this approach has proven to be less efficient, as it
activates block groups more frequently than necessary, putting pressure on
the active zone resource and leading to potential issues such as early
ENOSPC or hung_task.
Another drawback of the current method is that it hampers metadata
over-commit, and necessitates additional flush operations and block group
allocations, resulting in decreased overall performance.
To address these issues, this commit introduces a write-time activation of
metadata and system block group. This involves reserving at least one
active block group specifically for a metadata and system block group.
Since metadata write-out is always allocated sequentially, when we need to
write to a non-active block group, we can wait for the ongoing IOs to
complete, activate a new block group, and then proceed with writing to the
new block group.
Fixes: b093151391 ("btrfs: zoned: activate metadata block group on flush_space")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0356ad41e0 ]
We currently advance the meta_write_pointer in
btrfs_check_meta_write_pointer(). That makes it necessary to revert it
when locking the buffer failed. Instead, we can advance it just before
sending the buffer.
Also, this is necessary for the following commit. In the commit, it needs
to release the zoned_meta_io_lock to allow IOs to come in and wait for them
to fill the currently active block group. If we advance the
meta_write_pointer before locking the extent buffer, the following extent
buffer can pass the meta_write_pointer check, resulting in an unaligned
write failure.
Advancing the pointer is still thread-safe as the extent buffer is locked.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 13bb483d32 ("btrfs: zoned: activate metadata block group on write time")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2ad8c0510a ]
Now that we have writeback_control passed to
btrfs_check_meta_write_pointer(), we can move the wbc condition in
submit_eb_page() to btrfs_check_meta_write_pointer() and return int.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 13bb483d32 ("btrfs: zoned: activate metadata block group on write time")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7db94301a9 ]
For metadata write out on the zoned mode, we call
btrfs_check_meta_write_pointer() to check if an extent buffer to be written
is aligned to the write pointer.
We look up a block group containing the extent buffer for every extent
buffer, which takes unnecessary effort as the writing extent buffers are
mostly contiguous.
Introduce "zoned_bg" to cache the block group working on. Also, while
at it, rename "cache" to "block_group".
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 13bb483d32 ("btrfs: zoned: activate metadata block group on write time")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 861093eff4 ]
Introduce btrfs_eb_write_context to consolidate writeback_control and the
extent buffer context. This will help with adding a block group context as
well.
While at it, move the eb context setting before
btrfs_check_meta_write_pointer(). We can set it here because we need to
skip pages in the same eb anyway if that eb is rejected by
btrfs_check_meta_write_pointer().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 13bb483d32 ("btrfs: zoned: activate metadata block group on write time")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bc056e7163 ]
When we calculate the end position of ext4_free_extent, this position may
be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if
ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the
computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not
the first case of adjusting the best extent, that is, new_bex_end > 0, the
following BUG_ON will be triggered:
=========================================================
kernel BUG at fs/ext4/mballoc.c:5116!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279
RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430
Call Trace:
<TASK>
ext4_mb_use_best_found+0x203/0x2f0
ext4_mb_try_best_found+0x163/0x240
ext4_mb_regular_allocator+0x158/0x1550
ext4_mb_new_blocks+0x86a/0xe10
ext4_ext_map_blocks+0xb0c/0x13a0
ext4_map_blocks+0x2cd/0x8f0
ext4_iomap_begin+0x27b/0x400
iomap_iter+0x222/0x3d0
__iomap_dio_rw+0x243/0xcb0
iomap_dio_rw+0x16/0x80
=========================================================
A simple reproducer demonstrating the problem:
mkfs.ext4 -F /dev/sda -b 4096 100M
mount /dev/sda /tmp/test
fallocate -l1M /tmp/test/tmp
fallocate -l10M /tmp/test/file
fallocate -i -o 1M -l16777203M /tmp/test/file
fsstress -d /tmp/test -l 0 -n 100000 -p 8 &
sleep 10 && killall -9 fsstress
rm -f /tmp/test/tmp
xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192"
We simply refactor the logic for adjusting the best extent by adding
a temporary ext4_free_extent ex and use extent_logical_end() to avoid
overflow, which also simplifies the code.
Cc: stable@kernel.org # 6.4
Fixes: 93cdf49f6e ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230724121059.11834-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bedc5d3463 ]
Let's say we want to allocate 2 blocks starting from 4294966386, after
predicting the file size, start is aligned to 4294965248, len is changed
to 2048, then end = start + size = 0x100000000. Since end is of
type ext4_lblk_t, i.e. uint, end is truncated to 0.
This causes (pa->pa_lstart >= end) to always hold when checking if the
current extent to be allocated crosses already preallocated blocks, so the
resulting ac_g_ex may cross already preallocated blocks. Hence we convert
the end type to loff_t and use pa_logical_end() to avoid overflow.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230724121059.11834-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 43bbddc067 ]
When we use lstart + len to calculate the end of free extent or prealloc
space, it may exceed the maximum value of 4294967295(0xffffffff) supported
by ext4_lblk_t and cause overflow, which may lead to various problems.
Therefore, we add two helper functions, extent_logical_end() and
pa_logical_end(), to limit the type of end to loff_t, and also convert
lstart to loff_t for calculation to avoid overflow.
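A sketch of what such helpers look like, following the description above (the widening cast to loff_t is the whole point; EXT4_C2B converts clusters to blocks):
  static inline loff_t extent_logical_end(struct ext4_sb_info *sbi,
                                          struct ext4_free_extent *fex)
  {
          /* Compute in loff_t so lstart + len cannot wrap ext4_lblk_t. */
          return (loff_t)fex->fe_logical + EXT4_C2B(sbi, fex->fe_len);
  }

  static inline loff_t pa_logical_end(struct ext4_sb_info *sbi,
                                      struct ext4_prealloc_space *pa)
  {
          return (loff_t)pa->pa_lstart + EXT4_C2B(sbi, pa->pa_len);
  }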
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230724121059.11834-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e880763484 ]
I run a small server that uses external hard drives for backups. The
backup software I use uses ext2 filesystems with 4KiB block size and
the server is running SELinux and therefore relies on xattr. I recently
upgraded the hard drives from 4TB to 12TB models. I noticed that after
transferring some TBs I got a filesystem error "Freeing blocks not in
datazone - block = 18446744071529317386, count = 1" and the backup
process stopped. Trying to fix the fs with e2fsck resulted in a
completely corrupted fs. The error probably came from ext2_free_blocks(),
and because of the large number 18e19 this problem immediately looked
like some kind of integer overflow. Whereas the 4TB fs was about 1e9
blocks, the new 12TB is about 3e9 blocks. So, searching the ext2 code,
I came across the line in fs/ext2/xattr.c:745 where ext2_new_block()
is called and the resulting block number is stored in the variable block
as an int datatype. If a block with a block number greater than
INT32_MAX is returned, this variable overflows and the call to
sb_getblk() at line fs/ext2/xattr.c:750 fails, then the call to
ext2_free_blocks() produces the error.
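The fix this points at is simply a type change for the block variable; a sketch of the implied before/after (simplified from fs/ext2/xattr.c):
  /* before: block numbers above INT32_MAX overflow the int */
  int block = ext2_new_block(inode, goal, &error);

  /* after: use the full-width block number type */
  ext2_fsblk_t block = ext2_new_block(inode, goal, &error);

  bh = sb_getblk(sb, block);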
Signed-off-by: Georg Ottinger <g.ottinger@gmx.at>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230815100340.22121-1-g.ottinger@gmx.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7f72f50547 ]
[BUG]
Syzbot reported several warnings triggered inside
lookup_inline_extent_backref().
[CAUSE]
As usual, the reproducer doesn't reliably trigger locally here, but at
least we know the WARN_ON() is triggered when an inline backref can not
be found, and it can only be triggered when @insert is true. (I.e.
inserting a new inline backref, which means the backref should already
exist)
[ENHANCEMENT]
After the WARN_ON(), dump all the parameters and the extent tree
leaf to help debug.
Link: https://syzkaller.appspot.com/bug?extid=d6f9ff86c1d804ba2bc6
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 257614301a ]
[PROBLEM]
Inside function update_inline_extent_backref(), we have several
BUG_ON()s along with some ASSERT()s which can be triggered by corrupted
filesystem.
[ANALYSIS]
Most of those BUG_ON()s and ASSERT()s are just a way of handling
unexpected on-disk data.
Although we have tree-checker to rule out obviously incorrect extent
tree blocks, it's not enough for these ones. Thus we need proper error
handling for them.
[FIX]
Thankfully all the callers of update_inline_extent_backref() would
eventually handle the error by aborting the current transaction.
So this patch would do the proper error handling by:
- Make update_inline_extent_backref() return int
The return value would be either 0 or -EUCLEAN.
- Replace BUG_ON()s and ASSERT()s with proper error handling
This includes:
* Dump the bad extent tree leaf
* Output an error message for the cause
This would include the extent bytenr, num_bytes (if needed), the bad
values and expected good values.
* Return -EUCLEAN
Note here we remove all the WARN_ON()s, as eventually the transaction
would be aborted, thus a backtrace would be triggered anyway.
- Better comments on why we expect refs == 1 and refs_to_mod == -1 for
tree blocks
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ccbe77f7e4 ]
Syzkaller reports a memory leak:
BUG: memory leak
unreferenced object 0xffff88810b279e00 (size 96):
comm "syz-executor399", pid 3631, jiffies 4294964921 (age 23.870s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 08 9e 27 0b 81 88 ff ff ..........'.....
08 9e 27 0b 81 88 ff ff 00 00 00 00 00 00 00 00 ..'.............
backtrace:
[<ffffffff814cfc90>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046
[<ffffffff81bb75ca>] kmalloc include/linux/slab.h:576 [inline]
[<ffffffff81bb75ca>] autofs_wait+0x3fa/0x9a0 fs/autofs/waitq.c:378
[<ffffffff81bb88a7>] autofs_do_expire_multi+0xa7/0x3e0 fs/autofs/expire.c:593
[<ffffffff81bb8c33>] autofs_expire_multi+0x53/0x80 fs/autofs/expire.c:619
[<ffffffff81bb6972>] autofs_root_ioctl_unlocked+0x322/0x3b0 fs/autofs/root.c:897
[<ffffffff81bb6a95>] autofs_root_ioctl+0x25/0x30 fs/autofs/root.c:910
[<ffffffff81602a9c>] vfs_ioctl fs/ioctl.c:51 [inline]
[<ffffffff81602a9c>] __do_sys_ioctl fs/ioctl.c:870 [inline]
[<ffffffff81602a9c>] __se_sys_ioctl fs/ioctl.c:856 [inline]
[<ffffffff81602a9c>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:856
[<ffffffff84608225>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff84608225>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
[<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
autofs_wait_queue structs should be freed if their wait_ctr becomes zero.
Otherwise they will be lost.
In this case an AUTOFS_IOC_EXPIRE_MULTI ioctl is done, then a new
waitqueue struct is allocated in autofs_wait(), its initial wait_ctr
equals 2. After that wait_event_killable() is interrupted (it returns
-ERESTARTSYS), so the 'wq->name.name == NULL' condition may not be
satisfied. Actually, this condition can be satisfied when
autofs_wait_release() or autofs_catatonic_mode() is called and, what is
also important, wait_ctr is decremented in those places. Upon the exit of
autofs_wait(), wait_ctr is decremented to 1. Then the unmounting process
begins: kill_sb calls autofs_catatonic_mode(), which should have freed the
waitqueues, but it only decrements the usage counter to zero, which is not
correct behaviour.
edit:imk
This description is of course not correct. The umount performed as a result
of an expire is a umount of a mount that has been automounted, it's not the
autofs mount itself. They happen independently, usually after everything
mounted within the autofs file system has been expired away. If everything
hasn't been expired away the automount daemon can still exit leaving mounts
in place. But expires done in both cases will result in a notification that
calls autofs_wait_release() with a result status. The problem case is the
summary execution of the automount daemon. In this case any waiting
processes won't be woken up until either they are terminated or the mount
is umounted.
end edit: imk
So in catatonic mode we should free the waitqueues whose counter becomes
zero (see the sketch after the notes below).
edit: imk
Initially I was concerned that the calling of autofs_wait_release() and
autofs_catatonic_mode() was not mutually exclusive but that can't be the
case (obviously) because the queue entry (or entries) is removed from the
list when either of these two functions are called. Consequently the wait
entry will be freed by only one of these functions or by the woken process
in autofs_wait() depending on the order of the calls.
end edit: imk
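A minimal sketch of the resulting loop body in autofs_catatonic_mode()
(simplified; not the verbatim patch):
  wq->status = -ENOENT;                 /* magic is gone - report failure */
  kfree(wq->name.name - wq->offset);
  wq->name.name = NULL;
  wake_up_interruptible(&wq->queue);
  if (!--wq->wait_ctr)                  /* no waiter left to free the entry */
          kfree(wq);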
Reported-by: syzbot+5e53f70e69ff0c0a1c0c@syzkaller.appspotmail.com
Suggested-by: Takeshi Misawa <jeliantsurux@gmail.com>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: autofs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <169112719161.7590.6700123246297365841.stgit@donald.themaw.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit eee2d2e6ea ]
folio_next_index() returns an unsigned long value which left shifted
by PAGE_SHIFT could possibly cause an overflow on 32-bit systems. Instead
use folio_pos(folio) + folio_size(folio), which does this correctly.
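A short illustration of the 32-bit case (standalone sketch, assuming
PAGE_SHIFT == 12):
  /* The shift is evaluated in the 32-bit unsigned long type before being
   * widened to the 64-bit loff_t, so the high bits are lost. */
  unsigned long next_index = 0x100000UL;          /* folio index just past 4GiB / 4KiB */
  long long bad  = next_index << 12;              /* wraps to 0 on a 32-bit system */
  long long good = (long long)next_index << 12;   /* 0x100000000, matching folio_pos()+folio_size() */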
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ae76d8e3e1 upstream.
[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Neither behavior is that obvious on SATA SSDs, as SATA SSDs have NCQ to
merge the reads, and SATA SSDs can not handle high queue depth well
anyway.
[FIX]
For now this patch focuses on the read speed fix. The dev-replace speed
needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8,
which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3c771c1944 upstream.
One of the bottlenecks of the new scrub code is the extra csum tree
search.
The old code would only do the csum tree search for each scrub bio,
which can be as large as 512KiB, thus they can afford to allocate a new
path each time.
But the new scrub code is doing a csum tree search for each stripe, which
is only 64KiB, so we'd better re-use the same csum path during each search.
This patch introduces a per-sctx path for csum tree search, so that we
don't need to re-allocate the path every time we need to do a csum tree
search.
With this change we can further improve the queue depth and improve the
scrub read performance:
Before (with regression and cached extent tree path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 15875.00 1013328.00 12.00 0.08 0.08 63.83 1.35 100.00
After (with both cached extent/csum tree path):
nvme0n1p3 17759.00 1133280.00 10.00 0.06 0.08 63.81 1.50 100.00
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1dc4888e72 upstream.
Since commit e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer re-uses the same path
for extent tree search.
This can lead to unnecessary extent tree search, especially for the new
stripe based scrub, as we have way more stripes to prepare.
This patch re-introduces a shared path for extent tree search, and
properly releases it when the block group is scrubbed.
This change alone can improve scrub performance slightly by reducing the
time spent preparing the stripe, thus improving the queue depth.
Before (with regression):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
After (with this patch):
nvme0n1p3 15875.00 1013328.00 12.00 0.08 0.08 63.83 1.35 100.00
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d167aa76dc upstream.
The function btrfs_validate_super() should verify the fsid in the provided
superblock argument. Because, all its callers expect it to do that.
Such as in the following stack:
write_all_supers()
   sb = fs_info->super_for_commit;
   btrfs_validate_write_super(.., sb)
     btrfs_validate_super(.., sb, ..)

scrub_one_super()
   btrfs_validate_super(.., sb, ..)

And

check_dev_super()
   btrfs_validate_super(.., sb, ..)
However, it currently verifies the fs_info::super_copy::fsid instead,
which is not correct. Fix this using the correct fsid in the superblock
argument.
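A sketch of the idea (not the verbatim diff): the fsid comparison should
read from the sb argument:
  /* In btrfs_validate_super(): compare against the superblock we were
   * given, not the cached fs_info->super_copy. */
  if (memcmp(fs_info->fs_devices->fsid, sb->fsid, BTRFS_FSID_SIZE) != 0) {
          /* report the fsid mismatch and fail validation */
  }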
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5b135b382a upstream.
Now that we have moved the activation from reservation time to write time,
we no longer need to ensure all the reserved bytes are properly activated,
so we can re-enable metadata over-commit.
Without metadata over-commit, performance suffers because the delalloc
items need to be flushed more often and more block groups need to be
allocated. Re-enabling metadata over-commit solves the issue.
Fixes: 79417d040f ("btrfs: zoned: disable metadata overcommit for zoned")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e7f1326cc2 upstream.
One of the CI runs triggered the following panic
assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
------------[ cut here ]------------
kernel BUG at fs/btrfs/subpage.c:229!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 0 PID: 923660 Comm: btrfs Not tainted 6.5.0-rc3+ #1
pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : btrfs_subpage_assert+0xbc/0xf0
lr : btrfs_subpage_assert+0xbc/0xf0
sp : ffff800093213720
x29: ffff800093213720 x28: ffff8000932138b4 x27: 000000000c280000
x26: 00000001b5d00000 x25: 000000000c281000 x24: 000000000c281fff
x23: 0000000000001000 x22: 0000000000000000 x21: ffffff42b95bf880
x20: ffff42b9528e0000 x19: 0000000000001000 x18: ffffffffffffffff
x17: 667274622f736620 x16: 6e69202c65746176 x15: 0000000000000028
x14: 0000000000000003 x13: 00000000002672d7 x12: 0000000000000000
x11: ffffcd3f0ccd9204 x10: ffffcd3f0554ae50 x9 : ffffcd3f0379528c
x8 : ffff800093213428 x7 : 0000000000000000 x6 : ffffcd3f091771e8
x5 : ffff42b97f333948 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff42b9556cde80 x0 : 000000000000004f
Call trace:
btrfs_subpage_assert+0xbc/0xf0
btrfs_subpage_set_dirty+0x38/0xa0
btrfs_page_set_dirty+0x58/0x88
relocate_one_page+0x204/0x5f0
relocate_file_extent_cluster+0x11c/0x180
relocate_data_extent+0xd0/0xf8
relocate_block_group+0x3d0/0x4e8
btrfs_relocate_block_group+0x2d8/0x490
btrfs_relocate_chunk+0x54/0x1a8
btrfs_balance+0x7f4/0x1150
btrfs_ioctl+0x10f0/0x20b8
__arm64_sys_ioctl+0x120/0x11d8
invoke_syscall.constprop.0+0x80/0xd8
do_el0_svc+0x6c/0x158
el0_svc+0x50/0x1b0
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x194/0x198
Code: 91098021 b0007fa0 91346000 97e9c6d2 (d4210000)
This is the same problem outlined in 17b17fcd6d ("btrfs:
set_page_extent_mapped after read_folio in btrfs_cont_expand"), and the
fix is the same. I originally looked for the same pattern elsewhere in
our code, but mistakenly skipped over this code because I saw the page
cache readahead before we set_page_extent_mapped, not realizing that
this was only in the !page case and that we can still end up with a
!uptodate page and then do the btrfs_read_folio() further down.
The fix here is the same as the above mentioned patch, move the
set_page_extent_mapped call to after the btrfs_read_folio() block to
make sure that we have the subpage blocksize stuff setup properly before
using the page.
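A sketch of the reordering (simplified shape of relocate_one_page(), not
the exact diff):
  if (!PageUptodate(page)) {
          btrfs_read_folio(NULL, page_folio(page));
          lock_page(page);
          /* ... re-check that the page is uptodate and still mapped ... */
  }
  /* Attach the extent-mapped (subpage) private data only now, so the subpage
   * bitmaps exist before the page is dirtied. */
  ret = set_page_extent_mapped(page);
  if (ret < 0)
          goto release_page;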
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4490e803e1 upstream.
When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a
running transaction we end up creating one. This goes against the purpose
of TRANS_JOIN_NOSTART which is to join a running transaction if its state
is at or below the state TRANS_STATE_COMMIT_START, otherwise return an
-ENOENT error and don't start a new transaction. So fix this to not create
a new transaction if there's no running transaction at or below that
state.
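A sketch of the intended semantics in start_transaction(); the helper below
is hypothetical and only illustrates the rule:
  /* TRANS_JOIN_NOSTART: join a running transaction whose state is at or
   * below TRANS_STATE_COMMIT_START, otherwise fail; never create one. */
  if (type == TRANS_JOIN_NOSTART && !joinable_transaction_exists(fs_info))
          return ERR_PTR(-ENOENT);      /* hypothetical check, for illustration */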
CC: stable@vger.kernel.org # 4.14+
Fixes: a6d155d2e3 ("Btrfs: fix deadlock between fiemap and transaction commits")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e28b02118b upstream.
If we do a write whose bio suffers an error, we will never reclaim the
qgroup reserved space for it. We allocate the space in the write_iter
codepath, then release the reservation as we allocate the ordered
extent, but we only create a delayed ref if the ordered extent finishes.
If it has an error, we simply leak the rsv. This is apparent in running
any error injecting (dmerror) fstests like btrfs/146 or btrfs/160. Such
tests fail due to dmesg on umount complaining about the leaked qgroup
data space.
When we clean up other aspects of space on failed ordered_extents, also
free the qgroup rsv.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a649684967 upstream.
btrfs_start_transaction reserves metadata space of the PERTRANS type
before it identifies a transaction to start/join. This allows flushing
when reserving that space without a deadlock. However, it results in a
race which temporarily breaks qgroup rsv accounting.
T1                                              T2
start_transaction
do_stuff
                                                start_transaction
                                                  qgroup_reserve_meta_pertrans
commit_transaction
  qgroup_free_meta_all_pertrans
                                                hit an error starting txn
                                                goto reserve_fail
                                                qgroup_free_meta_pertrans (already freed!)
The basic issue is that there is nothing preventing another commit from
committing before start_transaction finishes (in fact sometimes we
intentionally wait for it) so any error path that frees the reserve is
at risk of this race.
While this exact space was getting freed anyway, and it's not a huge
deal to double free it (just a warning, the free code catches this), it
can result in incorrectly freeing some other pertrans reservation in
this same reservation, which could then lead to spuriously granting
reservations we might not have the space for. Therefore, I do believe it
is worth fixing.
To fix it, use the existing prealloc->pertrans conversion mechanism.
When we first reserve the space, we reserve prealloc space and only when
we are sure we have a transaction do we convert it to pertrans. This way
any racing commits do not blow away our reservation, but we still get a
pertrans reservation that is freed when _this_ transaction gets committed.
This issue can be reproduced by running generic/269 with either qgroups
or squotas enabled via mkfs on the scratch device.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 332581bde2 upstream.
When multiple writes happen at once, we may need to sacrifice a currently
active block group to be zone finished for a new allocation. We choose a
block group with the least free space left, and zone finish it.
To do the finishing, we need to send IOs for the already allocated region
and wait for them and for any on-going IOs. Otherwise, these IOs fail
because the zone is already finished at the time the IO reaches a device.
However, if a block group dedicated to data relocation is zone finished,
there is a chance of finishing it before an ongoing write IO reaches the
device. That is because there is a timing gap between when an allocation
is done (block_group->reservations == 0, as pre-allocation is done) and
when an ordered extent is created as the relocation IO starts.
Thus, if we finish the zone between them, we can fail the IOs.
We cannot simply use "fs_info->data_reloc_bg == block_group->start" to
avoid the zone finishing, because the data_reloc_bg may have already
switched to a new block group while there are still ongoing write IOs to
the old data_reloc_bg.
So, this patch reworks the BLOCK_GROUP_FLAG_ZONED_DATA_RELOC bit to
indicate there is a data relocation allocation and/or ongoing write to the
block group. The bit is set on allocation and cleared in end_io function of
the last IO for the currently allocated region.
Changing the timing of the bit setting also solves the issue of the bit
being left set even after there is no IO going on. With the current code,
if the data_reloc_bg switches after the last IO to the current
data_reloc_bg, the bit is set at that point and there is no one to clear
it. As a result, that block group is kept unallocatable for anything.
Fixes: 343d8a3085 ("btrfs: zoned: prevent allocation from previous data relocation BG")
Fixes: 74e91b12b1 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b8bd342d50 upstream.
During our debugging of glusterfs, we found an Assertion failed error:
inode_lookup >= nlookup, which was caused by the nlookup value in the
kernel being greater than that in the FUSE file system.
The issue was introduced by fuse_direntplus_link(): in that function,
fuse_iget() increments nlookup, and if d_splice_alias() returns failure,
fuse_direntplus_link() returns failure without decrementing nlookup.
https://github.com/gluster/glusterfs/pull/4081
Signed-off-by: ruanmeisi <ruan.meisi@zte.com.cn>
Fixes: 0b05b18381 ("fuse: implement NFS-like readdirplus support")
Cc: <stable@vger.kernel.org> # v3.9
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d2d9bb3b6d upstream.
https://bugzilla.kernel.org/show_bug.cgi?id=216050
Somehow we're getting a page which has a different mapping.
Let's avoid the infinite loop.
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 194505b55d upstream.
The commit referenced below opened up concurrent unaligned dio under
shared locking for pure overwrites. In doing so, it enabled use of
the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected
-EAGAIN returns as an extra precaution, since ext4 does not retry
writes in such cases. The flag itself is advisory in this case since
ext4 checks for unaligned I/Os and uses appropriate locking up
front, rather than on a retry in response to -EAGAIN.
As it turns out, the warning check is susceptible to false positives
because there are scenarios where -EAGAIN can be expected from lower
layers without necessarily having IOCB_NOWAIT set on the iocb. For
example, one instance of the warning has been seen where io_uring
sets IOCB_HIPRI, which in turn results in REQ_POLLED|REQ_NOWAIT on
the bio. This results in -EAGAIN if the block layer is unable to
allocate a request, etc. [Note that there is an outstanding patch to
untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on
IOCB_NOWAIT, which would also address this instance of the warning.]
Another instance of the warning has been reproduced by syzbot. A dio
write is interrupted down in __get_user_pages_locked() waiting on
the mm lock and returns -EAGAIN up the stack. If the iomap dio
iteration layer has made no progress on the write to this point,
-EAGAIN returns up to the filesystem and triggers the warning.
This use of the overwrite flag in ext4 is precautionary and
half-baked. I.e., ext4 doesn't actually implement overwrite checking
in the iomap callbacks when the flag is set, so the only extra
verification it provides are i_size checks in the generic iomap dio
layer. Combined with the tendency for false positives, the added
verification is not worth the extra trouble. Remove the flag,
associated warning, and update the comments to document when
concurrent unaligned dio writes are allowed and why said flag is not
used.
Cc: stable@kernel.org
Reported-by: syzbot+5050ad0fb47527b1808a@syzkaller.appspotmail.com
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Fixes: 310ee0902b ("ext4: allow concurrent unaligned dio overwrites")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810165559.946222-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7ca4b085f4 upstream.
If the filename casefolding fails, we'll be leaking memory from the
fscrypt_name struct, namely from the 'crypto_buf.name' member.
Make sure we free it in the error path on both ext4_fname_setup_filename()
and ext4_fname_prepare_lookup() functions.
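A sketch of the error-path change (assumed shape, not the verbatim patch):
  /* In ext4_fname_setup_filename() (and similarly in
   * ext4_fname_prepare_lookup()): if casefold setup fails, free what
   * fscrypt_setup_filename() already allocated, i.e. crypto_buf.name. */
  err = ext4_fname_setup_ci_filename(dir, iname, fname);
  if (err)
          ext4_fname_free_filename(fname);
  return err;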
Cc: stable@kernel.org
Fixes: 1ae98e295f ("ext4: optimize match for casefolded encrypted dirs")
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230803091713.13239-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 68228da51c upstream.
When setup_system_zone() runs, flex_bg is not yet initialized, so it is always 1.
Use a new helper function, ext4_num_base_meta_blocks() which does not
depend on sbi->s_log_groups_per_flex being initialized.
[ Squashed two patches in the Link URL's below together into a single
commit, which is simpler to review/understand. Also fix checkpatch
warnings. --TYT ]
Cc: stable@kernel.org
Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com>
Link: https://lore.kernel.org/r/tencent_21AF0D446A9916ED5C51492CC6C9A0A77B05@qq.com
Link: https://lore.kernel.org/r/tencent_D744D1450CC169AEA77FCF0A64719909ED05@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 768d612f79 upstream.
Yikebaer reported an issue:
==================================================================
BUG: KASAN: slab-use-after-free in ext4_es_insert_extent+0xc68/0xcb0
fs/ext4/extents_status.c:894
Read of size 4 at addr ffff888112ecc1a4 by task syz-executor/8438
CPU: 1 PID: 8438 Comm: syz-executor Not tainted 6.5.0-rc5 #1
Call Trace:
[...]
kasan_report+0xba/0xf0 mm/kasan/report.c:588
ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894
ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
ext4_zero_range fs/ext4/extents.c:4622 [inline]
ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
[...]
Allocated by task 8438:
[...]
kmem_cache_zalloc include/linux/slab.h:693 [inline]
__es_alloc_extent fs/ext4/extents_status.c:469 [inline]
ext4_es_insert_extent+0x672/0xcb0 fs/ext4/extents_status.c:873
ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
ext4_zero_range fs/ext4/extents.c:4622 [inline]
ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
[...]
Freed by task 8438:
[...]
kmem_cache_free+0xec/0x490 mm/slub.c:3823
ext4_es_try_to_merge_right fs/ext4/extents_status.c:593 [inline]
__es_insert_extent+0x9f4/0x1440 fs/ext4/extents_status.c:802
ext4_es_insert_extent+0x2ca/0xcb0 fs/ext4/extents_status.c:882
ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
ext4_zero_range fs/ext4/extents.c:4622 [inline]
ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
[...]
==================================================================
The flow of issue triggering is as follows:
1. remove es
         raw es                  es removed    es1
   |-------------------|  ->  |----|.......|------|
2. insert es
         es insert             es1 merge with es         es1 merge with es and free es1
   |----|.......|------|  ->  |------------|------|  ->  |-------------------|
es merges with newes, then merges with es1, frees es1, then determines
if es1->es_len is 0 and triggers a UAF.
The code flow is as follows:
ext4_es_insert_extent
  es1 = __es_alloc_extent(true);
  es2 = __es_alloc_extent(true);
  __es_remove_extent(inode, lblk, end, NULL, es1)
    __es_insert_extent(inode, &newes, es1) ---> insert es1 to es tree
  __es_insert_extent(inode, &newes, es2)
    ext4_es_try_to_merge_right
      ext4_es_free_extent(inode, es1) ---> es1 is freed
  if (es1 && !es1->es_len)
    // Trigger UAF by determining if es1 is used.
We determine whether es1 or es2 is used immediately after calling
__es_remove_extent() or __es_insert_extent() to avoid triggering a
UAF if es1 or es2 is freed.
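A sketch of that reordering (simplified, not the verbatim patch):
  __es_remove_extent(inode, lblk, end, NULL, es1);
  if (es1) {
          if (!es1->es_len)
                  __es_free_extent(es1);  /* preallocated entry was never consumed */
          es1 = NULL;                     /* either way, never dereference it later */
  }

  err = __es_insert_extent(inode, &newes, es2);
  if (es2) {
          if (!es2->es_len)
                  __es_free_extent(es2);
          es2 = NULL;
  }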
Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com>
Closes: https://lore.kernel.org/lkml/CALcu4raD4h9coiyEBL4Bm0zjDwxC2CyPiTwsP3zFuhot6y9Beg@mail.gmail.com
Fixes: 2a69c45008 ("ext4: using nofail preallocation in ext4_es_insert_extent()")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230815070808.3377171-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2dfba3bb40 upstream.
We got a filesystem inconsistency issue below while running generic/475
I/O failure pressure test with fast_commit feature enabled.
Symlink /p3/d3/d1c/d6c/dd6/dce/l101 (inode #132605) is invalid.
If fast_commit feature is enabled, a special fast_commit journal area is
appended to the end of the normal journal area. The journal->j_last points
to the first unused block behind the normal journal area instead of the
whole log area, and the journal->j_fc_last points to the first unused
block behind the fast_commit journal area. While doing journal recovery,
do_one_pass(PASS_SCAN) should first scan the normal journal area and turn
around to the first block once it meets journal->j_last, but the wrap()
macro misuses journal->j_fc_last, so the recovery cannot read the next
magic block (commit block perhaps) and mistakenly ends early, missing tN
and every transaction after it in the following example. Finally, this can
lead to filesystem inconsistency.
| normal journal area                             | fast commit area |
+-------------------------------------------------+------------------+
| tN(rear) | tN+1 |~| tN-x |...| tN-1 | tN(front) |       ....       |
+-------------------------------------------------+------------------+
/                                                 /                  /
start                              journal->j_last   journal->j_fc_last
This patch fixes it by using the correct end, journal->j_last.
Fixes: 5b849b5f96 ("jbd2: fast commit recovery path")
Cc: stable@kernel.org
Reported-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/linux-ext4/20230613043120.GB1584772@mit.edu/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230626073322.3956567-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 590a809ff7 upstream.
The following process will corrupt an ext4 image:
Step 1:
jbd2_journal_commit_transaction
__jbd2_journal_insert_checkpoint(jh, commit_transaction)
// Put jh into trans1->t_checkpoint_list
journal->j_checkpoint_transactions = commit_transaction
// Put trans1 into journal->j_checkpoint_transactions
Step 2:
do_get_write_access
test_clear_buffer_dirty(bh) // clear buffer dirty,set jbd dirty
__jbd2_journal_file_buffer(jh, transaction) // jh belongs to trans2
Step 3:
drop_cache
journal_shrink_one_cp_list
jbd2_journal_try_remove_checkpoint
if (!trylock_buffer(bh)) // lock bh, true
if (buffer_dirty(bh)) // buffer is not dirty
__jbd2_journal_remove_checkpoint(jh)
// remove jh from trans1->t_checkpoint_list
Step 4:
jbd2_log_do_checkpoint
trans1 = journal->j_checkpoint_transactions
// jh is not in trans1->t_checkpoint_list
jbd2_cleanup_journal_tail(journal) // trans1 is done
Step 5: Power cut, trans2 is not committed, jh is lost in next mounting.
Fix it by checking 'jh->b_transaction' before removing it from the checkpoint list.
Cc: stable@kernel.org
Fixes: 46f881b5b1 ("jbd2: fix a race when checking checkpoint buffer busy")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230714025528.564988-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 373ac52179 upstream.
journal_clean_one_cp_list() has been merged into
journal_shrink_one_cp_list(), but doing checkpoint buffer cleanup from the
committing process is just a best effort; it should stop scanning once it
meets a busy buffer, or else it will cause a lot of invalid buffer scans
and checks. We caught a performance regression when running the fs_mark
test below.
Test cmd:
./fs_mark -d scratch -s 1024 -n 10000 -t 1 -D 100 -N 100
Before merging checkpoint buffer cleanup:
FSUse% Count Size Files/sec App Overhead
95 10000 1024 8304.9 49033
After merging checkpoint buffer cleanup:
FSUse% Count Size Files/sec App Overhead
95 10000 1024 7649.0 50012
FSUse% Count Size Files/sec App Overhead
95 10000 1024 2107.1 50871
After merging checkpoint buffer cleanup, the total loop count in
journal_shrink_one_cp_list() could be up to 6,261,600+ (50,000+ ~
100,000+ in general), most of which are invalid. This patch fixes it by
passing a 'shrink_type' into journal_shrink_one_cp_list() and adding a
new 'SHRINK_BUSY_STOP' type to indicate that it should stop once it meets
a busy buffer (see the sketch after the numbers below). After the fix,
the loop count drops back to 10,000+.
After this fix:
FSUse% Count Size Files/sec App Overhead
95 10000 1024 8558.4 49109
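A minimal sketch of the busy-stop idea (names follow the description above;
the full enum and loop are simplified assumptions):
  enum shrink_type { SHRINK_DESTROY, SHRINK_BUSY_STOP, SHRINK_BUSY_SKIP };

  /* Inside the journal_shrink_one_cp_list() scan loop: */
  if (buffer_busy) {                    /* placeholder for the busy check */
          if (type == SHRINK_BUSY_STOP)
                  break;                /* committing process: stop at the first busy buffer */
          continue;                     /* shrinker: skip it and keep scanning */
  }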
Cc: stable@kernel.org
Fixes: b98dba273a ("jbd2: remove journal_clean_one_cp_list()")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230714025528.564988-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit efc0b0bcff ]
In addition to the EINVAL, there may be an ENOMEM.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 70431bfd82 ("cifs: Support fscache indexing rewrite")
Signed-off-by: Katya Orlova <e.orlova@ispras.ru>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b74cd55aa9 ]
First, function gfs2_ail_flush_reqd checks the SDF_FORCE_AIL_FLUSH flag
to determine if an AIL flush should be forced in low-memory situations.
However, it also immediately clears the flag, and when called repeatedly
as in function gfs2_logd, the flag will be lost. Fix that by pulling
the SDF_FORCE_AIL_FLUSH flag check out of gfs2_ail_flush_reqd.
Second, function gfs2_writepages sets the SDF_FORCE_AIL_FLUSH flag
whether or not enough pages were written. If enough pages could be
written, flushing the AIL is unnecessary, though.
Third, gfs2_writepages doesn't wake up logd after setting the
SDF_FORCE_AIL_FLUSH flag, so it can take a long time for logd to react.
It would be preferable to wake up logd, but that hurts the performance
of some workloads and we don't quite understand why yet, so don't wake
up logd for now.
Fixes: b066a4eebd ("gfs2: forcibly flush ail to relieve memory pressure")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 96562c45af upstream.
It is an almost improbable error case, but when the page allocating loop
in nfs4_get_device_info() fails we should only free the already allocated
pages, as __free_page() can't deal with NULL arguments.
Found by Linux Verification Center (linuxtesting.org).
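A sketch of the corrected cleanup (assumed variable names; only the pages
that were actually allocated get freed):
  for (i = 0; i < max_pages; i++) {
          pages[i] = alloc_page(gfp_flags);
          if (!pages[i]) {
                  /* free only pages[0..i-1]; __free_page() cannot take NULL */
                  while (i--)
                          __free_page(pages[i]);
                  return NULL;          /* simplified error return */
          }
  }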
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 88975a5596 upstream.
We must ensure that the subrequests are joined back into the head before
we can retransmit a request. If the head was not on the commit lists,
because the server wrote it synchronously, we still need to add it back
to the retransmission list.
Add a call that mirrors the effect of nfs_cancel_remove_inode() for
O_DIRECT.
Fixes: ed5d588fe4 ("NFS: Try to join page groups before an O_DIRECT retransmission")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b6d44d4231 upstream.
We read and cache directory contents when we get a directory
lease, so we should ask for read permission to read the contents
of the directory.
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 09ee7a3bf8 upstream.
The ChannelSequence field in the SMB3 header is supposed to be
increased after reconnect to allow the server to distinguish
requests from before and after the reconnect. We had always
been setting it to zero. There are cases where incrementing
ChannelSequence on requests after network reconnects can reduce
the chance of data corruptions.
See MS-SMB2 3.2.4.1 and 3.2.7.1
Signed-off-by: Steve French <stfrench@microsoft.com>
Cc: stable@vger.kernel.org # 5.16+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0559f63057 upstream.
When the kernfs_iattr_rwsem was introduced a case was missed.
The update of the kernfs directory node child count was also protected
by the kernfs_rwsem and needs to be included in the change so that the
child count (and so the inode n_link attribute) does not change while
holding the rwsem for read.
Fixes: 9caf696142 ("kernfs: Introduce separate rwsem to protect inode attributes.")
Cc: stable <stable@kernel.org>
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-By: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Eric Sandeen <sandeen@sandeen.net>
Link: https://lore.kernel.org/r/169128520941.68052.15749253469930138901.stgit@donald.themaw.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fe8c3623ab upstream.
After commit 30696378f6 ("pstore/ram: Do not treat empty buffers as
valid"), initialization would assume a prz was valid after seeing that
the buffer_size is zero (regardless of the buffer start position). This
unchecked start value means it could be outside the bounds of the buffer,
leading to future access panics when written to:
sysdump_panic_event+0x3b4/0x5b8
atomic_notifier_call_chain+0x54/0x90
panic+0x1c8/0x42c
die+0x29c/0x2a8
die_kernel_fault+0x68/0x78
__do_kernel_fault+0x1c4/0x1e0
do_bad_area+0x40/0x100
do_translation_fault+0x68/0x80
do_mem_abort+0x68/0xf8
el1_da+0x1c/0xc0
__raw_writeb+0x38/0x174
__memcpy_toio+0x40/0xac
persistent_ram_update+0x44/0x12c
persistent_ram_write+0x1a8/0x1b8
ramoops_pstore_write+0x198/0x1e8
pstore_console_write+0x94/0xe0
...
To avoid this, also check if the prz start is 0 during the initialization
phase. If not, the next prz sanity check case will discover it (start >
size) and zap the buffer back to a sane state.
Fixes: 30696378f6 ("pstore/ram: Do not treat empty buffers as valid")
Cc: Yunlong Xing <yunlong.xing@unisoc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Enlin Mu <enlin.mu@unisoc.com>
Link: https://lore.kernel.org/r/20230801060432.1307717-1-yunlong.xing@unisoc.com
[kees: update commit log with backtrace and clarifications]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 919dc32095 upstream.
If an fsverity builtin signature is given for a file but the
".fs-verity" keyring is empty, there's no real reason to run the PKCS#7
parser. Skip this to avoid the PKCS#7 attack surface when builtin
signature support is configured into the kernel but is not being used.
This is a hardening improvement, not a fix per se, but I've added
Fixes and Cc stable to get it out to more users.
Fixes: 432434c9f8 ("fs-verity: support builtin file signatures")
Cc: stable@vger.kernel.org
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20230820173237.2579-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c53e847ff upstream.
All posix lock ops, for all lockspaces (gfs2 file systems) are
sent to userspace (dlm_controld) through a single misc device.
The dlm_controld daemon reads the ops from the misc device
and sends them to other cluster nodes using separate, per-lockspace
cluster api communication channels. The ops for a single lockspace
are ordered at this level, so that the results are received in
the same sequence that the requests were sent. When the results
are sent back to the kernel via the misc device, they are again
funneled through the single misc device for all lockspaces. When
the dlm code in the kernel processes the results from the misc
device, these results will be returned in the same sequence that
the requests were sent, on a per-lockspace basis. A recent change
in this request/reply matching code missed the "per-lockspace"
check (fsid comparison) when matching request and reply, so replies
could be incorrectly matched to requests from other lockspaces.
Cc: stable@vger.kernel.org
Reported-by: Barry Marson <bmarson@redhat.com>
Fixes: 57e2c2f2d9 ("fs: dlm: fix mismatch of plock results from userspace")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ccf61486fe upstream.
Due to an oversight in commit 1b3044e39a ("procfs: fix pthread
cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
chmod operations on /proc/thread-self/comm were no longer blocked as
they are on almost all other procfs files.
A very similar situation with /proc/self/environ was used as a root
exploit a long time ago, but procfs has SB_I_NOEXEC so this is simply a
correctness issue.
Ref: https://lwn.net/Articles/191954/
Ref: 6d76fa58b0 ("Don't allow chmod() on the /proc/<pid>/ files")
Fixes: 1b3044e39a ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230713141001.27046-1-cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 091a4dfbb1 ]
After remount, F2FS_OPTION().compress_level was incorrectly assigned
LZ4HC_DEFAULT_CLEVEL, resulting in lz4hc:9 being enabled; fix it.
1. mount /dev/vdb
/dev/vdb on /mnt/f2fs type f2fs (...,compress_algorithm=lz4,compress_log_size=2,...)
2. mount -t f2fs -o remount,compress_log_size=3 /mnt/f2fs/
3. mount|grep f2fs
/dev/vdb on /mnt/f2fs type f2fs (...,compress_algorithm=lz4:9,compress_log_size=3,...)
Fixes: 00e120b5e4 ("f2fs: assign default compression level")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 005abf9e5e ]
Previously, we have two mechanisms to cache & submit small discards:
a) set max small discard number in /sys/fs/f2fs/vdb/max_small_discards,
and checkpoint will cache small discard candidates w/ configured maximum
number.
b) call FITRIM ioctl, also, checkpoint in f2fs_trim_fs() will cache small
discard candidates w/ configured discard granularity, but w/o limitation
of number. FSTRIM interface is asynchronized, so it won't submit discard
directly.
Finally, discard thread will submit them in background periodically.
However, after commit 9ac00e7cef ("f2fs: do not issue small discard
commands during checkpoint"), the mechanism a) is broken, since no matter
how we configure the sysfs entry /sys/fs/f2fs/vdb/max_small_discards,
checkpoint will not cache small discard candidates any more.
echo 0 > /sys/fs/f2fs/vdb/max_small_discards
xfs_io -f /mnt/f2fs/file -c "pwrite 0 2m" -c "fsync"
xfs_io /mnt/f2fs/file -c "fpunch 0 4k"
sync
cat /proc/fs/f2fs/vdb/discard_plist_info |head -2
echo 100 > /sys/fs/f2fs/vdb/max_small_discards
rm /mnt/f2fs/file
xfs_io -f /mnt/f2fs/file -c "pwrite 0 2m" -c "fsync"
xfs_io /mnt/f2fs/file -c "fpunch 0 4k"
sync
cat /proc/fs/f2fs/vdb/discard_plist_info |head -2
Before the patch:
Discard pend list(Show diacrd_cmd count on each entry, .:not exist):
0 . . . . . . . .
Discard pend list(Show diacrd_cmd count on each entry, .:not exist):
0 3 1 . . . . . .
After the patch:
Discard pend list(Show diacrd_cmd count on each entry, .:not exist):
0 . . . . . . . .
Discard pend list(Show diacrd_cmd count on each entry, .:not exist):
0 . . . . . . . .
This patch reverts commit 9ac00e7cef ("f2fs: do not issue small discard
commands during checkpoint") in order to fix this issue.
Fixes: 9ac00e7cef ("f2fs: do not issue small discard commands during checkpoint")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9bf1dcbdfd ]
As reported, status debugfs entry shows inconsistent GC stats as below:
GC calls: 6008 (BG: 6161)
- data segments : 3053 (BG: 3053)
- node segments : 2955 (BG: 2955)
Total GC calls is smaller than BGGC calls; the reason is:
- f2fs_stat_info.call_count accounts total migrated section count
by f2fs_gc()
- f2fs_stat_info.bg_gc accounts total call times of f2fs_gc() from
background gc_thread
Another issue is gc_foreground_calls sysfs entry shows total GC call
count rather than FGGC call count.
This patch changes as below for fix:
- account GC calls and migrated segment count separately
- support to account migrated section count if it enables large section
mode
- fix to show correct value in gc_foreground_calls sysfs entry
Fixes: fc7100ea2a ("f2fs: Add f2fs stats to sysfs")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 958ccbbf1c ]
syzbot reports a f2fs bug as below:
UBSAN: array-index-out-of-bounds in fs/f2fs/f2fs.h:3275:19
index 1409 is out of range for type '__le32[923]' (aka 'unsigned int[923]')
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
ubsan_epilogue lib/ubsan.c:217 [inline]
__ubsan_handle_out_of_bounds+0x11c/0x150 lib/ubsan.c:348
inline_data_addr fs/f2fs/f2fs.h:3275 [inline]
__recover_inline_status fs/f2fs/inode.c:113 [inline]
do_read_inode fs/f2fs/inode.c:480 [inline]
f2fs_iget+0x4730/0x48b0 fs/f2fs/inode.c:604
f2fs_fill_super+0x640e/0x80c0 fs/f2fs/super.c:4601
mount_bdev+0x276/0x3b0 fs/super.c:1391
legacy_get_tree+0xef/0x190 fs/fs_context.c:611
vfs_get_tree+0x8c/0x270 fs/super.c:1519
do_new_mount+0x28f/0xae0 fs/namespace.c:3335
do_mount fs/namespace.c:3675 [inline]
__do_sys_mount fs/namespace.c:3884 [inline]
__se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The issue was bisected to:
commit d48a7b3a72
Author: Chao Yu <chao@kernel.org>
Date: Mon Jan 9 03:49:20 2023 +0000
f2fs: fix to do sanity check on extent cache correctly
The root cause is that we applied both v1 and v2 of the patch; v2 is the
right fix, so we need to revert v1 in order to fix the reported issue.
v1:
commit d48a7b3a72 ("f2fs: fix to do sanity check on extent cache correctly")
https://lore.kernel.org/lkml/20230109034920.492914-1-chao@kernel.org/
v2:
commit 269d119481 ("f2fs: fix to do sanity check on extent cache correctly")
https://lore.kernel.org/lkml/20230207134808.1827869-1-chao@kernel.org/
Reported-by: syzbot+601018296973a481f302@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/000000000000fcf0690600e4d04d@google.com/
Fixes: d48a7b3a72 ("f2fs: fix to do sanity check on extent cache correctly")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3cb88bc159 ]
The commit 25f9080576 ("f2fs: add async reset zone command support")
introduced "async reset zone commands" by calling
__submit_zone_reset_cmd() in async discard operations. However,
__submit_zone_reset_cmd() is called regardless of the zone type of the
discard target zone. When devices have conventional zones, zone reset
commands are sent to the conventional zones and cause I/O errors.
Avoid the I/O errors by checking that the discard target zone type is
sequential write required. If not, handle the discard operation in the
same manner as for non-zoned, regular block devices.
helper function f2fs_bdev_index() which gets index of the zone reset
target device.
Fixes: 25f9080576 ("f2fs: add async reset zone command support")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 51bf8d3c81 ]
f2fs_scan_devices has reopened the main device since the very beginning, which
has always been useless, and also means that we don't pass the right
holder for the reopen, which now leads to a warning as the core super.c
holder ops aren't passed in for the reopen.
Fixes: 3c62be17d4 ("f2fs: support multiple devices")
Fixes: 0718afd47f ("block: introduce holder ops")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c4a123d2e8 ]
The comma at the end of the line was leftover from an earlier refactor
of the _nfs4_pnfs_v3_ds_connect() function. This is technically valid C,
so the compilers didn't catch it, but if I'm understanding how it works
correctly it assigns the return value of rpc_clnt_add_xprt() to
xprtdata.cred.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: a12f996d34 ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5690eed941 ]
If the client sent a synchronous copy and the server replied with
ERR_OFFLOAD_NO_REQ indicating that it wants an asynchronous
copy instead, the client should retry with asynchronous copy.
Fixes: 539f57b3e0 ("NFS handle COPY ERR_OFFLOAD_NO_REQS")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f67b55b658 ]
Commit 64cfca85ba asserts the only valid return values for
nfs2/3_decode_dirent should not include -ENAMETOOLONG, but for a server
that sends a filename3 which exceeds MAXNAMELEN in a READDIR response the
client's behavior will be to endlessly retry the operation.
We could map -ENAMETOOLONG into -EBADCOOKIE, but that would produce
truncated listings without any error. The client should return an error
for this case to clearly assert that the server implementation must be
corrected.
Fixes: 64cfca85ba ("NFS: Return valid errors from nfs2/3_decode_dirent()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6372e2ee62 ]
The XDR specification in RFC 8881 looks like this:
struct device_addr4 {
        layouttype4         da_layout_type;
        opaque              da_addr_body<>;
};

struct GETDEVICEINFO4resok {
        device_addr4        gdir_device_addr;
        bitmap4             gdir_notification;
};

union GETDEVICEINFO4res switch (nfsstat4 gdir_status) {
case NFS4_OK:
        GETDEVICEINFO4resok gdir_resok4;
case NFS4ERR_TOOSMALL:
        count4              gdir_mincount;
default:
        void;
};
Looking at nfsd4_encode_getdeviceinfo() ....
When the client provides a zero gd_maxcount, then the Linux NFS
server implementation encodes the da_layout_type field and then
skips the da_addr_body field completely, proceeding directly to
encode gdir_notification field.
There does not appear to be an option in the specification to skip
encoding da_addr_body. Moreover, Section 18.40.3 says:
> If the client wants to just update or turn off notifications, it
> MAY send a GETDEVICEINFO operation with gdia_maxcount set to zero.
> In that event, if the device ID is valid, the reply's da_addr_body
> field of the gdir_device_addr field will be of zero length.
Since the layout drivers are responsible for encoding the
da_addr_body field, put this fix inside the ->encode_getdeviceinfo
methods.
Fixes: 9cf514ccfa ("nfsd: implement pNFS operations")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tom Haynes <loghyr@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit de8d38cf44 ]
clang's static analysis warning: fs/lockd/mon.c: line 293, column 2:
Null pointer passed as 2nd argument to memory copy function.
Assuming 'hostname' is NULL and calling 'nsm_create_handle()', this will
pass NULL as 2nd argument to memory copy function 'memcpy()'. So return
NULL if 'hostname' is invalid.
Fixes: 77a3ef33e2 ("NSM: More clean up of nsm_get_handle()")
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1524773425 ]
Running generic/475(filesystem consistent tests after power cut) could
easily trigger unattached inode error while doing fsck:
Unattached zero-length inode 39405. Clear? no
Unattached inode 39405
Connect to /lost+found? no
The above inconsistency is caused by the following process:
P1                                          P2
ext4_create
 inode = ext4_new_inode_start_handle   // itable records nlink=1
 ext4_add_nondir
  err = ext4_add_entry                 // ENOSPC
   ext4_append
    ext4_bread
     ext4_getblk
      ext4_map_blocks                  // returns ENOSPC
  drop_nlink(inode)                    // won't be updated into disk inode
  ext4_orphan_add(handle, inode)
   ext4_orphan_file_add
 ext4_journal_stop(handle)
jbd2_journal_commit_transaction        // commit success
>> power cut <<
ext4_fill_super
 ext4_load_and_init_journal            // itable records nlink=1
  ext4_orphan_cleanup
   ext4_process_orphan
    if (inode->i_nlink)                // true, inode won't be deleted
Then the allocated inode remains reserved on disk but corresponds to no
dentry, so e2fsck reports an 'unattached inode' problem.
The problem won't happen if the orphan file feature is disabled, because
ext4_orphan_add() will update the disk inode in orphan list mode. There
are several places that do not update the disk inode while putting the
inode into the orphan area, such as ext4_add_nondir(), ext4_symlink() and
the whiteout in ext4_rename(). Fix it by updating the on-disk inode in
all error branches of these places.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217605
Fixes: 02f310fcf4 ("ext4: Speedup ext4 orphan inode handling")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230628132011.650383-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 08b45fcb2d ]
This allocation should use the passed in GFP_ flags instead of
GFP_KERNEL. One place where this matters is in filelayout_pg_init_write()
which uses GFP_NOFS as the allocation flags.
Fixes: 5c83746a0c ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 303a780520 ]
I found that the read code might send multiple requests using the same
nfs_pgio_header, but nfs4_proc_read_setup() is only called once. This is
how we ended up occasionally double-freeing the scratch buffer, but also
means we set a NULL pointer but non-zero length to the xdr scratch
buffer. This results in an oops the first time decoding needs to copy
something to scratch, which frequently happens when decoding READ_PLUS
hole segments.
I fix this by moving scratch handling into the pageio read code. I
provide a function to allocate scratch space for decoding read replies,
and free the scratch buffer when the nfs_pgio_header is freed.
Fixes: fbd2a05f29 (NFSv4.2: Rework scratch handling for READ_PLUS)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8d18f6c5bb ]
I bump the decode_read_plus_maxsz to account for hole segments, but I
need to subtract out this increase when calling
rpc_prepare_reply_pages() so the common case of single data segment
replies can be directly placed into the xdr pages without needing to be
shifted around.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: d3b00a802c ("NFS: Replace the READ_PLUS decoding code")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bb05a617f0 ]
Smatch reports:
fs/nfs/nfs42xdr.c:1131 decode_read_plus() warn: missing error code? 'status'
Dan suggests fixing this by returning a hardcoded 0 from the
"if (segments == 0)" check.
Additionally, smatch reports that the "status = -EIO" assignment is not
used. This patch addresses both these issues.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202305222209.6l5VM2lL-lkp@intel.com/
Fixes: d3b00a802c ("NFS: Replace the READ_PLUS decoding code")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 60c672b7f2 ]
ngroups is an ext4_group_t (unsigned int), while next_linear_group() treats
it as an int. If ngroups is bigger than the maximum value an int can hold,
it is treated as a negative number, and "return group + 1 >= ngroups ? 0 :
group + 1;" may keep returning 0.
Switch from int to ext4_group_t in next_linear_group() to fix the overflow.
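A hedged illustration of the type change (the signature is simplified
relative to the real helper in fs/ext4/mballoc.c; shown only to make the
signed-compare hazard concrete):

	/* With an "int ngroups" parameter, an ngroups above INT_MAX compares
	 * as negative, so "group + 1 >= ngroups" is always true and the
	 * helper keeps returning 0. With ext4_group_t it wraps correctly. */
	static ext4_group_t next_linear_group(ext4_group_t group,
					      ext4_group_t ngroups)
	{
		return group + 1 >= ngroups ? 0 : group + 1;
	}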
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a9ce5993a0 ]
The group corruption check accesses memory through grp and will trigger a
kernel crash if grp is NULL. So do the NULL check before the corruption
check.
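A hedged sketch of the reordered check (context simplified; what is returned
on the skip path depends on the caller):

	grp = ext4_get_group_info(sb, group);
	if (!grp || unlikely(EXT4_MB_GRP_BMAP_CORRUPT(grp)))
		return 0;	/* skip this group instead of dereferencing NULL */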
Fixes: 5354b2af34 ("ext4: allow ext4_get_group_info() to fail")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dabc8b2075 ]
Calls to dquot_mark_dquot_dirty() that use dquot references from the inode
should be protected by dquot_srcu. The quota_off code takes care to call
synchronize_srcu(&dquot_srcu) so that dquot references are not dropped while
they are still used by other users. But dquot_transfer() breaks this
assumption. We call dquot_transfer() to drop the last reference of a dquot
and add it to free_dquots, but there may still be other users using the
dquot at this time, as shown in the function graph below:
           cpu1                              cpu2
 _________________|_________________
 wb_do_writeback                        CHOWN(1)
  ...
  ext4_da_update_reserve_space
   dquot_claim_block
    ...
     dquot_mark_dquot_dirty             // try to dirty old quota
      test_bit(DQ_ACTIVE_B, &dquot->dq_flags) // still ACTIVE
      if (test_bit(DQ_MOD_B, &dquot->dq_flags))
      // test no dirty, wait dq_list_lock
      ...
                                        dquot_transfer
                                         __dquot_transfer
                                          dqput_all(transfer_from) // rls old dquot
                                           dqput // last dqput
                                            dquot_release
                                             clear_bit(DQ_ACTIVE_B, &dquot->dq_flags)
                                            atomic_dec(&dquot->dq_count)
                                            put_dquot_last(dquot)
                                             list_add_tail(&dquot->dq_free, &free_dquots)
                                             // add the dquot to free_dquots
      if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
       add dqi_dirty_list // add released dquot to dirty_list
This can cause various issues. For example, the dquot can be destroyed by
dqcache_shrink_scan() after being added to free_dquots, which can trigger a
UAF in dquot_mark_dquot_dirty(). Or, after the dquot is added to free_dquots
and then to dirty_list, it is added to free_dquots again once
dquot_writeback_dquots() is executed, which corrupts the free_dquots list
and triggers a UAF when dqcache_shrink_scan() is called and frees the dquot
twice.
As Honza said, we need to fix dquot_transfer() to follow the guarantees
dquot_srcu should provide. But calling synchronize_srcu() directly from
dquot_transfer() is too expensive (and mostly unnecessary). So we add
dquot whose last reference should be dropped to the new global dquot
list releasing_dquots, and then queue work item which would call
synchronize_srcu() and after that perform the final cleanup of all the
dquots on releasing_dquots.
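A hedged sketch of the mechanism described above (list and work item names
follow the commit text; list handling and locking are omitted):

	static LIST_HEAD(releasing_dquots);

	static void quota_release_workfn(struct work_struct *work)
	{
		/* after this, no dquot_srcu reader can still be using
		 * anything that was moved onto releasing_dquots */
		synchronize_srcu(&dquot_srcu);
		/* ... final dquot_release() and freeing of each entry ... */
	}
	static DECLARE_DELAYED_WORK(quota_release_work, quota_release_workfn);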
Fixes: 4580b30ea8 ("quota: Do not dirty bad dquots")
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230630110822.3881712-5-libaokun1@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 33bcfafc48 ]
Add new helper function dquot_active() to make the code more concise.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230630110822.3881712-4-libaokun1@huawei.com>
Stable-dep-of: dabc8b2075 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4b9bdfa165 ]
Now we have a helper function dquot_dirty() to determine if dquot has
DQ_MOD_B bit. dquot_active() can easily be misunderstood as a helper
function to determine if dquot has DQ_ACTIVE_B bit. So we avoid this by
renaming it to inode_quota_active() and later on we will add the helper
function dquot_active() to determine if dquot has DQ_ACTIVE_B bit.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230630110822.3881712-3-libaokun1@huawei.com>
Stable-dep-of: dabc8b2075 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0241284778 ]
Refactor out dquot_write_dquot() to reduce duplicate code.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230630110822.3881712-2-libaokun1@huawei.com>
Stable-dep-of: dabc8b2075 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6b72e5f9e7 ]
Process result of ocfs2_add_entry() in case we have an error
value.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Link: https://lkml.kernel.org/r/20230803145417.177649-1-artem.chernyshev@red-soft.ru
Fixes: ccd979bdbc ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Artem Chernyshev <artem.chernyshev@red-soft.ru>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Kurt Hackel <kurt.hackel@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 91b1ad0815 ]
Currently ztailpacking pclusters are chained with FOLLOWED_NOINPLACE and
not recorded into the managed_pslots XArray.
After commit 7674a42f35 ("erofs: use struct lockref to replace
handcrafted approach"), ztailpacking pclusters won't be freed with
erofs_workgroup_put() anymore, which will cause the following issue:
BUG erofs_pcluster-1 (Tainted: G OE ): Objects remaining in erofs_pcluster-1 on __kmem_cache_shutdown()
Use z_erofs_free_pcluster() directly to free ztailpacking pclusters.
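A hedged sketch of the distinction being drawn (simplified from the z_erofs
pcluster release path):

	if (z_erofs_is_inline_pcluster(pcl))
		/* never entered managed_pslots, so no workgroup refcount */
		z_erofs_free_pcluster(pcl);
	else
		erofs_workgroup_put(&pcl->obj);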
Fixes: 7674a42f35 ("erofs: use struct lockref to replace handcrafted approach")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230822110530.96831-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c02d35d89b ]
When doing a relocation, there is a chance that at the time of
btrfs_reloc_clone_csums(), there is no checksum for the corresponding
region.
In this case, btrfs_finish_ordered_zoned()'s sum points to an invalid item
and so the ordered_extent's logical is set to some invalid value. Then,
btrfs_lookup_block_group() in btrfs_zone_finish_endio() fails to find a
block group and will hit an assert or a null pointer dereference as
follows.
This can be reproduced by running btrfs/028 several times (e.g. 4 to 16
times) with a null_blk setup. The device's zone size and capacity are set to
32 MB and the storage size is set to 5 GB on my setup.
KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
CPU: 6 PID: 3105720 Comm: kworker/u16:13 Tainted: G W 6.5.0-rc6-kts+ #1
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
Code: 41 54 49 89 fc 55 48 89 f5 53 e8 57 7d fc ff 48 8d b8 88 00 00 00 48 89 c3 48 b8 00 00 00 00 00
> 3c 02 00 0f 85 02 01 00 00 f6 83 88 00 00 00 01 0f 84 a8 00 00
RSP: 0018:ffff88833cf87b08 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088
RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102877b827
R10: ffff888143bdc13b R11: ffff888125b1cbc0 R12: ffff888143bdc000
R13: 0000000000007000 R14: ffff888125b1cba8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88881e500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3ed85223d5 CR3: 00000001519b4005 CR4: 00000000001706e0
Call Trace:
<TASK>
? die_addr+0x3c/0xa0
? exc_general_protection+0x148/0x220
? asm_exc_general_protection+0x22/0x30
? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
? btrfs_zone_finish_endio.part.0+0x19/0x160 [btrfs]
btrfs_finish_one_ordered+0x7b8/0x1de0 [btrfs]
? rcu_is_watching+0x11/0xb0
? lock_release+0x47a/0x620
? btrfs_finish_ordered_zoned+0x59b/0x800 [btrfs]
? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs]
? btrfs_finish_ordered_zoned+0x358/0x800 [btrfs]
? __smp_call_single_queue+0x124/0x350
? rcu_is_watching+0x11/0xb0
btrfs_work_helper+0x19f/0xc60 [btrfs]
? __pfx_try_to_wake_up+0x10/0x10
? _raw_spin_unlock_irq+0x24/0x50
? rcu_is_watching+0x11/0xb0
process_one_work+0x8c1/0x1430
? __pfx_lock_acquire+0x10/0x10
? __pfx_process_one_work+0x10/0x10
? __pfx_do_raw_spin_lock+0x10/0x10
? _raw_spin_lock_irq+0x52/0x60
worker_thread+0x100/0x12c0
? __kthread_parkme+0xc1/0x1f0
? __pfx_worker_thread+0x10/0x10
kthread+0x2ea/0x3c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
In zoned mode, writing to a pre-allocated region means a data relocation
write. Such a write always uses a regular WRITE command, so there is no need
for splitting and rewriting the logical address. Thus, we can simply skip
the function in this case.
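A hedged sketch of the early exit (BTRFS_ORDERED_PREALLOC is the flag btrfs
uses for pre-allocated ordered extents; the exact function the check lands
in may differ from this sketch):

	if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered->flags))
		return;	/* relocation write, no logical rewrite needed */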
Fixes: cbfce4c7fb ("btrfs: optimize the logical to physical mapping for zoned writes")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a221ab717c ]
We do not need to release the iomap_page in iomap_invalidate_folio()
to allow the folio to be split. The splitting code will call
->release_folio() if there is still per-fs private data attached to
the folio. At that point, we will check if the folio is still dirty
and decline to release the iomap_page. It is possible to trigger the
warning in perfectly legitimate circumstances (e.g. if a disk read fails,
we do a partial write to the folio, then we truncate the folio), which
will cause those writes to be lost.
Fixes: 60d8231089 ("iomap: Support large folios in invalidatepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d5a4f8f77 ]
The d_hash_and_lookup() function returns error pointers or NULL.
Most incorrect error checks were fixed, but the one in int path_pts()
was forgotten.
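A hedged sketch of the corrected check in path_pts() (surrounding context
omitted):

	child = d_hash_and_lookup(parent, &this);
	if (IS_ERR_OR_NULL(child))
		return IS_ERR(child) ? PTR_ERR(child) : -ENOENT;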
Fixes: eedf265aa0 ("devpts: Make each mount of devpts an independent filesystem.")
Signed-off-by: Wang Ming <machel@vivo.com>
Message-Id: <20230713120555.7025-1-machel@vivo.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 758b492047 ]
For an eventfd with the EFD_SEMAPHORE flag, when its ctx->count is 0,
calling eventfd_ctx_do_read() will cause ctx->count to underflow and wrap
around to ULLONG_MAX.
An underflow can happen with EFD_SEMAPHORE eventfds in at least the
following three subsystems:
(1) virt/kvm/eventfd.c
(2) drivers/vfio/virqfd.c
(3) drivers/virt/acrn/irqfd.c
where (2) and (3) are just modeled after (1). An eventfd must be
specified for use with the KVM_IRQFD ioctl(). This can also be an
EFD_SEMAPHORE eventfd. When the eventfd count is zero or has been
decremented to zero an underflow can be triggered when the irqfd is shut
down by raising the KVM_IRQFD_FLAG_DEASSIGN flag in the KVM_IRQFD
ioctl():
// ctx->count == 0
kvm_vm_ioctl()
-> kvm_irqfd()
   -> kvm_irqfd_deassign()
      -> irqfd_deactivate()
         -> irqfd_shutdown()
            -> eventfd_ctx_remove_wait_queue(&cnt)
               -> eventfd_ctx_do_read(&cnt)
Userspace polling on the eventfd wouldn't notice the underflow because 1
is always returned as the value from eventfd_read() while ctx->count
would've underflowed. It's not a huge deal because this should only happen
when the irqfd is shut down, but we should still fix it and avoid the
spurious wakeup.
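A hedged sketch of what a guarded eventfd_ctx_do_read() can look like (the
actual patch may differ in detail):

	static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
	{
		lockdep_assert_held(&ctx->wqh.lock);

		/* never subtract more than is actually available */
		*cnt = ((ctx->flags & EFD_SEMAPHORE) && ctx->count) ? 1 : ctx->count;
		ctx->count -= *cnt;
	}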
Fixes: cb289d6244 ("eventfd - allow atomic read and waitqueue remove")
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dylan Yudaken <dylany@fb.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <tencent_7588DFD1F365950A757310D764517A14B306@qq.com>
[brauner: rewrite commit message and add explanation how this underflow can happen]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 576d498e0a ]
Same logic applies here: this can fill up the pipe, and pollers that rely
on getting IN_MODIFY notifications never wake up.
Fixes: 983652c691 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <10d76dd8c85017ae3cd047c9b9a32e26daefdaa2.1688393619.git.nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7f0f1ea069 ]
Same logic applies here: this can fill up the pipe and pollers that rely
on getting IN_MODIFY notifications never wake up.
Fixes: 983652c691 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <8d9ad5acb9c5c1dd2376a2ff5da6ac3183115389.1688393619.git.nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 12ee4b66af ]
The current behaviour caused an asymmetry where some write APIs
(write, sendfile) would notify the written-to/read-from objects,
but splice wouldn't.
This affected userspace which uses inotify, most notably coreutils
tail -f, to monitor pipes.
If the pipe buffer had been filled by a splice-family function:
* tail wouldn't know and thus wouldn't service the pipe, and
* all writes to the pipe would block because it's full,
thus service was denied.
(For the particular case of tail -f this could be worked around
with ---disable-inotify.)
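A hedged sketch of the shape of the fix (the exact placement and file
pointers differ per splice call path in fs/splice.c):

	/* after successfully splicing 'ret' bytes from file 'in' to 'out' */
	if (ret > 0) {
		fsnotify_access(in);	/* IN_ACCESS on the source */
		fsnotify_modify(out);	/* IN_MODIFY on the destination */
	}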
Fixes: 983652c691 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <604ec704d933e0e0121d9e107ce914512e045fad.1688393619.git.nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ba38980add ]
__getblk() can return a NULL pointer if we run out of memory or if we
try to access beyond the end of the device; check it and handle it
appropriately.
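A hedged sketch of the guard (how the error is propagated depends on the
caller):

	bh = __getblk(bdev, block, blocksize);
	if (unlikely(!bh))
		return -ENOMEM;	/* out of memory, or block beyond end of device */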
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/lkml/CAFcO6XOacq3hscbXevPQP7sXRoYFz34ZdKPYjmd6k5sZuhGFDw@mail.gmail.com/
Tested-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2") # probably introduced in 2002
Acked-by: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 579c7e4150 upstream.
This reverts commit bfd4766239.
Shinichiro Kawasaki reported:
When I ran workloads on f2fs using v6.5-rcX with fixes [1][2] and a zoned block
devices with 4kb logical block size, I observe mount failure as follows. When
I revert this commit, the failure goes away.
[ 167.781975][ T1555] F2FS-fs (dm-0): IO Block Size: 4 KB
[ 167.890728][ T1555] F2FS-fs (dm-0): Found nat_bits in checkpoint
[ 171.482588][ T1555] F2FS-fs (dm-0): Zone without valid block has non-zero write pointer. Reset the write pointer: wp[0x1300,0x8]
[ 171.496000][ T1555] F2FS-fs (dm-0): (0) : Unaligned zone reset attempted (block 280000 + 80000)
[ 171.505037][ T1555] F2FS-fs (dm-0): Discard zone failed: (errno=-5)
The patch replaced "sbi->log_blocksize - SECTOR_SHIFT" with
"sbi->log_sectors_per_block". However, I think these two are not equal when
the device has a 4k logical block size. The former uses the Linux kernel
sector size of 512 bytes. The latter uses a 512-byte or 4KB sector size
depending on the device. mkfs.f2fs obtains the logical block size via the
BLKSSZGET ioctl from the device and reflects it in the value of
sbi->log_sectors_per_block. This causes unexpected write pointer
calculations in check_zone_write_pointer(), which resulted in the unexpected
zone reset and the mount failure.
[1] https://lkml.kernel.org/linux-f2fs-devel/20230711050101.GA19128@lst.de/
[2] https://lore.kernel.org/linux-f2fs-devel/20230804091556.2372567-1-shinichiro.kawasaki@wdc.com/
Cc: stable@vger.kernel.org
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: bfd4766239 ("f2fs: clean up w/ sbi->log_sectors_per_block")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 91ec6c8559 upstream.
This reverts commit 5a8bee63b1.
Jürg Billeter reports the following regression:
Since v6.3-rc1 commit 5a8bee63b1 ("fuse: in fuse_flush only wait if
someone wants the return code") `fput()` is called asynchronously if a
file is closed as part of a process exiting, i.e., if there was no
explicit `close()` before exit.
If the file was open for writing, also `put_write_access()` is called
asynchronously as part of the async `fput()`.
If that newly written file is an executable, attempting to `execve()` the
new file can fail with `ETXTBSY` if it's called after the writer process
exited but before the async `fput()` has run.
Reported-and-tested-by: "Jürg Billeter" <j@bitron.ch>
Cc: <stable@vger.kernel.org> # v6.3
Link: https://lore.kernel.org/all/4f66cded234462964899f2a661750d6798a57ec0.camel@bitron.ch/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cdaac8e7e5 upstream.
A syzbot stress test using a corrupted disk image reported that
mark_buffer_dirty() called from __nilfs_mark_inode_dirty() or
nilfs_palloc_commit_alloc_entry() may output a kernel warning, and can
panic if the kernel is booted with panic_on_warn.
This is because nilfs2 keeps buffer pointers in local structures for some
metadata and reuses them, but such buffers may be forcibly discarded by
nilfs_clear_dirty_page() in some critical situations.
This issue is reported to appear after commit 28a65b49eb ("nilfs2: do
not write dirty data after degenerating to read-only"), but the issue has
potentially existed before.
Fix this issue by checking the uptodate flag when attempting to reuse an
internally held buffer, and reloading the metadata instead of reusing the
buffer if the flag was lost.
Link: https://lkml.kernel.org/r/20230818131804.7758-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+cdfcae656bac88ba0e2d@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000003da75f05fdeffd12@google.com
Fixes: 8c26c4e269 ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e628bf939a upstream.
Create 3 kinds of files to reproduce this problem.
dd if=/dev/urandom of=127k.bin bs=1024 count=127
dd if=/dev/urandom of=128k.bin bs=1024 count=128
dd if=/dev/urandom of=129k.bin bs=1024 count=129
When copying files from a ksmbd share to Windows or cifs.ko, the following
error message appears on the Windows client:
"The file '129k.bin' is too large for the destination filesystem."
We can see the error logs from ksmbd debug prints
[48394.611537] ksmbd: RDMA r/w request 0x0: token 0x669d, length 0x20000
[48394.612054] ksmbd: smb_direct: RDMA write, len 0x20000, needed credits 0x1
[48394.612572] ksmbd: filename 129k.bin, offset 131072, len 131072
[48394.614189] ksmbd: nbytes 1024, offset 132096 mincount 0
[48394.614585] ksmbd: Failed to process 8 [-22]
And we can reproduce it with cifs.ko,
e.g. dd if=129k.bin of=/dev/null bs=128KB count=2
The problem is that ksmbd RDMA returns an error if the remaining bytes are
less than the Length of a Buffer Descriptor V1 structure.
smb_direct_rdma_xmit()
...
if (desc_buf_len == 0 || total_length > buf_len ||
total_length > t->max_rdma_rw_size)
return -EINVAL;
This patch reduces the descriptor size to the remaining bytes and removes
the check on total_length and buf_len.
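A hedged sketch of the adjustment (variable names follow the quoted snippet;
the real patch reshapes the surrounding loop a bit more):

	if (desc_buf_len == 0 || total_length > t->max_rdma_rw_size)
		return -EINVAL;
	if (desc_buf_len > buf_len)
		desc_buf_len = buf_len;	/* clamp to the remaining bytes */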
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4b081ce0d8 upstream.
If authblob->SessionKey.Length is bigger than the session key size
(CIFS_KEY_SIZE), a slub overflow can happen in the key exchange code:
cifs_arc4_crypt() copies into the session key array from the SessionKey
received from the client.
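A hedged sketch of the bounds check (field names as used in the commit text;
the exact limit constant and error path in ksmbd may differ):

	if (le16_to_cpu(authblob->SessionKey.Length) > CIFS_KEY_SIZE)
		return -EINVAL;	/* reject oversized client session keys */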
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21940
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 17d5b135bb upstream.
If the ->DataOffset of a create context is 0, the DataBuffer size is not
correctly validated. This patch fixes the wrong validation code and takes
the tag length in the request into account.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21824
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e4c1cf523d upstream.
This was accidentally fixed up in commit e4c1cf523d, but we can't
take the full change due to other dependency issues, so here is just
the actual bugfix that is needed.
[Background]
keltargw reported an issue [1] that, with mmapped I/O, the tail of the last
page (after the file ends) is sometimes not filled with zeroes.
The root cause is that such a tail page could be wrongly selected for
in-place I/O, so the zeroed part would then be filled with compressed data
instead of zeroes.
A simple fix is to avoid doing in-place I/O for such tail parts; that was
actually already fixed upstream in commit e4c1cf523d
("erofs: tidy up z_erofs_do_read_page()") by accident.
[1] https://lore.kernel.org/r/3ad8b469-25db-a297-21f9-75db2d6ad224@linux.alibaba.com
Reported-by: keltargw <keltar.gw@gmail.com>
Fixes: 3883a79abd ("staging: erofs: introduce VLE decompression support")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4
issues or aren't considered suitable for a -stable backport"
* tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
shmem: fix smaps BUG sleeping while atomic
selftests: cachestat: catch failing fsync test on tmpfs
selftests: cachestat: test for cachestat availability
maple_tree: disable mas_wr_append() when other readers are possible
madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check
madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check
mm: multi-gen LRU: don't spin during memcg release
mm: memory-failure: fix unexpected return value in soft_offline_page()
radix tree: remove unused variable
mm: add a call to flush_cache_vmap() in vmap_pfn()
selftests/mm: FOLL_LONGTERM need to be updated to 0x100
nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()
mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast
selftests: cgroup: fix test_kmem_basic less than error
mm: enable page walking API to lock vmas during the walk
smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
Merge tag 'nfsd-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Two last-minute one-liners for v6.5-rc. One got lost in the shuffle,
and the other was reported just this morning"
- Close race window when handling FREE_STATEID operations
- Fix regression in /proc/fs/nfsd/v4_end_grace introduced in v6.5-rc"
* tag 'nfsd-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix a thinko introduced by recent trace point changes
nfsd: Fix race to FREE_STATEID and cl_revoked
The fixed commit erroneously removed a call to nfsd_end_grace(),
which makes calls to write_v4_end_grace() a no-op.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202308241229.68396422-oliver.sang@intel.com
Fixes: 39d432fc76 ("NFSD: trace nfsctl operations")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Highlights include:
Stable fixes
- NFS: Fix a use after free in nfs_direct_join_group()
Bugfixes
- NFS: Fix a sysfs server name memory leak
- NFS: Fix a lock recovery hang in NFSv4.0
- NFS: Fix page free in the error path for nfs42_proc_getxattr
- NFS: Fix page free in the error path for __nfs4_get_acl_uncached
- SUNRPC/rdma: Fix receive buffer dma-mapping after a server disconnect
Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- fix a use after free in nfs_direct_join_group() (Cc: stable)
- fix sysfs server name memory leak
- fix lock recovery hang in NFSv4.0
- fix page free in the error path for nfs42_proc_getxattr() and
__nfs4_get_acl_uncached()
- SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect
* tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
xprtrdma: Remap Receive buffers after a reconnect
NFSv4: fix out path in __nfs4_get_acl_uncached
NFSv4.2: fix error handling in nfs42_proc_getxattr
NFS: Fix sysfs server name memory leak
NFS: Fix a use after free in nfs_direct_join_group()
NFSv4: Fix dropped lock for racing OPEN and delegation return
A syzbot stress test reported that create_empty_buffers() called from
nilfs_lookup_dirty_data_buffers() can cause a general protection fault.
Analysis using its reproducer revealed that the back reference "mapping"
from a page/folio has been changed to NULL after dirty page/folio gang
lookup in nilfs_lookup_dirty_data_buffers().
Fix this issue by excluding pages/folios from being collected if, after
acquiring a lock on each page/folio, its back reference "mapping" differs
from the pointer to the address space struct that held the page/folio.
Link: https://lkml.kernel.org/r/20230805132038.6435-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+0ad741797f4565e7e2d2@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000002930a705fc32b231@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
walk_page_range() and friends often operate under write-locked mmap_lock.
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas. Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.
The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks. With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them. The change ensures vmas are properly locked during such
walks.
A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration. Without this change, a page can be
concurrently faulted into the area and be left out of the migration.
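A hedged sketch of what declaring the locking requirement can look like for
one walker (the field and enum names follow the changelog's description and
are illustrative, as are the callback names):

	static const struct mm_walk_ops queue_pages_walk_ops = {
		.pmd_entry	= queue_folios_pte_range,
		.test_walk	= queue_pages_test_walk,
		/* the page walker write-locks each vma before the callbacks run */
		.walk_lock	= PGWALK_WRLOCK,
	};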
Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We shouldn't be using a GUP-internal helper if it can be avoided.
Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
vm_normal_page_pmd() that similarly refuses to return the huge zeropage.
In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():
(1) Will always return the head page, not a tail page of a THP.
If we'd ever call smaps_account with a tail page while setting "compound
= true", we could be in trouble, because smaps_account() would look at
the memmap of unrelated pages.
If we're unlucky, that memmap does not exist at all. Before we removed
PG_doublemap, we could have triggered something similar as in
commit 24d7275ce2 ("fs/proc: task_mmu.c: don't read mapcount for
migration entry").
This can theoretically happen ever since commit ff9f47f6f0 ("mm: proc:
smaps_rollup: do not stall write attempts on mmap_lock"):
(a) We're in show_smaps_rollup() and processed a VMA
(b) We release the mmap lock in show_smaps_rollup() because it is
contended
(c) We merged that VMA with another VMA
(d) We collapsed a THP in that merged VMA at that position
If the end address of the original VMA falls into the middle of a THP
area, we would call smap_gather_stats() with a start address that falls
into a PMD-mapped THP. It's probably very rare to trigger when not
really forced.
(2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()
Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
If such pages would be anonymous, we most certainly would want to
account them.
(3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()
As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
entries.
Especially, follow_pmd_mask() never ends up calling
follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
is set.
So skipping pmd_devmap() pages seems to be the right thing to do.
(4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()
We won't be returning a memmap that should be ignored by core-mm, or
worse, a memmap that does not even exist. Note that while
walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.
Most probably this case doesn't currently really happen on the PMD level,
otherwise we'd already be able to trigger kernel crashes when reading
smaps / smaps_rollup.
So most probably only (1) is relevant in practice as of now, but could only
cause trouble in extreme corner cases.
Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future
reuse in wrong context.
Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com
Fixes: ff9f47f6f0 ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix infinite loop in readdir(), could happen in a big directory when
files get renamed during enumeration
- fix extent map handling of skipped pinned ranges
- fix a corner case when handling ordered extent length
- fix a potential crash when balance cancel races with pause
- verify correct uuid when starting scrub or device replace
* tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix incorrect splitting in btrfs_drop_extent_map_range
btrfs: fix BUG_ON condition in btrfs_cancel_balance
btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
btrfs: fix replace/scrub failure with metadata_uuid
btrfs: fix infinite directory reads
Another highly rare error case where a page-allocating loop (inside
__nfs4_get_acl_uncached, this time) is not properly unwound on error.
Since the pages array is allocated uninitialized, only the lower array
indices need to be freed. NULL checks were useful before commit 62a1573fcf
("NFSv4 fix acl retrieval over krb5i/krb5p mounts"), when the array had
been initialized to zero on the stack.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 62a1573fcf ("NFSv4 fix acl retrieval over krb5i/krb5p mounts")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
There is a slight issue with the error handling code inside
nfs42_proc_getxattr(). If the page-allocating loop fails, then we free the
failing page array element, which is NULL, but __free_page() can't deal
with NULL arguments.
Found by Linux Verification Center (linuxtesting.org).
Fixes: a1f26739cc ("NFSv4.2: improve page handling for GETXATTR")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Free the formatted server index string after it has been duplicated by
kobject_rename().
Fixes: 1c7251187d ("NFS: add superblock sysfs entries")
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In production we were seeing a variety of WARN_ON()'s in the extent_map
code, specifically in btrfs_drop_extent_map_range() when we have to call
add_extent_mapping() for our second split.
Consider the following extent map layout
PINNED
[0 16K) [32K, 48K)
and then we call btrfs_drop_extent_map_range for [0, 36K), with
skip_pinned == true. The initial loop will have
start = 0
end = 36K
len = 36K
we will find the [0, 16k) extent, but since we are pinned we will skip
it, which has this code
start = em_end;
if (end != (u64)-1)
len = start + len - em_end;
em_end here is 16K, so now the values are
start = 16K
len = 16K + 36K - 16K = 36K
len should instead be 20K. This is a problem when we find the next
extent at [32K, 48K), we need to split this extent to leave [36K, 48k),
however the code for the split looks like this
split->start = start + len;
split->len = em_end - (start + len);
In this case we have
em_end = 48K
split->start = 16K + 36K // this should be 16K + 20K
split->len = 48K - (16K + 36K) // this overflows as 16K + 36K is 52K
and now we have an invalid extent_map in the tree that potentially
overlaps other entries in the extent map. Even in the non-overlapping
case we will have split->start set improperly, which will cause problems
with any block related calculations.
We don't actually need len in this loop, we can simply use end as our
end point, and only adjust start up when we find a pinned extent we need
to skip.
Adjust the logic to do this, which keeps us from inserting an invalid
extent map.
We only skip_pinned in the relocation case, so this is relatively rare,
except in the case where you are running relocation a lot, which can
happen with auto relocation on.
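A hedged, heavily simplified sketch of the skip-pinned branch after the
change (the real loop in btrfs_drop_extent_map_range() carries much more
state; the point is that only 'start' advances and 'end' stays what the
caller passed in):

	if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
		start = em->start + em->len;	/* only 'start' moves forward */
		goto next;			/* no 'len' bookkeeping at all */
	}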
Fixes: 55ef689900 ("Btrfs: Fix btrfs_drop_extent_cache for skip pinned case")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Be more careful when tearing down the subrequests of an O_DIRECT write
as part of a retransmission.
Reported-by: Chris Mason <clm@fb.com>
Fixes: ed5d588fe4 ("NFS: Try to join page groups before an O_DIRECT retransmission")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pausing and canceling a balance can race when interrupting the balance,
leading to a BUG_ON panic in btrfs_cancel_balance(). The BUG_ON condition
in btrfs_cancel_balance() does not take this race scenario into account.
However, the race condition has no other side effects, so we can fix that.
Reproducing it gives a panic trace like this:
kernel BUG at fs/btrfs/volumes.c:4618!
RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0
Call Trace:
<TASK>
? do_nanosleep+0x60/0x120
? hrtimer_nanosleep+0xb7/0x1a0
? sched_core_clone_cookie+0x70/0x70
btrfs_ioctl_balance_ctl+0x55/0x70
btrfs_ioctl+0xa46/0xd20
__x64_sys_ioctl+0x7d/0xa0
do_syscall_64+0x38/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Race scenario as follows:
> mutex_unlock(&fs_info->balance_mutex);
> --------------------
> .......issue pause and cancel req in another thread
> --------------------
> ret = __btrfs_balance(fs_info);
>
> mutex_lock(&fs_info->balance_mutex);
> if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) {
> btrfs_info(fs_info, "balance: paused");
> btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED);
> }
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.
Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:
- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?
The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.
This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.
The code has been cleaned up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think commit
24e6c80822 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.
The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.
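A hedged sketch of the guard (names follow the commit text; surrounding
submit logic omitted):

	if (bio_ctrl->len_to_oe_boundary != U32_MAX)
		bio_ctrl->len_to_oe_boundary -= len;	/* only when a real boundary is set */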
Sample reproducer, note you'll need to change the path to the bdi and
device:
SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs
btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync
echo 4 > /proc/sys/vm/drop_caches
echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
while true; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done
Fixes: 24e6c80822 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fstests with POST_MKFS_CMD="btrfstune -m" (as in the mailing list)
reported a few of the test cases failing.
The failure scenario can be summarized and simplified as follows:
$ mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb1 /dev/sdb2 :0
$ btrfstune -m /dev/sdb1 :0
$ wipefs -a /dev/sdb1 :0
$ mount -o degraded /dev/sdb2 /btrfs :0
$ btrfs replace start -B -f -r 1 /dev/sdb1 /btrfs :1
STDERR:
ERROR: ioctl(DEV_REPLACE_START) failed on "/btrfs": Input/output error
[11290.583502] BTRFS warning (device sdb2): tree block 22036480 mirror 2 has bad fsid, has 99835c32-49f0-4668-9e66-dc277a96b4a6 want da40350c-33ac-4872-92a8-4948ed8c04d0
[11290.586580] BTRFS error (device sdb2): unable to fix up (regular) error at logical 22020096 on dev /dev/sdb8 physical 1048576
As above, the replace is failing because we are verifying the header with
fs_devices::fsid instead of fs_devices::metadata_uuid, despite the
metadata_uuid actually being present.
To fix this, use fs_devices::metadata_uuid. We copy fsid into
fs_devices::metadata_uuid if there is no metadata_uuid, so this is fine.
Fixes: a3ddbaebc7 ("btrfs: scrub: introduce a helper to verify one metadata block")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit abdb1742a3 removed code that clears ctx->username when sec=none, so attempting
to mount with '-o sec=none' now fails with -EACCES. Fix it by adding that logic to the
parsing of the 'sec' option, as well as checking if the mount is using null auth before
setting the username when parsing the 'user' option.
Fixes: abdb1742a3 ("cifs: get rid of mount options string parsing")
Cc: stable@vger.kernel.org
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Three smb client fixes, all for stable:
- fix for oops in unmount race with lease break of deferred close
- debugging improvement for reconnect
- fix for fscache deadlock (folio_wait_bit_common hang)"
* tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: display network namespace in debug information
cifs: Release folio lock on fscache read hit.
cifs: fix potential oops in cifs_oplock_break
The readdir implementation currently always processes up to the last index
it finds. This however can result in an infinite loop if the directory has
a large number of entries, such that they won't all fit in the buffer
passed to the readdir callback, that is, dir_emit() returns a non-zero
value. In that case readdir() will be called again, and if in the meanwhile
new directory entries were added and we still can't put all the remaining
entries in the buffer, we keep repeating this over and over.
The following C program and test script reproduce the problem:
$ cat /mnt/readdir_prog.c
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
DIR *dir = opendir(".");
struct dirent *dd;
while ((dd = readdir(dir))) {
printf("%s\n", dd->d_name);
rename(dd->d_name, "TEMPFILE");
rename("TEMPFILE", dd->d_name);
}
closedir(dir);
}
$ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV &> /dev/null
#mkfs.xfs -f $DEV &> /dev/null
#mkfs.ext4 -F $DEV &> /dev/null
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= 2000; i++)); do
echo -n > $MNT/testdir/file_$i
done
cd $MNT/testdir
/mnt/readdir_prog
cd /mnt
umount $MNT
This behaviour is surprising to applications and it's unlike ext4, xfs,
tmpfs, vfat and other filesystems, which always finish. In this case where
new entries were added due to renames, some file names may be reported
more than once, but this varies according to each filesystem - for example
ext4 never reported the same file more than once while xfs reports the
first 13 file names twice.
So change our readdir implementation to track the last index number when
opendir() is called and then make readdir() never process beyond that
index number. This gives the same behaviour as ext4.
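A hedged sketch of the idea (the helper and field names here are
illustrative, not necessarily the ones used in the patch):

	/* at opendir() time: remember where the directory currently ends */
	private->last_index = last_index_of_dir(dir);

	/* in readdir(): never emit entries created after opendir() */
	if (found_key.offset > private->last_index)
		break;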
Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recently had problems where a network namespace was deleted, causing
hard-to-debug reconnect problems. To help deal with configuration issues
like this, it is useful to dump the network namespace to better debug what
happened.
So add this to information displayed in /proc/fs/cifs/DebugData for
the server (and channels if mounted with multichannel). For example:
Local Users To Server: 1 SecMode: 0x1 Req On Wire: 0 Net namespace: 4026531840
This can be easily compared with what is displayed for the
processes on the system. For example /proc/1/ns/net in this case
showed the same thing (see below), and we can see that the namespace
is still valid in this example.
'net:[4026531840]'
Cc: stable@vger.kernel.org
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Under the current code, when cifs_readpage_worker is called, the call
contract is that the callee should unlock the page. This is documented
in the read_folio section of Documentation/filesystems/vfs.rst as:
> The filesystem should unlock the folio once the read has completed,
> whether it was successful or not.
Without this change, when fscache is in use and a cache hit occurs during
a read, the page lock is leaked, producing the following stack on
subsequent reads (via mmap) of the page:
$ cat /proc/3890/task/12864/stack
[<0>] folio_wait_bit_common+0x124/0x350
[<0>] filemap_read_folio+0xad/0xf0
[<0>] filemap_fault+0x8b1/0xab0
[<0>] __do_fault+0x39/0x150
[<0>] do_fault+0x25c/0x3e0
[<0>] __handle_mm_fault+0x6ca/0xc70
[<0>] handle_mm_fault+0xe9/0x350
[<0>] do_user_addr_fault+0x225/0x6c0
[<0>] exc_page_fault+0x84/0x1b0
[<0>] asm_exc_page_fault+0x27/0x30
This requires a reboot to resolve; it is a deadlock.
Note, however, that the call to cifs_readpage_from_fscache does mark the
page clean, but does not release the folio lock. This happens in
__cifs_readpage_from_fscache on success. Releasing the lock at that
point however is not appropriate as cifs_readahead also calls
cifs_readpage_from_fscache and *does* unconditionally release the lock
after its return. This change therefore effectively makes
cifs_readpage_worker work like cifs_readahead.
Signed-off-by: Russell Harmon <russ@har.mn>
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'for-6.5-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"More fixes, some of them going back to older releases and there are
fixes for hangs in stress tests regarding space caching:
- fixes and progress tracking for hangs in free space caching, found
by test generic/475
- writeback fixes, write pages in integrity mode and skip writing
pages that have been written meanwhile
- properly clear end of extent range after an error
- relocation fixes:
- fix race betwen qgroup tree creation and relocation
- detect and report invalid reloc roots"
* tag 'for-6.5-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: set cache_block_group_error if we find an error
btrfs: reject invalid reloc tree root keys with stack dump
btrfs: exit gracefully if reloc roots don't match
btrfs: avoid race between qgroup tree creation and relocation
btrfs: properly clear end of the unreserved range in cow_file_range
btrfs: don't wait for writeback on clean pages in extent_write_cache_pages
btrfs: don't stop integrity writeback too early
btrfs: wait for actual caching progress during allocation
The only remaining consumer is new_inode, where it showed up in 2001 as
commit c37fa164f793 ("v2.4.9.9 -> v2.4.9.10") in a historical repo [1]
with a changelog which does not mention it.
Since then the line got only touched up to keep compiling.
While it may have been of benefit back in the day, it is guaranteed to
at best not get in the way in the multicore setting -- as the code
performs *a lot* of work between the prefetch and actual lock acquire,
any contention means the cacheline is already invalid by the time the
routine calls spin_lock(). It adds spurious traffic, for short.
On top of it prefetch is notoriously tricky to use for single-threaded
purposes, making it questionable from the get go.
As such, remove it.
I admit upfront I did not see value in benchmarking this change, but I
can do it if that is deemed appropriate.
Removal from new_inode and of the entire thing are in the same patch as
requested by Linus, so whatever weird looks can be directed at that guy.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/fs/inode.c?id=c37fa164f793735b32aa3f53154ff1a7659e6442 [1]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- The switch to using iomap for executing direct synchronous write to
sequential files using zone append BIO overlooked cases where the BIO
built by iomap is too large and needs splitting, which is not allowed
with zone append. Fix this by using regular write commands instead.
The use of zone append commands will be reintroduced later with
proper support from iomap.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZNbIugAKCRDdoc3SxdoY
dgGZAQCJWVBGXP/FAB4o7ifuqZ9xDjcf0RyrYrcS1N+kV1REqgD/ZnHXxAbNJNtx
A3A5W7bFQJogNp/gWR7K7/jt1cRNwgo=
=nlBY
-----END PGP SIGNATURE-----
Merge tag 'zonefs-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
- The switch to using iomap for executing a direct synchronous write to
sequential files using a zone append BIO overlooked cases where the
BIO built by iomap is too large and needs splitting, which is not
allowed with zone append.
Fix this by using regular write commands instead. The use of zone
append commands will be reintroduced later with proper support from
iomap.
* tag 'zonefs-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: fix synchronous direct writes to sequential files
issues, or are not considered suitable for -stable backporting.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZNad/gAKCRDdBJ7gKXxA
jmw6AP9u6k8XcS8ec3/u0IUEuh7ckHx5Vvjfmo5YgWlIJDeWegD9G2fh3ZJgcjMO
jMssklfXmP+QSijCIxUva1TlzwtPDAQ=
=MqiN
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-08-11-13-44' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"14 hotfixes. 11 of these are cc:stable and the remainder address
post-6.4 issues, or are not considered suitable for -stable
backporting"
* tag 'mm-hotfixes-stable-2023-08-11-13-44' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/damon/core: initialize damo_filter->list from damos_new_filter()
nilfs2: fix use-after-free of nilfs_root in dirtying inodes via iput
selftests: cgroup: fix test_kmem_basic false positives
fs/proc/kcore: reinstate bounce buffer for KCORE_TEXT regions
MAINTAINERS: add maple tree mailing list
mm: compaction: fix endless looping over same migrate block
selftests: mm: ksm: fix incorrect evaluation of parameter
hugetlb: do not clear hugetlb dtor until allocating vmemmap
mm: memory-failure: avoid false hwpoison page mapped error info
mm: memory-failure: fix potential unexpected return value from unpoison_memory()
mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page
radix tree test suite: fix incorrect allocation size for pthreads
crypto, cifs: fix error handling in extract_iter_to_sg()
zsmalloc: fix races between modifications of fullness and isolated
With deferred close we can have closes that race with lease breaks,
and so with the current checks for whether to send the lease response,
oplock_response(), this can mean that an unmount (kill_sb) can occur
just before we were checking if the tcon->ses is valid. See below:
[Fri Aug 4 04:12:50 2023] RIP: 0010:cifs_oplock_break+0x1f7/0x5b0 [cifs]
[Fri Aug 4 04:12:50 2023] Code: 7d a8 48 8b 7d c0 c0 e9 02 48 89 45 b8 41 89 cf e8 3e f5 ff ff 4c 89 f7 41 83 e7 01 e8 82 b3 03 f2 49 8b 45 50 48 85 c0 74 5e <48> 83 78 60 00 74 57 45 84 ff 75 52 48 8b 43 98 48 83 eb 68 48 39
[Fri Aug 4 04:12:50 2023] RSP: 0018:ffffb30607ddbdf8 EFLAGS: 00010206
[Fri Aug 4 04:12:50 2023] RAX: 632d223d32612022 RBX: ffff97136944b1e0 RCX: 0000000080100009
[Fri Aug 4 04:12:50 2023] RDX: 0000000000000001 RSI: 0000000080100009 RDI: ffff97136944b188
[Fri Aug 4 04:12:50 2023] RBP: ffffb30607ddbe58 R08: 0000000000000001 R09: ffffffffc08e0900
[Fri Aug 4 04:12:50 2023] R10: 0000000000000001 R11: 000000000000000f R12: ffff97136944b138
[Fri Aug 4 04:12:50 2023] R13: ffff97149147c000 R14: ffff97136944b188 R15: 0000000000000000
[Fri Aug 4 04:12:50 2023] FS: 0000000000000000(0000) GS:ffff9714f7c00000(0000) knlGS:0000000000000000
[Fri Aug 4 04:12:50 2023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Aug 4 04:12:50 2023] CR2: 00007fd8de9c7590 CR3: 000000011228e000 CR4: 0000000000350ef0
[Fri Aug 4 04:12:50 2023] Call Trace:
[Fri Aug 4 04:12:50 2023] <TASK>
[Fri Aug 4 04:12:50 2023] process_one_work+0x225/0x3d0
[Fri Aug 4 04:12:50 2023] worker_thread+0x4d/0x3e0
[Fri Aug 4 04:12:50 2023] ? process_one_work+0x3d0/0x3d0
[Fri Aug 4 04:12:50 2023] kthread+0x12a/0x150
[Fri Aug 4 04:12:50 2023] ? set_kthread_struct+0x50/0x50
[Fri Aug 4 04:12:50 2023] ret_from_fork+0x22/0x30
[Fri Aug 4 04:12:50 2023] </TASK>
To fix this, change the ordering of the checks before sending the oplock
response so that we first check if the openFileList is empty.
Fixes: da787d5b74 ("SMB3: Do not send lease break acknowledgment if all file handles have been closed")
Suggested-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We set cache_block_group_error if btrfs_cache_block_group() returns an
error because we could end up not finding space to allocate and
mistakenly return -ENOSPC, which could then abort the transaction with
the incorrect errno and, in the case of ENOSPC, result in a WARN_ON()
that will trip up tests like generic/475.
However there's the case where multiple threads can be racing: one
thread gets the proper error, while the other thread doesn't actually
call btrfs_cache_block_group() and instead sees ->cached ==
BTRFS_CACHE_ERROR. Again the result is the same, we fail to allocate
our space and return -ENOSPC. Instead we need to set
cache_block_group_error to -EIO in this case to make sure that if we do
not make our allocation we get the appropriate error returned back to
the caller.
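A minimal sketch of the intended error propagation, with hypothetical names
and a simplified cache-state enum (not the actual find_free_extent() code):

    /* Illustrative sketch of the error propagation, not the btrfs code. */
    #include <errno.h>

    enum cache_state { CACHE_STARTED, CACHE_FINISHED, CACHE_ERROR };

    struct block_group_sketch { enum cache_state cached; };

    /*
     * Racing allocator thread: it never called the caching function
     * itself, so without recording -EIO here the eventual allocation
     * failure would be reported as -ENOSPC.
     */
    static int note_caching_error(const struct block_group_sketch *bg,
                                  int *cache_block_group_error)
    {
        if (bg->cached == CACHE_ERROR) {
            *cache_block_group_error = -EIO;
            return -EIO;
        }
        return 0;
    }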
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
That ASSERT() makes sure the reloc tree is properly pointed back by its
subvolume tree.
[CAUSE]
After more debugging output, it turns out we had an invalid reloc tree:
BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17
Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM,
QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree.
But reloc trees can only exist for subvolumes, as for non-subvolume
trees, we just COW the involved tree block, no need to create a reloc
tree since those tree blocks won't be shared with other trees.
Only subvolume trees can share tree blocks with other trees (thus they
have BTRFS_ROOT_SHAREABLE flag).
Thus this new debug output proves my previous assumption that corrupted
on-disk data can trigger that ASSERT().
[FIX]
Besides the dedicated fix and the graceful exit, also let the tree-checker
check such root keys, to make sure reloc trees can only exist for subvolumes.
CC: stable@vger.kernel.org # 5.15+
Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
[CAUSE]
The root cause of the triggered ASSERT() is we can have a race between
quota tree creation and relocation.
This leads us to create a duplicated quota tree in the
btrfs_read_fs_root() path, and since it's treated as fs tree, it would
have ROOT_SHAREABLE flag, causing us to create a reloc tree for it.
The bug itself is fixed by a dedicated patch for it, but this already
taught us the ASSERT() is not something straightforward for
developers.
[ENHANCEMENT]
Instead of using an ASSERT(), let's handle it gracefully and output
extra info about the mismatch reloc roots to help debug.
With the above ASSERT() removed, we can later trigger ASSERT(0)s inside
merge_reloc_roots(), so also replace those ASSERT(0)s with WARN_ON()s.
CC: stable@vger.kernel.org # 5.15+
Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a weird ASSERT() triggered inside prepare_to_merge().
assertion failed: root->reloc_root == reloc_root, in fs/btrfs/relocation.c:1919
------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1919!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9904 Comm: syz-executor.3 Not tainted
6.4.0-syzkaller-08881-g533925cb7604 #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 05/27/2023
RIP: 0010:prepare_to_merge+0xbb2/0xc40 fs/btrfs/relocation.c:1919
Code: fe e9 f5 (...)
RSP: 0018:ffffc9000325f760 EFLAGS: 00010246
RAX: 000000000000004f RBX: ffff888075644030 RCX: 1481ccc522da5800
RDX: ffffc90005c09000 RSI: 00000000000364ca RDI: 00000000000364cb
RBP: ffffc9000325f870 R08: ffffffff816f33ac R09: 1ffff9200064bea0
R10: dffffc0000000000 R11: fffff5200064bea1 R12: ffff888075644000
R13: ffff88803b166000 R14: ffff88803b166560 R15: ffff88803b166558
FS: 00007f4e305fd700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056080679c000 CR3: 00000000193ad000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
relocate_block_group+0xa5d/0xcd0 fs/btrfs/relocation.c:3749
btrfs_relocate_block_group+0x7ab/0xd70 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3283
__btrfs_balance+0x1b06/0x2690 fs/btrfs/volumes.c:4018
btrfs_balance+0xbdb/0x1120 fs/btrfs/volumes.c:4402
btrfs_ioctl_balance+0x496/0x7c0 fs/btrfs/ioctl.c:3604
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f4e2f88c389
[CAUSE]
With extra debugging, the offending reloc_root is for quota tree (rootid 8).
Normally we should not use the reloc tree for quota root at all, as reloc
trees are only for subvolume trees.
But there is a race between quota enabling and relocation, which happens
after commit 85724171b3 ("btrfs: fix the btrfs_get_global_root return value").
Before that commit, for the quota and free space trees, we exited immediately
if we could not grab them from fs_info.
But now we would try to read them from disk, just as if they were fs trees,
which sets the ROOT_SHAREABLE flag in such a race:
Thread A | Thread B
---------------------------------+------------------------------
btrfs_quota_enable() |
| | btrfs_get_root_ref()
| | |- btrfs_get_global_root()
| | | Returned NULL
| | |- btrfs_lookup_fs_root()
| | | Returned NULL
|- btrfs_create_tree() | |
| Now quota root item is | |
| inserted | |- btrfs_read_tree_root()
| | | Got the newly inserted quota root
| | |- btrfs_init_fs_root()
| | | Set ROOT_SHAREABLE flag
[FIX]
Get back to the old behavior by returning ERR_PTR(-ENOENT) if the target
objectid is not a subvolume tree or data reloc tree.
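A simplified sketch of the added guard; the objectid constants mirror the
btrfs on-disk values, but the function name and boolean return convention
are illustrative only (the real lookup path bails out with ERR_PTR(-ENOENT)
for anything else instead of reading it from disk as an fs tree):

    /* Illustrative sketch, not the actual btrfs_get_root_ref() code. */
    #include <stdbool.h>

    #define FIRST_FREE_OBJECTID        256ULL
    #define LAST_FREE_OBJECTID         (-256ULL)
    #define DATA_RELOC_TREE_OBJECTID   (-9ULL)

    static bool objectid_is_fs_or_data_reloc_tree(unsigned long long objectid)
    {
        /* Only subvolume trees and the data reloc tree are shareable;
         * global trees such as the quota tree must never be read here. */
        return (objectid >= FIRST_FREE_OBJECTID &&
                objectid <= LAST_FREE_OBJECTID) ||
               objectid == DATA_RELOC_TREE_OBJECTID;
    }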
Reported-and-tested-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com
Fixes: 85724171b3 ("btrfs: fix the btrfs_get_global_root return value")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the call to btrfs_reloc_clone_csums in cow_file_range returns an
error, we jump to the out_unlock label with the extent_reserved variable
set to false. The cleanup at the label will then call
extent_clear_unlock_delalloc on the range from start to end. But we've
already added cur_alloc_size to start before the jump, so there might be no
range left from the newly incremented start to end. Move the check for
'start < end' so that it is also reached for the !extent_reserved case.
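A minimal sketch of the cleanup ordering after the move; names and helpers
are hypothetical and the range arithmetic is simplified from the real
cow_file_range() error path:

    /* Illustrative sketch of the cleanup ordering, not the literal code. */
    #include <stdbool.h>

    typedef unsigned long long u64;

    static void clear_unlock_range(u64 start, u64 end) { /* stand-in */ }

    static void cleanup_after_error(u64 start, u64 end, u64 cur_alloc_size,
                                    bool extent_reserved)
    {
        if (extent_reserved) {
            /* Handle the already-reserved extent separately... */
            clear_unlock_range(start, start + cur_alloc_size - 1);
            start += cur_alloc_size;
        }

        /*
         * 'start' may already have been advanced past 'end' before the
         * error jump, so check this for both the extent_reserved and
         * !extent_reserved cases before touching the remaining range.
         */
        if (start < end)
            clear_unlock_range(start, end);
    }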
CC: stable@vger.kernel.org # 6.1+
Fixes: a315e68f6e ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage could have started on more pages than the one it was
called for. This happens regularly for zoned file systems, and in theory
could happen for compressed I/O if the worker thread was executed very
quickly. For such pages extent_write_cache_pages waits for writeback
to complete before moving on to the next page, which is highly inefficient
as it blocks the flusher thread.
Port over the PageDirty check that was added to write_cache_pages in
commit 515f4a037f ("mm: write_cache_pages optimise page cleaning") to
fix this.
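A sketch of the check being ported, simplified from the write_cache_pages()
loop body rather than copied from the final btrfs code:

    /* Sketch of the ported check; simplified, not the literal
     * extent_write_cache_pages() code. */
    #include <linux/pagemap.h>
    #include <linux/page-flags.h>

    static bool should_wait_for_writeback(struct page *page)
    {
        /*
         * If the page is no longer dirty, someone else already wrote it
         * back (common for zoned filesystems, possible for compressed
         * I/O). Waiting for its writeback would only stall the flusher
         * thread, so skip it instead.
         */
        if (!PageDirty(page))
            return false;

        return PageWriteback(page);
    }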
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_write_cache_pages stops writing pages as soon as nr_to_write hits
zero. That is the right thing for opportunistic writeback, but incorrect
for data integrity writeback, which needs to ensure that no dirty pages
are left in the range. Thus only stop the writeback for WB_SYNC_NONE
if nr_to_write hits 0.
This is a port of write_cache_pages changes in commit 05fe478dd0
("mm: write_cache_pages integrity fix").
Note that I've only triggered the problem with other changes to the btrfs
writeback code, but this condition seems worth fixing anyway.
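A minimal sketch of the termination condition, mirroring the
write_cache_pages() integrity fix rather than quoting the btrfs code
verbatim:

    /* Sketch of the termination condition, not the literal btrfs code. */
    #include <linux/writeback.h>

    static bool should_stop_writeback(const struct writeback_control *wbc)
    {
        /*
         * For data integrity (WB_SYNC_ALL) we must keep going until the
         * whole range is clean; only opportunistic WB_SYNC_NONE writeback
         * is allowed to bail out once nr_to_write is exhausted.
         */
        return wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE;
    }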
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to be >= the amount of space we
want to allocate, but also for some progress to be made in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
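A simplified sketch of the new wait condition; the field and helper names
are approximations of what the commit describes, not the exact btrfs
implementation:

    /* Illustrative sketch of the wait condition. */
    #include <stdbool.h>

    struct caching_sketch {
        unsigned long long free_space;  /* bytes found free so far */
        unsigned long long progress;    /* advances as the caching thread scans */
    };

    static bool done_waiting(const struct caching_sketch *c,
                             unsigned long long want_bytes,
                             unsigned long long progress_at_start)
    {
        /*
         * "Enough free space" alone is not sufficient, as it may not be
         * contiguous. Also require the caching thread to have advanced
         * since we started waiting, so we really sleep (and flush our
         * block plug) when caching is stuck, e.g. because the underlying
         * device has been suspended.
         */
        return c->free_space >= want_bytes &&
               c->progress != progress_at_start;
    }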
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmTTuzUACgkQiiy9cAdy
T1G4iQv/XOpGmFtVLO/JW/BGWZr38BkpSFsv+ZLzu0srd1hE/BU8AskdxU6joRMF
EpMhQi9M9FTeeTm1EVg9Osn9lYwdXMKmguM5jqqjXkYBZy0QBmff+8xIdhAJxztu
mkrJ7ARvnyqavAkIR4dY9xqcD2dmxZg7YDnCfUwO7pPmaMf6QE4Ha34U6C/68utf
EnQRG8P8E9t0AvZp6KHQdlVQIke7rYqWSK4lxRvIUSS+iD70AavLL3RToCpDNaVk
gaBxXhSmJwGkPONBNxrHMyNOeH+RiZ942haOQA8HMQE2OPZqtIBU/8/zAysiQsFA
PItY6wHM/2ONDRu3RWbkgWRl5JFB6Nw9ncvwDqq4/xsAL9KJYa3Jk9OjEksPJ4K3
5jUv109HiBPBSQGCEyhcsqneTgHBMmFLpoGEDUrtKDywhbI/uOTjoAjWJqeT9ROq
qPz4QVeiTq4LM288SFPBGS9knuS/ppoC9syVcHlrzvPIy3gw0Vv2IOkpSAwBcm4k
eeSA7oNK
=gGNY
-----END PGP SIGNATURE-----
Merge tag '6.5-rc5-ksmbd-server' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Two ksmbd server fixes, both also for stable:
- improve buffer validation when multiple EAs returned
- missing check for command payload size"
* tag '6.5-rc5-ksmbd-server' of git://git.samba.org/ksmbd:
ksmbd: fix wrong next length validation of ea buffer in smb2_set_ea()
ksmbd: validate command request size
Commit 16d7fd3cfa ("zonefs: use iomap for synchronous direct writes")
changes zonefs code from a self-built zone append BIO to using iomap for
synchronous direct writes. This change relies on the iomap submit BIO
callback to change the write BIO built by iomap into a zone append BIO.
However, this change overlooked the fact that a write BIO may be very
large and may need to be split when issued. The change from a regular write
to a zone append operation for the built BIO can result in a block layer
warning as zone append BIOs are not allowed to be split.
WARNING: CPU: 18 PID: 202210 at block/bio.c:1644 bio_split+0x288/0x350
Call Trace:
? __warn+0xc9/0x2b0
? bio_split+0x288/0x350
? report_bug+0x2e6/0x390
? handle_bug+0x41/0x80
? exc_invalid_op+0x13/0x40
? asm_exc_invalid_op+0x16/0x20
? bio_split+0x288/0x350
bio_split_rw+0x4bc/0x810
? __pfx_bio_split_rw+0x10/0x10
? lockdep_unlock+0xf2/0x250
__bio_split_to_limits+0x1d8/0x900
blk_mq_submit_bio+0x1cf/0x18a0
? __pfx_iov_iter_extract_pages+0x10/0x10
? __pfx_blk_mq_submit_bio+0x10/0x10
? find_held_lock+0x2d/0x110
? lock_release+0x362/0x620
? mark_held_locks+0x9e/0xe0
__submit_bio+0x1ea/0x290
? __pfx___submit_bio+0x10/0x10
? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
submit_bio_noacct_nocheck+0x675/0xa20
? __pfx_bio_iov_iter_get_pages+0x10/0x10
? __pfx_submit_bio_noacct_nocheck+0x10/0x10
iomap_dio_bio_iter+0x624/0x1280
__iomap_dio_rw+0xa22/0x18a0
? lock_is_held_type+0xe3/0x140
? __pfx___iomap_dio_rw+0x10/0x10
? lock_release+0x362/0x620
? zonefs_file_write_iter+0x74c/0xc80 [zonefs]
? down_write+0x13d/0x1e0
iomap_dio_rw+0xe/0x40
zonefs_file_write_iter+0x5ea/0xc80 [zonefs]
do_iter_readv_writev+0x18b/0x2c0
? __pfx_do_iter_readv_writev+0x10/0x10
? inode_security+0x54/0xf0
do_iter_write+0x13b/0x7c0
? lock_is_held_type+0xe3/0x140
vfs_writev+0x185/0x550
? __pfx_vfs_writev+0x10/0x10
? __handle_mm_fault+0x9bd/0x1c90
? find_held_lock+0x2d/0x110
? lock_release+0x362/0x620
? find_held_lock+0x2d/0x110
? lock_release+0x362/0x620
? __up_read+0x1ea/0x720
? do_pwritev+0x136/0x1f0
do_pwritev+0x136/0x1f0
? __pfx_do_pwritev+0x10/0x10
? syscall_enter_from_user_mode+0x22/0x90
? lockdep_hardirqs_on+0x7d/0x100
do_syscall_64+0x58/0x80
This error depends on the hardware used, specifically on the max zone
append bytes and max_[hw_]sectors limits. Tests using AMD Epyc machines
that have low limits did not reveal this issue while runs on Intel Xeon
machines with larger limits trigger it.
Manually splitting the zone append BIO using bio_split_rw() can solve
this issue but also requires issuing the fragment BIOs synchronously
with submit_bio_wait(), to avoid potential reordering of the zone append
BIO fragments, which would lead to data corruption. That is, this
solution is not better than using regular write BIOs which are subject
to serialization using zone write locking at the IO scheduler level.
Given this, fix the issue by removing zone append support and using
regular write BIOs for synchronous direct writes. This allows preserving
the use of iomap and having identical synchronous and asynchronous
sequential file write paths. Zone append support will be reintroduced
later through io_uring commands to ensure that the needed special
handling is done correctly.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 16d7fd3cfa ("zonefs: use iomap for synchronous direct writes")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
- Fix a freeze consistency check in gfs2_trans_add_meta().
- Don't use filemap_splice_read as it can cause deadlocks on gfs2.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmTSOikUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpggg//Q+1yil5lS+Egf5Z+gL1E0SgXge7k
CfEcrjSfkIL8LnVgZSpJD8I++nMdXJb533qkGOIwvAifqIP2ZnGrtK2T4cF6PgWs
iAtKLPJGtg+6HswgGMEpEnl7sSBo4DYE6EH9TCoht9N0nSfJCAVWP2brVxEmnacl
omIXQymQAAGilVB58tru0XqfvneeHCvsipEYrJ1if1VGQHUwwU5v3uZGiOh2/VB2
CU8qrA9kX3O+cXyHDED5Rja0pKkZlxogK6/OUPophTGSRDOJZsnT36OfSCAF+Dhl
TZ1J8pBircPk7nA5u7GRUk6u5HVc2idmOLx6FoRfgJ97tsDz5NodNxEtyJBY0g/O
mOD6IRqSLSNgTrH9AFff4BauilT+NOCyMoe69Dw6XHCMfxrQB4l9nEH1clQ+nuks
2jBcjkpEn71dvAJjM+YNU7a2HOmhz7w2zTidv1pN7SIspBRSOD9w5DIt855PNIwd
y7SkUsT3GVcrVaUd2mNAmM1PWn0Gu+V3tPJbXqfzCEOKLKyOtMkMndm+uHB7wwH1
25HaNV+Bj8vWbMTjF0KYIkksQ9TgboIzdeV6Q5DrIEyWsXH/pRJF07JVgGjU80/n
+pIwqj1yVkipPVZ8orKvoWucp+Q3qFN2CHJTdKjwzeHFM2+MlXktHNDdKbAEGRiw
irK7Mq41PwAMVgY=
=pPXf
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v6.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
- Fix a freeze consistency check in gfs2_trans_add_meta()
- Don't use filemap_splice_read as it can cause deadlocks on gfs2
* tag 'gfs2-v6.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Don't use filemap_splice_read
gfs2: Fix freeze consistency check in gfs2_trans_add_meta
Starting with patch 2cb1e08985, gfs2 started using the new function
filemap_splice_read rather than the old (and subsequently deleted)
function generic_file_splice_read.
filemap_splice_read works by taking references to a number of folios in
the page cache and splicing those folios into a pipe. The folios are
then read from the pipe and the folio references are dropped. This can
take an arbitrary amount of time. We cannot allow that in gfs2 because
those folio references will pin the inode glock to the node and prevent
it from being demoted, which can lead to cluster-wide deadlocks.
Instead, use copy_splice_read.
(In addition, the old generic_file_splice_read called into ->read_iter,
which called gfs2_file_read_iter, which took the inode glock during the
operation. The new filemap_splice_read interface does not take the
inode glock anymore. This is fixable, but it still wouldn't prevent
cluster-wide deadlocks.)
Fixes: 2cb1e08985 ("splice: Use filemap_splice_read() instead of generic_file_splice_read()")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_trans_add_meta() checks for the SDF_FROZEN flag to make
sure that no buffers are added to a transaction while the filesystem is
frozen. With the recent freeze/thaw rework, the SDF_FROZEN flag is
cleared after thaw_super() is called, which is sufficient for
serializing freeze/thaw.
However, other filesystem operations started after thaw_super() may now
be calling gfs2_trans_add_meta() before the SDF_FROZEN flag is cleared,
which will trigger the SDF_FROZEN check in gfs2_trans_add_meta(). Fix
that by checking the s_writers.frozen state instead.
In addition, make sure not to call gfs2_assert_withdraw() with the
sd_log_lock spin lock held. Check for a withdrawn filesystem before
checking for a frozen filesystem, and don't pin/add buffers to the
current transaction in case of a failure in either case.
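A simplified sketch of the check ordering described above; this is not the
literal gfs2_trans_add_meta() code and the "withdrawn" test is reduced to a
boolean for illustration:

    /* Illustrative sketch of the reordered checks. */
    #include <linux/fs.h>

    static bool may_add_to_transaction(struct super_block *sb, bool withdrawn)
    {
        /* Check for a withdrawn filesystem before the freeze check... */
        if (withdrawn)
            return false;

        /*
         * ...and test the superblock freeze state instead of the
         * SDF_FROZEN flag, which is only cleared after thaw_super()
         * returns.
         */
        if (sb->s_writers.frozen == SB_FREEZE_COMPLETE)
            return false;

        return true;
    }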
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZM+bcQAKCRCRxhvAZXjc
opWnAP9Ik49607Rn3OvFhWiYQp21nJ9NTs4lp5H30gMM3KhOxQEA9YAafIRH3rMs
zYjmEBwf4FCW9XQ4QgmktsW4Y7RqggE=
=DCbd
-----END PGP SIGNATURE-----
Merge tag 'v6.5-rc5.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix a wrong check for O_TMPFILE during RESOLVE_CACHED lookup
- Clean up directory iterators and clarify file_needs_f_pos_lock()
* tag 'v6.5-rc5.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: rely on ->iterate_shared to determine f_pos locking
vfs: get rid of old '->iterate' directory operation
proc: fix missing conversion to 'iterate_shared'
open: make RESOLVE_CACHED correctly test for O_TMPFILE
Now that we removed ->iterate we don't need to check for either
->iterate or ->iterate_shared in file_needs_f_pos_lock(). Simply check
for ->iterate_shared instead. This will tell us whether we need to
unconditionally take the lock. Not only does it allow us to avoid
checking f_inode's mode, it also clearly shows that we're
locking because of readdir.
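Roughly what the simplified helper looks like after this change; a sketch,
not necessarily the exact fs/file.c code:

    #include <linux/fs.h>

    static bool file_needs_f_pos_lock(struct file *file)
    {
        /* Lock f_pos only for files that need atomic position updates
         * and either are shared or do readdir via ->iterate_shared. */
        return (file->f_mode & FMODE_ATOMIC_POS) &&
               (file_count(file) > 1 || file->f_op->iterate_shared);
    }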
Signed-off-by: Christian Brauner <brauner@kernel.org>
All users now just use '->iterate_shared()', which only takes the
directory inode lock for reading.
Filesystems that never got converted to shared mode now instead use a
wrapper that drops the lock, re-takes it in write mode, calls the old
function, and then downgrades the lock back to read mode.
This way the VFS layer and other callers no longer need to care about
filesystems that never got converted to the modern era.
The filesystems that use the new wrapper are ceph, coda, exfat, jfs,
ntfs, ocfs2, overlayfs, and vboxsf.
Honestly, several of them look like they really could just iterate their
directories in shared mode and skip the wrapper entirely, but the point
of this change is to not change semantics or fix filesystems that
haven't been fixed in the last 7+ years, but to finally get rid of the
dual iterators.
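A sketch of the wrapper's shape; the mainline helper is
wrap_directory_iterator() in fs/readdir.c, and this is a simplified
approximation of it rather than a verbatim copy:

    #include <linux/fs.h>

    static int wrap_old_iterate(struct file *file, struct dir_context *ctx,
                                int (*iter)(struct file *, struct dir_context *))
    {
        struct inode *inode = file_inode(file);
        int ret = -ENOENT;

        /* Caller holds the inode lock shared; trade it for exclusive. */
        up_read(&inode->i_rwsem);
        down_write(&inode->i_rwsem);

        /* The directory may have been killed while the lock was dropped. */
        if (!IS_DEADDIR(inode))
            ret = iter(file, ctx);

        /* Return with the lock held shared again, as the caller expects. */
        downgrade_write(&inode->i_rwsem);
        return ret;
    }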
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
I'm looking at the directory handling due to the discussion about f_pos
locking (see commit 797964253d: "file: reinstate f_pos locking
optimization for regular files"), and wanting to clean that up.
And one source of ugliness is how we were supposed to move filesystems
over to the '->iterate_shared()' function that only takes the inode lock
for reading many many years ago, but several filesystems still use the
bad old '->iterate()' that takes the inode lock for exclusive access.
See commit 6192269444 ("introduce a parallel variant of ->iterate()")
that also added some documentation stating
Old method is only used if the new one is absent; eventually it will
be removed. Switch while you still can; the old one won't stay.
and that was back in April 2016. Here we are, many years later, and the
old version is still clearly sadly alive and well.
Now, some of those old style iterators are probably just because the
filesystem may end up having per-inode mutable data that it uses for
iterating a directory, but at least one case is just a mistake.
Al switched over most filesystems to use '->iterate_shared()' back when
it was introduced. In particular, the /proc filesystem was converted as
one of the first ones in commit f50752eaa0 ("switch all procfs
directories ->iterate_shared()").
But then later one new user of '->iterate()' was then re-introduced by
commit 6d9c939dbe ("procfs: add smack subdir to attrs").
And that's clearly not what we wanted, since that new case just uses the
same 'proc_pident_readdir()' and 'proc_pident_lookup()' helper functions
that other /proc pident directories use, and they are most definitely
safe to use with the inode lock held shared.
So just fix it.
This still leaves a fair number of oddball filesystems using the
old-style directory iterator (ceph, coda, exfat, jfs, ntfs, ocfs2,
overlayfs, and vboxsf), but at least we don't have any remaining in the
core filesystems.
I'm going to add a wrapper function that just drops the read-lock and
takes it as a write lock, so that we can clean up the core vfs layer and
make all the ugly 'this filesystem needs exclusive inode locking' be
just filesystem-internal warts.
I just didn't want to make that conversion when we still had a core user
left.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
O_TMPFILE is actually __O_TMPFILE|O_DIRECTORY. This means that the old
fast-path check for RESOLVE_CACHED would reject all users passing
O_DIRECTORY with -EAGAIN, when in fact the intended test was to check
for __O_TMPFILE.
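An illustrative sketch of the corrected fast-path test, simplified from the
openat2() RESOLVE_CACHED handling; the fallback __O_TMPFILE value is the x86
one and is given for illustration only:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <errno.h>

    #ifndef __O_TMPFILE
    #define __O_TMPFILE 020000000   /* x86 value, for illustration only */
    #endif

    static int resolve_cached_fastpath_ok(unsigned int flags)
    {
        /*
         * O_TMPFILE is really (__O_TMPFILE | O_DIRECTORY), so testing
         * O_TMPFILE here would wrongly bounce every O_DIRECTORY open with
         * -EAGAIN. Only creation/truncation/tmpfile opens need to skip
         * the cached fast path.
         */
        if (flags & (O_TRUNC | O_CREAT | __O_TMPFILE))
            return -EAGAIN;
        return 0;
    }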
Cc: stable@vger.kernel.org # v5.12+
Fixes: 99668f6180 ("fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230806-resolve_cached-o_tmpfile-v1-1-7ba16308465e@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There are multiple smb2_ea_info buffers in a FILE_FULL_EA_INFORMATION request
from the client. ksmbd finds the next smb2_ea_info using ->NextEntryOffset of
the current smb2_ea_info. ksmbd needs to validate the buffer length before
accessing the next EA. ksmbd should check the buffer length using buf_len,
not the next variable. next is the start offset of the current EA obtained
from the previous EA.
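A sketch of the walk with the buffer-length bound; the structure layout is
simplified from the SMB2 FILE_FULL_EA_INFORMATION wire format and this is
not the literal ksmbd code:

    #include <stddef.h>
    #include <stdint.h>

    struct ea_info_sketch {
        uint32_t next_entry_offset;
        uint8_t  flags;
        uint8_t  name_len;
        uint16_t value_len;
        /* name[name_len], a NUL byte and value[value_len] follow */
    };

    static int walk_ea_list(const uint8_t *buf, size_t buf_len)
    {
        size_t off = 0;

        while (off + sizeof(struct ea_info_sketch) <= buf_len) {
            const struct ea_info_sketch *ea =
                (const struct ea_info_sketch *)(buf + off);
            size_t entry_len = sizeof(*ea) + ea->name_len + 1 + ea->value_len;

            /* The whole entry must fit in what is left of the buffer. */
            if (off + entry_len > buf_len)
                return -1;

            /* ... process the EA ... */

            if (!ea->next_entry_offset)
                break;
            /*
             * Advance by NextEntryOffset, validated against the remaining
             * buffer length rather than the previous entry's offset.
             */
            if (ea->next_entry_offset > buf_len - off)
                return -1;
            off += ea->next_entry_offset;
        }
        return 0;
    }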
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21598
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In commit 2b9b8f3b68 ("ksmbd: validate command payload size"), except
for the SMB2_OPLOCK_BREAK_HE command, the request size of other commands
is not checked, which is not expected. Fix it by adding a check for the
request size of the other commands.
Cc: stable@vger.kernel.org
Fixes: 2b9b8f3b68 ("ksmbd: validate command payload size")
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmTOmLAACgkQiiy9cAdy
T1EAXAv+L20j8z9XBHdNf2UYsft961NoS3058DC2PBd6xy3yxPDUgrBwFFhx2JGe
bSH6Q+LfZjJdlfh04iZjwXbEIjFjL8aWpD+gq1F0NenjNXncel39oDaxn7dNgsjl
FUfkXrsOk/UohnyqoCFPmvzt9dtltShXMjHknIPRcOnFAwtd2GSXvpzQEI3m07cB
OVQ2sbZ5U1TTxlOIcaJYDSBqpA0yuivohdJkcn7eJytnXXTfRGzFDstfGP18gWfg
UbdilYKXvsRp9Y+zR6Z0iv1j3llZTfNBtpF25X7nC3yDTQV36UIdWpKby1jSq/V0
7o6OuooMv7EPLU7ZP8RldWclyahoTKoV6F82LakKENbzYq75IDAPRfKZL5cd5yhk
PiGAky8aiAcdJZl3OFhIG3Fa9qGHK3MJDwxXV+RVpNq71UEyM6BhY6wUOOziC6Xk
xio+Mt2hPuAQLbOf25MMAB9uoz8IDSpcdudCJCwKC+zPSZXZAE/FSJBvzQUwkYLj
kkWoz+ke
=AY3g
-----END PGP SIGNATURE-----
Merge tag '6.5-rc4-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
- Fix DFS interlink problem (different namespace)
* tag '6.5-rc4-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix dfs link mount against w2k8
During unmount process of nilfs2, nothing holds nilfs_root structure after
nilfs2 detaches its writer in nilfs_detach_log_writer(). Previously,
nilfs_evict_inode() could cause use-after-free read for nilfs_root if
inodes are left in "garbage_list" and released by nilfs_dispose_list at
the end of nilfs_detach_log_writer(), and this bug was fixed by commit
9b5a04ac3a ("nilfs2: fix use-after-free bug of nilfs_root in
nilfs_evict_inode()").
However, it turned out that there is another possibility of UAF in the
call path where mark_inode_dirty_sync() is called from iput():
nilfs_detach_log_writer()
nilfs_dispose_list()
iput()
mark_inode_dirty_sync()
__mark_inode_dirty()
nilfs_dirty_inode()
__nilfs_mark_inode_dirty()
nilfs_load_inode_block() --> causes UAF of nilfs_root struct
This can happen after commit 0ae45f63d4 ("vfs: add support for a
lazytime mount option"), which changed iput() to call
mark_inode_dirty_sync() on its final reference if i_state has I_DIRTY_TIME
flag and i_nlink is non-zero.
This issue appears after commit 28a65b49eb ("nilfs2: do not write dirty
data after degenerating to read-only") when using the syzbot reproducer,
but the issue has potentially existed before.
Fix this issue by adding a "purging flag" to the nilfs structure, setting
that flag while disposing the "garbage_list" and checking it in
__nilfs_mark_inode_dirty().
Unlike commit 9b5a04ac3a ("nilfs2: fix use-after-free bug of nilfs_root
in nilfs_evict_inode()"), this patch does not rely on ns_writer to
determine whether to skip operations, so as not to break recovery on
mount. The nilfs_salvage_orphan_logs routine dirties the buffer of
salvaged data before attaching the log writer, so changing
__nilfs_mark_inode_dirty() to skip the operation when ns_writer is NULL
will cause recovery write to fail. The purpose of using the cleanup-only
flag is to allow for narrowing of such conditions.
Link: https://lkml.kernel.org/r/20230728191318.33047-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+74db8b3087f293d3a13a@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000b4e906060113fd63@google.com
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> # 4.0+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Some architectures do not populate the entire range categorised by
KCORE_TEXT, so we must ensure that the kernel address we read from is
valid.
Unfortunately there is no solution currently available to do so with a
purely iterator solution so reinstate the bounce buffer in this instance
so we can use copy_from_kernel_nofault() in order to avoid page faults
when regions are unmapped.
This change partly reverts commit 2e1c017077 ("fs/proc/kcore: avoid
bounce buffer for ktext data"), reinstating the bounce buffer, but adapts
the code to continue to use an iterator.
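A simplified sketch of the bounce-buffer pattern while keeping the iterator;
the function name and chunking are hypothetical, not the literal
fs/proc/kcore.c code:

    #include <linux/uio.h>
    #include <linux/uaccess.h>
    #include <linux/slab.h>

    static ssize_t read_ktext_bounced(struct iov_iter *iter, const void *src,
                                      size_t len)
    {
        char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
        size_t chunk = len < PAGE_SIZE ? len : PAGE_SIZE;
        ssize_t ret;

        if (!buf)
            return -ENOMEM;

        /*
         * copy_from_kernel_nofault() tolerates unmapped kernel addresses,
         * which copying straight from 'src' into the iterator would not.
         */
        if (copy_from_kernel_nofault(buf, src, chunk)) {
            kfree(buf);
            return -EFAULT;
        }

        ret = copy_to_iter(buf, chunk, iter);
        kfree(buf);
        return ret;
    }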
[lstoakes@gmail.com: correct comment to be strictly correct about reasoning]
Link: https://lkml.kernel.org/r/525a3f14-74fa-4c22-9fca-9dab4de8a0c3@lucifer.local
Link: https://lkml.kernel.org/r/20230731215021.70911-1-lstoakes@gmail.com
Fixes: 2e1c017077 ("fs/proc/kcore: avoid bounce buffer for ktext data")
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Closes: https://lore.kernel.org/all/ZHc2fm+9daF6cgCE@krava
Tested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Will Deacon <will@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
osd_request_timeout option and another fix to reduce the potential for
erroneous blocklisting -- this time in CephFS. All going to stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTNFFUTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5I8B/9a8C5ed0XfTadHcHX5VQsY3b//4rgp
0VYkQbjYnSCwrYRIPsvnL8LeLHzbcPGLpFAQXg7uUlmJ5dpaOz303hKmKt5GdyOR
qvWka3K4zeG177b6yc1srqs0cEsCLpQrn+krnvOl5v87QdFsCP/bsJMOrJ9mlhdM
9GjkjDRn6jvNyOLGbn3kIvwCRF9NH6/nHzjBcTUzvS8fBUye02o9C1H6ZQ7sYjKH
sJnmQCNCFHEqdaVjDZ7mw/doIrAbmTV6sgusuPjiF5bHILzX4oWG4UJmRpHFV//S
JPQgMp2DNjP8tW9aCVLVVVV5t5AKBr84etF59DaFNflk27U3COJWkE0a
=gw7n
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two patches to improve RBD exclusive lock interaction with
osd_request_timeout option and another fix to reduce the potential for
erroneous blocklisting -- this time in CephFS. All going to stable"
* tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client:
libceph: fix potential hang in ceph_osdc_notify()
rbd: prevent busy loop when requesting exclusive lock
ceph: defer stopping mdsc delayed_work
In commit 20ea1e7d13 ("file: always lock position for
FMODE_ATOMIC_POS") we ended up always taking the file pos lock, because
pidfd_getfd() could get a reference to the file even when it didn't have
an elevated file count due to threading or other sharing cases.
But Mateusz Guzik reports that the extra locking is actually measurable,
so let's re-introduce the optimization, and only force the locking for
directory traversal.
Directories need the lock for correctness reasons, while regular files
only need it for "POSIX semantics". Since pidfd_getfd() is about
debuggers etc special things that are _way_ outside of POSIX, we can
relax the rules for that case.
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/linux-fsdevel/20230803095311.ijpvhx3fyrbkasul@f/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have some reports of linux NFS clients that cannot satisfy a linux knfsd
server that always sets SEQ4_STATUS_RECALLABLE_STATE_REVOKED even though
those clients repeatedly walk all their known state using TEST_STATEID and
receive NFS4_OK for all.
It's possible for revoke_delegation() to set NFS4_REVOKED_DELEG_STID, then
nfsd4_free_stateid() finds the delegation and returns NFS4_OK to
FREE_STATEID. Afterward, revoke_delegation() moves the same delegation to
cl_revoked. This would produce the observed client/server effect.
Fix this by ensuring that the setting of sc_type to NFS4_REVOKED_DELEG_STID
and move to cl_revoked happens within the same cl_lock. This will allow
nfsd4_free_stateid() to properly remove the delegation from cl_revoked.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2217103
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2176575
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org # v4.17+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Fix data corruption caused by insufficient decompression on
deduplicated compressed extents;
- Drop a useless s_magic checking in erofs_kill_sb().
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZMuujhEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBEGrAP9XTlBrI1AT6FdsYp7Hh6v8Qj89XaLI5m40
X/W4TWTXbAD6A8Ae1lyq7Vd2T6OaZyyUSXVkudGHUju82X50nFExggU=
=biYc
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Fix data corruption caused by insufficient decompression on
deduplicated compressed extents
- Drop a useless s_magic checking in erofs_kill_sb()
* tag 'erofs-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: drop unnecessary WARN_ON() in erofs_kill_sb()
erofs: fix wrong primary bvec selection on deduplicated extents
- Fix page allocation failure from allocation bitmap by using kvmalloc_array/kvfree.
- Add the check to validate if filename entries exceeds max filename length.
- Fix potential deadlock condition from dir_emit*().
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmTKWboWHGxpbmtpbmpl
b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCIBGD/9MGxo2t2khD1TyT0tI4yyvzL8F
ICmHdpgc0DWpkySo1MtALL7G6WtT2hul+Pv9XapJI99WlwtSBfQnYWaYiQJ2JYWZ
dtE2ss5FL4FiQqUxdTbSfQsKF+bGc08muvu7gC8QvsUY3J4Cj8yf5wtawu/nkONR
tK2kqzFsEFcfN4y2dqtGESi/pVvFjpIoLTimZkO47e7mNTmiSwvxcRvnFiwe79pC
9V51NfxIwxHLtJk8i3s4farp/ehDEYU+EmoL1/wiFRtjdSddDqZv036NLIasXK3V
ZRIRqYjyxceES7sEShPdrw/PgGdsfQ+oMD2R6rmOoA6lt9aWon1Y/B36tC7mTObe
cAa5lvMk5paFoXfGgikF3/e51qMWo6eo4lGpJ1HHBHRLuVRlOskduXrJKo8Nuk/k
JtuQ0rzYrqzlDJ+whwnY3YzckyFSfEHRjNn9Nm8f6LbNGsn6C+/QqLGyXt7uiXAt
AM1pW3UMfDWHIZFRL6yFlrh6WDv4VKlK+h832x2D2F+8tPIBcAGw77+lMi1okQ6p
+aqe3kKRNKtyUJyW+VvIBXIK3p2hIYZCstInZ1CKhkD+Z+wIzKVpRctYNg01DaT5
YOGTE+SZH3lDTPVMP0SYUpQ2zAZcDBDS5lcq3HqqV8e4fEczmmR2p5cgbhv6rC2z
0bcHJH44B2illfK1pQ==
=S9Uj
-----END PGP SIGNATURE-----
Merge tag 'exfat-for-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fixes from Namjae Jeon:
- Fix page allocation failure from allocation bitmap by using
kvmalloc_array/kvfree
- Add the check to validate if filename entries exceeds max filename
length
- Fix potential deadlock condition from dir_emit*()
* tag 'exfat-for-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: release s_lock before calling dir_emit()
exfat: check if filename entries exceeds max filename length
exfat: use kvmalloc_array/kvfree instead of kmalloc_array/kfree
Customer reported that they couldn't mount their DFS link that was
seen by the client as a DFS interlink -- special form of DFS link
where its single target may point to a different DFS namespace -- and
it turned out that it was just a regular DFS link where its referral
header flags missed the StorageServers bit thus making the client
think it couldn't tree connect to target directly without requiring
further referrals.
When the DFS link referral header flags miss the StorageServers bit
and its target doesn't respond to any referrals, then tree connect to
it.
Fixes: a1c0d00572 ("cifs: share dfs connections and supers")
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Flushing the dirty buffer may take a long time if the cluster is
overloaded or if there is a network issue. So we should ping the
MDSs periodically to keep the sessions alive, else the MDS will blocklist
the kclient.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/61843
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When handling deduplicated compressed data, there can be multiple
decompressed extents pointing to the same compressed data in one shot.
In such cases, the bvecs which belong to the longest extent will be
selected as the primary bvecs for real decompressors to decode and the
other duplicated bvecs will be directly copied from the primary bvecs.
Previously, only relative offsets of the longest extent were checked to
decompress the primary bvecs. On rare occasions, it can be incorrect
if there are several extents with the same start relative offset.
As a result, some short bvecs could be selected for decompression and
then cause data corruption.
For example, as Shijie Sun reported off-list, considering the following
extents of a file:
117: 903345.. 915250 | 11905 : 385024.. 389120 | 4096
...
119: 919729.. 930323 | 10594 : 385024.. 389120 | 4096
...
124: 968881.. 980786 | 11905 : 385024.. 389120 | 4096
The start relative offset is the same: 2225, but extent 119 (919729..
930323) is shorter than the others.
Let's restrict the bvec length in addition to the start offset if bvecs
are not full.
Reported-by: Shijie Sun <sunshijie@xiaomi.com>
Fixes: 5c2a64252c ("erofs: introduce partial-referenced pclusters")
Tested-by: Shijie Sun <sunshijie@xiaomi.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230719065459.60083-1-hsiangkao@linux.alibaba.com
nfsd_splice_actor() has a clause in its loop that chops up a compound page
into individual pages such that if the same page is seen twice in a row, it
is discarded the second time. This is a problem with the advent of
shmem_splice_read() as that inserts zero_pages into the pipe in lieu of
pages that aren't present in the pagecache.
Fix this by assuming that the last page is being extended only if the
currently stored length + starting offset is not currently on a page
boundary.
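An illustrative standalone sketch of that boundary test, simplified from
nfsd_splice_actor():

    #include <stdbool.h>

    #define PAGE_SIZE_SKETCH 4096u

    static bool last_page_is_partial(unsigned int page_base,
                                     unsigned int page_len)
    {
        /*
         * Only if the data stored so far stops mid-page can the next
         * splice chunk be a continuation of the same page. If it ends
         * exactly on a page boundary, a repeated page pointer (e.g. the
         * zero page inserted by shmem_splice_read()) must be treated as
         * a new page, not discarded.
         */
        return (page_base + page_len) % PAGE_SIZE_SKETCH != 0;
    }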
This can be tested by NFS-exporting a tmpfs filesystem on the test machine
and truncating it to more than a page in size (eg. truncate -s 8192) and
then reading it by NFS. The first page will be all zeros, but thereafter
garbage will be read.
Note: I wonder if we can ever get a situation now where we get a splice
that gives us contiguous parts of a page in separate actor calls. As NFSD
can only be splicing from a file (I think), there are only three sources of
the page: copy_splice_read(), shmem_splice_read() and file_splice_read().
The first allocates pages for the data it reads, so the problem cannot
occur; the second should never see a partial page; and the third waits for
each page to become available before we're allowed to read from it.
Fixes: bd194b1871 ("shmem: Implement splice-read")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
cc: Hugh Dickins <hughd@google.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmTFWPsACgkQiiy9cAdy
T1FXCwwAjVij/kFLHCyt7jS6spnch2Cj3Qji6EA89/pex8ChyVONJu6uBt7/PFG3
8rrbV/cVWih+gnDe7/faBlvdjfvwWa7gHCz4eZVAV8mpVG/Pb7AqgT6mZAXPjBJi
+3rTyODHOlB0oQFXcdp538LmTeXvkun+TWagUnQDTr/OAFQQiUsgpMojX5zQ636L
F9fbtFdQbv5v1TQKrykYHwqjGl2Fuz1JM6AAwTzdMse+4hv2yuhzJPvYJuWG3c5H
bzoFQ+nSNg2WeeKoy+njwy/xZfjc/laT/Ipb1zrrZOUC68/cawWfmA3sKYF+34UV
bGSIpJK6Pckvectl3iLy9PouVYXV5odIbRyrV9WUJvvmkWm28eQSptv+H1YQNBFh
fwPgcF/7YU6FfTiITxY4qUxmOmzGI2BDRVvzLlCrbUCau/Ztb5nolZRSuuciR4L+
v1tg/LJxJ3A1WgXBRfH3v0VpAxZIH/lWQbj1ZfFW1uUbKdnuCD1SsO+5QKv70DTF
f43UkB/r
=BsSZ
-----END PGP SIGNATURE-----
Merge tag '6.5-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Four small SMB3 client fixes:
- two reconnect fixes (to address the case where non-default
iocharset gets incorrectly overridden at reconnect with the
default charset)
- fix for NTLMSSP_AUTH request setting a flag incorrectly)
- Add missing check for invalid tlink (tree connection) in ioctl"
* tag '6.5-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: add missing return value check for cifs_sb_tlink
smb3: do not set NTLMSSP_VERSION flag for negotiate not auth request
cifs: fix charset issue in reconnection
fs/nls: make load_nls() take a const parameter
Commit a2225d931f ("autofs: remove left-over autofs4 stubs")
promised the removal of the fs/autofs/Kconfig fragment for AUTOFS4_FS
within a couple of releases, but five years later this still has not
happened yet, and AUTOFS4_FS is still enabled in 63 defconfigs.
Get rid of it mechanically:
git grep -l CONFIG_AUTOFS4_FS -- '*defconfig' |
xargs sed -i 's/AUTOFS4_FS/AUTOFS_FS/'
Also just remove the AUTOFS4_FS config option stub. Anybody who hasn't
regenerated their config file in the last five years will need to just
get the new name right when they do.
Signed-off-by: Sven Joachim <svenjoac@gmx.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
or aren't considered serious enough to justify backporting.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZMRGzwAKCRDdBJ7gKXxA
jtWoAQDqD5yton3O/tPcCC2X7QbV5bsgghIqvQFo5yWvuiJdNwEAkKwLnXISAadg
RmVCgsfQ+4CCsJgp7RpPlMS43m2AQgI=
=Rib2
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-07-28-15-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"11 hotfixes. Five are cc:stable and the remainder address post-6.4
issues or aren't considered serious enough to justify backporting"
* tag 'mm-hotfixes-stable-2023-07-28-15-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/memory-failure: fix hardware poison check in unpoison_memory()
proc/vmcore: fix signedness bug in read_from_oldmem()
mailmap: update remaining active codeaurora.org email addresses
mm: lock VMA in dup_anon_vma() before setting ->anon_vma
mm: fix memory ordering for mm_lock_seq and vm_lock_seq
scripts/spelling.txt: remove 'thead' as a typo
mm/pagewalk: fix EFI_PGT_DUMP of espfix area
shmem: minor fixes to splice-read implementation
tmpfs: fix Documentation of noswap and huge mount options
Revert "um: Use swap() to make code cleaner"
mm/damon/core-test: initialise context before test in damon_test_set_attrs()
blocklisting (fencing) with a couple of prerequisites and a fixup to
prevent metrics from being sent to the MDS even just once after that
has been disabled by the user. All marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTD7MgTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi3SBB/4nHdfSQwy0z2+PM766sUKxSlmaRw8X
4AJyGAIGj5BnHHhtluwLpEYfrh3wfyCRaNYgS64jdsudbUBxPKIIWn2lEFxtyWbC
w0R2uEc+NGJLJOYfJ+lBP06Q2r6qk7N6OGNy6qLaN+v6xJ8WPw7H3fJBLVhnPgMq
7lkACRN+0P5Xt6ZJ57kbWWiFQ+vjv7bbDa0P9zMl6uCgoYsIpvrskqygx+gHbdsq
IcnpsHu3F0ycYAJT5eJ5GcCcThvwbNjWdbJy1fERah7U/LNcX/S3To9V5LPmydOQ
tYAWMlC/1a99fr+jTYF0Pu5GLUdK0UMKeX04ZN3SKON4pORurpypw3n8
=rw72
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to reduce the potential for erroneous RBD exclusive lock
blocklisting (fencing) with a couple of prerequisites and a fixup to
prevent metrics from being sent to the MDS even just once after that
has been disabled by the user. All marked for stable"
* tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client:
rbd: retrieve and check lock owner twice before blocklisting
rbd: harden get_lock_owner_info() a bit
rbd: make get_lock_owner_info() return a single locker or NULL
ceph: never send metrics if disable_send_metrics is set
Most of these clean up warnings we've gotten out of compilation tools, but
several of them were from inspection while hunting down a couple of
regressions.
The most important one to pull is 75b396821c
(fs/9p: remove unnecessary and overrestrictive check)
which caused a regression for some folks by restricting mmap
in any case where writeback caches weren't enabled.
Most of the other bugs caught via inspection were type mismatches.
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmTDH8cACgkQiP/V+0pf
/5hSJQ//b59cDliC7Knf9B2Of1UsLJ2wYIbxWVYKLwYarKFn3tmtO5dPtWZrQzjB
Kz6fif5z1c0WdjNFLifs/XNqUq5znX/TY8bV/NmOg8VlaoJqmUQSSYnNQOWZCFKT
zwxC6BO6gPNNIkJN2xQ8oOq11Qon/nbZbuN9P2VDcT5Yr2KmFx6FHRcrBNRYAm3E
UzFdjkLrLef3VrvegJNGM3Wv2HqyNBA6QhifZBjDkydtDPMd9fRNns7Q60AARR9K
aqXV6SihE/Ox7sSmVNjTzYF67eq5Xjt+sSzo2SdfOaZxVIa6wf0UXQuFqmHts6Zs
QUCdXS5YbQAwQfdkm22rnTIxAwsbEpFOGGweUvMBXzZbl/sq/PK4Nt6DpCS8ZFi3
81Z5Ey+Q4yaxwdirP521M4ao2Ae2Fzg12bqDTNssZdOYGcXBqBfWiR5IfMbbkgWq
WzCVI3V/LshQ75pXQyS4BtW/29C2nN7g3jLrF3Q5OTe7XmHMCZFvtP4lKvY0piQy
++XoDs1LCJWSZebfkNa05L5nhQ1mYhwiZutHTtF3ejTTiJvcJXQ4xHYLzjOON+4i
blLTpgWLO0rIRAmX0I8GwPi6q0xL4rFP4XGGz/LDppRQkRa13vtzFtMvuldPyEq7
g4pcLkI3SPbL982qYbg8UO+GjO/Q9M/DafXQVlUyDw04TQbT8Jg=
=TAC1
-----END PGP SIGNATURE-----
Merge tag '9p-fixes-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9p fixes from Eric Van Hensbergen:
"Misc set of fixes for 9p.
Most of these clean up warnings we've gotten out of compilation tools,
but several of them were from inspection while hunting down a couple
of regressions.
The most important one is 75b396821c ("fs/9p: remove unnecessary and
overrestrictive check") which caused a regression for some folks by
restricting mmap in any case where writeback caches weren't enabled.
Most of the other bugs caught via inspection were type mismatches"
* tag '9p-fixes-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: Remove unused extern declaration
9p: remove dead stores (variable set again without being read)
9p: virtio: skip incrementing unused variable
9p: virtio: make sure 'offs' is initialized in zc_request
9p: virtio: fix unlikely null pointer deref in handle_rerror
9p: fix ignored return value in v9fs_dir_release
fs/9p: remove unnecessary invalidate_inode_pages2
fs/9p: fix type mismatch in file cache mode helper
fs/9p: fix typo in comparison logic for cache mode
fs/9p: remove unnecessary and overrestrictive check
fs/9p: Fix a datatype used with V9FS_DIRECT_IO
The bug is the error handling:
if (tmp < nr_bytes) {
"tmp" can hold negative error codes but because "nr_bytes" is type size_t
the negative error codes are treated as very high positive values
(success). Fix this by changing "nr_bytes" to type ssize_t. The
"nr_bytes" variable is used to store values between 1 and PAGE_SIZE and
they can fit in ssize_t without any issue.
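A standalone demonstration of the signedness trap described above; the
values are hypothetical and chosen only to show the conversion:

    #include <stdio.h>
    #include <sys/types.h>

    int main(void)
    {
        ssize_t tmp = -14;              /* e.g. -EFAULT from a failed read */
        size_t  nr_bytes = 4096;        /* bytes we asked for */
        ssize_t nr_bytes_fixed = 4096;

        /* tmp is converted to size_t here, becoming a huge positive value,
         * so the error is mistaken for a successful full read. */
        if (tmp < nr_bytes)
            puts("size_t comparison: error detected");
        else
            puts("size_t comparison: error MISSED");

        /* With both sides signed the check works as intended. */
        if (tmp < nr_bytes_fixed)
            puts("ssize_t comparison: error detected");

        return 0;
    }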
Link: https://lkml.kernel.org/r/b55f7eed-1c65-4adc-95d1-6c7c65a54a6e@moroto.mountain
Fixes: 5d8de293c2 ("vmcore: convert copy_oldmem_page() to take an iov_iter")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTCTAsACgkQxWXV+ddt
WDvhYhAAluWrfM2ZzhY/tdDeKUNpf0NIAFGZIV4QP/2E43yIPC2+xMPqW/AnBnIP
k28gOhgoH7LBP/cr0IrFHMz8Glges3cHz1UxFjJZgjiU3mAA0mgkIttPpzms7vqi
3SVUxL2bJkebJy53nOpZcHlrcWveg+q0hTUslquCYBb3dA4gb61HBwzA2e0wKFeB
wYw/gQtEy3TkHQPAxVjUF28ASoaroNKsE9QjfLZV0FDn0u0zBFxqpqj7bFUay++i
sG3nPVZsqKcgIX7sUSwrpv4XAFu8fHz+GAQqCNqTxKCJ0ZZzsgzJtKs+12rv7dZC
EvRt0jEt+DgwvmEy7j250TEbcI9rMaQuny8yt2j9sNKH/m9bW0BjptCtwghDoL89
0D6qqicHbA+dJNq8/kDyxV6xC2Git2Ck0fpOfiU7YzhAFECZc/DkidvXa1keMUay
usspO+YHOjDtlq0zJ0xixbxCseJfrj4habieVKZ/CnAvb84082ZiLcMxFqop/ewB
WHKNB0O2+P78xoa7/Be6tp/w1HaaW8ZHvkPicD9d4khKJrXAKLNc/Xny4OqRT14z
sWWaFuNjC7kIUT15EAQNj0wgymA7XcTL9gM1uuSO95PN+M3j4CleApzEvR3dn9FX
gmoxuwVfVsJKKcwo6WFByqzu03kuSladEFasSHQAJbh3jyU9LUY=
=Y/po
-----END PGP SIGNATURE-----
Merge tag 'for-6.5-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix accounting of global block reserve size when block group tree is
enabled
- the async discard has been enabled in 6.2 unconditionally, but for
zoned mode it does not make that much sense to do it asynchronously
as the zones are reset as needed
- error handling and proper error value propagation fixes
* tag 'for-6.5-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check for commit error at btrfs_attach_transaction_barrier()
btrfs: check if the transaction was aborted at btrfs_wait_for_commit()
btrfs: remove BUG_ON()'s in add_new_free_space()
btrfs: account block group tree when calculating global reserve size
btrfs: zoned: do not enable async discard