We introduced a helper function to be used by non-cifsd threads to
mark the connection for reconnect. For multichannel, when only
a particular channel needs to be reconnected, this had a bug.
This change fixes that by marking that particular channel
for reconnect.
Fixes: dca65818c8 ("cifs: use a different reconnect helper for non-cifsd threads")
Cc: stable@vger.kernel.org
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The following UAF was triggered when running fstests generic/072 with
KASAN enabled against Windows Server 2022 and mount options
'multichannel,max_channels=2,vers=3.1.1,mfsymlinks,noperm'
BUG: KASAN: slab-use-after-free in smb2_query_info_compound+0x423/0x6d0 [cifs]
Read of size 8 at addr ffff888014941048 by task xfs_io/27534
CPU: 0 PID: 27534 Comm: xfs_io Not tainted 6.6.0-rc7 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
Call Trace:
dump_stack_lvl+0x4a/0x80
print_report+0xcf/0x650
? srso_alias_return_thunk+0x5/0x7f
? srso_alias_return_thunk+0x5/0x7f
? __phys_addr+0x46/0x90
kasan_report+0xda/0x110
? smb2_query_info_compound+0x423/0x6d0 [cifs]
? smb2_query_info_compound+0x423/0x6d0 [cifs]
smb2_query_info_compound+0x423/0x6d0 [cifs]
? __pfx_smb2_query_info_compound+0x10/0x10 [cifs]
? srso_alias_return_thunk+0x5/0x7f
? __stack_depot_save+0x39/0x480
? kasan_save_stack+0x33/0x60
? kasan_set_track+0x25/0x30
? ____kasan_slab_free+0x126/0x170
smb2_queryfs+0xc2/0x2c0 [cifs]
? __pfx_smb2_queryfs+0x10/0x10 [cifs]
? __pfx___lock_acquire+0x10/0x10
smb311_queryfs+0x210/0x220 [cifs]
? __pfx_smb311_queryfs+0x10/0x10 [cifs]
? srso_alias_return_thunk+0x5/0x7f
? __lock_acquire+0x480/0x26c0
? lock_release+0x1ed/0x640
? srso_alias_return_thunk+0x5/0x7f
? do_raw_spin_unlock+0x9b/0x100
cifs_statfs+0x18c/0x4b0 [cifs]
statfs_by_dentry+0x9b/0xf0
fd_statfs+0x4e/0xb0
__do_sys_fstatfs+0x7f/0xe0
? __pfx___do_sys_fstatfs+0x10/0x10
? srso_alias_return_thunk+0x5/0x7f
? lockdep_hardirqs_on_prepare+0x136/0x200
? srso_alias_return_thunk+0x5/0x7f
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Allocated by task 27534:
kasan_save_stack+0x33/0x60
kasan_set_track+0x25/0x30
__kasan_kmalloc+0x8f/0xa0
open_cached_dir+0x71b/0x1240 [cifs]
smb2_query_info_compound+0x5c3/0x6d0 [cifs]
smb2_queryfs+0xc2/0x2c0 [cifs]
smb311_queryfs+0x210/0x220 [cifs]
cifs_statfs+0x18c/0x4b0 [cifs]
statfs_by_dentry+0x9b/0xf0
fd_statfs+0x4e/0xb0
__do_sys_fstatfs+0x7f/0xe0
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Freed by task 27534:
kasan_save_stack+0x33/0x60
kasan_set_track+0x25/0x30
kasan_save_free_info+0x2b/0x50
____kasan_slab_free+0x126/0x170
slab_free_freelist_hook+0xd0/0x1e0
__kmem_cache_free+0x9d/0x1b0
open_cached_dir+0xff5/0x1240 [cifs]
smb2_query_info_compound+0x5c3/0x6d0 [cifs]
smb2_queryfs+0xc2/0x2c0 [cifs]
This is a race between open_cached_dir() and cached_dir_lease_break()
where the cache entry for the open directory handle receives a lease
break while creating it. And before returning from open_cached_dir(),
we put the last reference of the new @cfid because of
!@cfid->has_lease.
Besides the UAF, while running xfstests a lot of missed lease breaks
have been noticed in tests that run several concurrent statfs(2) calls
on those cached fids
CIFS: VFS: \\w22-root1.gandalf.test No task to wake, unknown frame...
CIFS: VFS: \\w22-root1.gandalf.test Cmd: 18 Err: 0x0 Flags: 0x1...
CIFS: VFS: \\w22-root1.gandalf.test smb buf 00000000715bfe83 len 108
CIFS: VFS: Dump pending requests:
CIFS: VFS: \\w22-root1.gandalf.test No task to wake, unknown frame...
CIFS: VFS: \\w22-root1.gandalf.test Cmd: 18 Err: 0x0 Flags: 0x1...
CIFS: VFS: \\w22-root1.gandalf.test smb buf 000000005aa7316e len 108
...
To fix both, in open_cached_dir() ensure that @cfid->has_lease is set
right before sending out compounded request so that any potential
lease break will be get processed by demultiplex thread while we're
still caching @cfid. And, if open failed for some reason, re-check
@cfid->has_lease to decide whether or not put lease reference.
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If @ses->chan_count <= 1, then for-loop body will not be executed so
no need to check it twice.
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We were passing 0 as the xid for the call to query
server interfaces. This is not great for debugging.
This change adds a real xid.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In the output of /proc/fs/cifs/DebugData, we do not
print the server->capabilities field today.
With this change, we will do that.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Simplify symlink_hash() by using crypto_shash_digest() instead of an
init+update+final sequence. This should also improve performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
All release_mid() callers seem to hold a reference of @mid so there is
no need to call kref_put(&mid->refcount, __release_mid) under
@server->mid_lock spinlock. If they don't, then an use-after-free bug
would have occurred anyways.
By getting rid of such spinlock also fixes a potential deadlock as
shown below
CPU 0 CPU 1
------------------------------------------------------------------
cifs_demultiplex_thread() cifs_debug_data_proc_show()
release_mid()
spin_lock(&server->mid_lock);
spin_lock(&cifs_tcp_ses_lock)
spin_lock(&server->mid_lock)
__release_mid()
smb2_find_smb_tcon()
spin_lock(&cifs_tcp_ses_lock) *deadlock*
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes some xfstests including generic/564 and generic/157
The "sfu" mount option can be useful for creating special files (character
and block devices in particular) but could not create FIFOs. It did
recognize existing empty files with the "system" attribute flag as FIFOs
but this is too general, so to support creating FIFOs more safely use a new
tag (but the same length as those for char and block devices ie "IntxLNK"
and "IntxBLK") "LnxFIFO" to indicate that the file should be treated as a
FIFO (when mounted with the "sfu"). For some additional context note that
"sfu" followed the way that "Services for Unix" on Windows handled these
special files (at least for character and block devices and symlinks),
which is different than newer Windows which can handle special files
as reparse points (which isn't an option to many servers).
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add structs and defines for new SMB3.1.1 command, server to client notification.
See MS-SMB2 section 2.2.44
Signed-off-by: Steve French <stfrench@microsoft.com>
The NTLM authenticate message currently sets the NTLMSSP_NEGOTIATE_VERSION
flag but does not populate the VERSION structure. This commit fixes this
bug by ensuring that the flag is set and the version details are included
in the message.
Signed-off-by: Meetakshi Setiya <msetiya@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
For example:
touch -h -t 02011200 testfile
where testfile is a symlink would not change the timestamp, but
touch -t 02011200 testfile
does work to change the timestamp of the target
Suggested-by: David Howells <dhowells@redhat.com>
Reported-by: Micah Veilleux <micah.veilleux@iba-group.com>
Closes: https://bugzilla.samba.org/show_bug.cgi?id=14476
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
* Fix a bug where a writev consisting of a bunch of sub-fsblock writes
where the last buffer address is invalid could lead to an infinite
loop.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZTFcYAAKCRBKO3ySh0YR
pg6NAP4zKkrjdzOKyhu+NJDu4lnB4+8eq0PPnW5r5fw4155wTwEA1PKWp5vtKYRk
A3QLraLNS0HxpsguHpfD8KFrPN6LPA8=
=93ew
-----END PGP SIGNATURE-----
Merge tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
- Fix a bug where a writev consisting of a bunch of sub-fsblock writes
where the last buffer address is invalid could lead to an infinite
loop
* tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix short copy in iomap_write_iter()
Stable Fix:
* Fix a pNFS hang in nfs4_evict_inode()
Bugfixes:
* Force update of suid/sgid bits after an NFS v4.2 ALLOCATE op
* Fix a potential oops in nfs_inode_remove_request()
* Check the validity of the layout pointer in ff_layout_mirror_prepare_stats()
* Fix incorrectly marking the pNFS MDS with USE_PNFS_DS in some cases
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmUy4cAACgkQ18tUv7Cl
QOstABAAtK5pXSbzXXQY1IIVIxgIT9tGRMPwd+IVUwL0qRIhTqGkiax6UuZVH9id
NxrTQFrrxC4np2X4atYdC4LggOqx6wcdNLQ9oJ56zULHUtycgjqCSjFrKbA3nFa6
tn4EccPvKzjsmsbJW1rCq97XIRqZY1XotjDA3MR72YSnsOncdoLr6ng6virjOxWe
t8K8WS4tScD0kwvclQLpsEDSFEEdjTYORdS3RVXVz4crM45gikmTI7bSyZ4vc0lT
TafhFYMjcHUNyKF8MP0DguYSGTtHxCHuM/0yYZw0ZiqLdZeT7Pk/GVecV6qmMD9W
REkidcFLV53RR4qjUvL0HuM71/cX+yfwGaj6e8nZg5b7s1pJB46IG58ZmQaPwyzP
n9/zxZHIRrNSkj6e+cwFpgpz7v7wjTOsXVUqBLE9q0Rsm71L8TuzW0G/lghY8tE+
TZ4rMVlZnypFybELmEgoRXRZa0G41SeQmKYNMv+wJIwpNyxbKkI3FgpAx0A4RuyP
prYK4S55GxqKiOrP22USHRpvkhxdzeIkmnPGuy/A04VYaVJwANjUI0w0SAgg1/Oj
iNLjvO80wIL7d3fRLpjcjiQ2DtVt63WcnMgR0dV138959vh9CKAF7u7xVdW4Zsk7
cO5/WAFR5cUh2veswsi69+oivXweP57pi0XWmiJamKKckS/PPuQ=
=Z3lY
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Stable Fix:
- Fix a pNFS hang in nfs4_evict_inode()
Fixes:
- Force update of suid/sgid bits after an NFS v4.2 ALLOCATE op
- Fix a potential oops in nfs_inode_remove_request()
- Check the validity of the layout pointer in ff_layout_mirror_prepare_stats()
- Fix incorrectly marking the pNFS MDS with USE_PNFS_DS in some cases"
* tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS server
pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_stats
pNFS: Fix a hang in nfs4_evict_inode()
NFS: Fix potential oops in nfs_inode_remove_request()
nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE op
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmUyw0wACgkQnJ2qBz9k
QNknFgf/TMbqrnLyro0JUY6w4b9mgXTGFqqnaSuopOGK31vlpfiNR6mYw+vGv82E
FcN4CSQFBIK8v7DU4pgCzbD1OxGIdJuWz7tjnI4ntr/jMM+3pqqKYhu+VkKInrBB
HEIMe/WjM0/LbX/wid6xdT3Bcz6lbAySXJFVtxU45umkuv8RGODCAr6Gf1jX3q7m
LsIv8ESCmau5hyesp1Te4N8bv7dK8x3FPpaX12BB8DkuRlaqmzwHXc0ExpMRhII8
LBllG2rUIu2GNx8AqWULw9LyBsNaZSeAF2iUl5taXaDXw8Js8eQzH/Y+wS5KaJNa
M7kszLlAByav/MSuUWWJHOqwgMhhDQ==
=bOXL
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fanotify fix from Jan Kara:
"Disable superblock / mount marks for filesystems that can encode file
handles but not open them (currently only overlayfs).
It is not clear the functionality is useful in any way so let's better
disable it before someone comes up with some creative misuse"
* tag 'fsnotify_for_v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: limit reporting of event with non-decodeable file handles
Starting with commit 5d8edfb900 ("iomap: Copy larger chunks from
userspace"), iomap_write_iter() can get into endless loop. This can
be reproduced with LTP writev07 which uses partially valid iovecs:
struct iovec wr_iovec[] = {
{ buffer, 64 },
{ bad_addr, 64 },
{ buffer + 64, 64 },
{ buffer + 64 * 2, 64 },
};
commit bc1bb416bb ("generic_perform_write()/iomap_write_actor():
saner logics for short copy") previously introduced the logic, which
made short copy retry in next iteration with amount of "bytes" it
managed to copy:
if (unlikely(status == 0)) {
/*
* A short copy made iomap_write_end() reject the
* thing entirely. Might be memory poisoning
* halfway through, might be a race with munmap,
* might be severe memory pressure.
*/
if (copied)
bytes = copied;
However, since 5d8edfb900 "bytes" is no longer carried into next
iteration, because it is now always initialized at the beginning of
the loop. And for iov_iter_count < PAGE_SIZE, "bytes" ends up with
same value as previous iteration, making the loop retry same copy
over and over, which leads to writev07 testcase hanging.
Make next iteration retry with amount of bytes we managed to copy.
Fixes: 5d8edfb900 ("iomap: Copy larger chunks from userspace")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTD6IQAKCRCRxhvAZXjc
opXLAQC9X+ECnGUAOy/kvOrEBkBb7G4BuZ8XsrnL976riVNp0gEA85LaJV9Ow7Xk
51k/1ujhYkglQbCsa0zo+mI4ueE3wAQ=
=Dqrj
-----END PGP SIGNATURE-----
Merge tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fix from Christian Brauner:
"An openat() call from io_uring triggering an audit call can apparently
cause the refcount of struct filename to be incremented from multiple
threads concurrently during async execution, triggering a refcount
underflow and hitting a BUG_ON(). That bug has been lurking around
since at least v5.16 apparently.
Switch to an atomic counter to fix that. The underflow check is
downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily
remove that check altogether tbh"
* tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
audit,io_uring: io_uring openat triggers audit reference count underflow
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmUwyyAACgkQqbAzH4Mk
B7aQsg/9FohvuET2YOqYlmtgbq1YjFlWkVxoNB2Be4gxKEqRcFN4V/O8PrOD0ZBX
KH8ASWGp9EKQlVKucAPUhSaZMv4u2HaKTVyRnH5KpUmNG7rC7EzRgYDY4xA5RfjL
aIxgVOhaSibFBGfys2KoS8MHFgZ9mcCWZZ1154ueSkZfX9ghytZcql5i4Ahi2OGQ
u+aa/tkhoIS9ZKWp/uIz4R+fDClrG8UoiWmJ1LoOHQ5YM3enboOld6Zz1CjbC9yN
ldInX7klL9aHRf3SoAeRVCyL23e4GB5R4+QmBhZQemhwy0jF0/7E0kCmCK0wYFWn
3t1M/+i0S7o7rI+utTA2mTuZzxnNPMXJsn2TnpH9vv1o633WOlL0mrvsM7Z/NJDz
VShPinQ9eJGvZQ9kYuMfZODtK0BP7OV8mRi7aXinzHucG+ISB2fBGxsnA0lX9EfT
fHix1oNI33lfazVzaStJ3hs/n9KXwP/9P49rBcLh75O2t3TBKc2T84qpywNx+W6S
9GDMV5V2FNWf/hMCL8pcTtab3MvpbeU5DqKnxW1D1vu2vdGiHVum/Aq26aE5W7nl
EcE5pjKppdNZgk3Qwj190vym7d8lbK4Esua6ieNM/qvWb7hAGoZO7Tjcypvj+YMY
W4aiJTW/jULPT2tkUvLvHg/r5DFHi30v6eCeTCrPfivxFRIAzBQ=
=ScrJ
-----END PGP SIGNATURE-----
Merge tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 fixes from Konstantin Komarov:
- memory leak
- some logic errors, NULL dereferences
- some code was refactored
- more sanity checks
* tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3:
fs/ntfs3: Avoid possible memory leak
fs/ntfs3: Fix directory element type detection
fs/ntfs3: Fix possible null-pointer dereference in hdr_find_e()
fs/ntfs3: Fix OOB read in ntfs_init_from_boot
fs/ntfs3: fix panic about slab-out-of-bounds caused by ntfs_list_ea()
fs/ntfs3: Fix NULL pointer dereference on error in attr_allocate_frame()
fs/ntfs3: Fix possible NULL-ptr-deref in ni_readpage_cmpr()
fs/ntfs3: Do not allow to change label if volume is read-only
fs/ntfs3: Add more info into /proc/fs/ntfs3/<dev>/volinfo
fs/ntfs3: Refactoring and comments
fs/ntfs3: Fix alternative boot searching
fs/ntfs3: Allow repeated call to ntfs3_put_sbi
fs/ntfs3: Use inode_set_ctime_to_ts instead of inode_set_ctime
fs/ntfs3: Fix shift-out-of-bounds in ntfs_fill_super
fs/ntfs3: fix deadlock in mark_as_free_ex
fs/ntfs3: Add more attributes checks in mi_enum_attr()
fs/ntfs3: Use kvmalloc instead of kmalloc(... __GFP_NOWARN)
fs/ntfs3: Write immediately updated ntfs state
fs/ntfs3: Add ckeck in ni_update_parent()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmUuntEACgkQxWXV+ddt
WDssdQ/9Fo6tN+MCH5ISAcvsW6WvBUWT62MrDnzawh96QhUf3NYf9sjME7QqwHHv
w60SDiqRlAd5UzxdIPC4Qa/6GVZZh2yFLzew3l8Fh6anxhjO5argdsfx1Wv4ADk/
FHI8zs6EZiTlk0JmEnHNclliZfaDutBRQiPL+HZx4+FCrJweS5U/4Jpg7vdfp/tp
eWdJ51pDM8iyqGTsP7a7/VaL5wLoJhbdD9wYgupZUhvY6g2tCZ71/hNiWdbKtCK8
EyQxXiAlc+k1UflOx6Xip1HLIh6HmKwxntXxRy+yj4IvJ3PhI+KS5Nqdl35TszN9
6y9MRo3oCU+2y89Yay4HZZb6DLxcAi6VwpyswnntodFQ+ICXEw7ZaNi3rSO+FCO8
KxfhLniMD5gflRP4gy+o9iZxgVQ75nmiPgBt53r+sAKZ7lv86x84DJ/ZUqL8EV0e
OJhxdzhoT0Ks8OstIuE87fgzUCjqMcgAavxcn1psKBC6/JY9v6OneA8qauSswkKs
P+diJIqZHHOBQVKFedqdIrDU6AstivSBq0ToPBslbBlcy97EO4IRoiMIw+QgHPYn
CHsPHtooBmxPyw+4HTFuzY1NIrSeUFYxTDAs9p5kMPmltkVAlLPcrpGZVya9tjds
l/YuwY2f0C9Q1pjcAc9FcN8Y5kLRCYNEWMl0M1VpC22KgjRN6r0=
=GrNu
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Fix a bug in chunk size decision that could lead to suboptimal
placement and filling patterns"
* tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix stripe length calculation for non-zoned data chunk allocation
Commit a95aef69a7 ("fanotify: support reporting non-decodeable file
handles") merged in v6.5-rc1, added the ability to use an fanotify group
with FAN_REPORT_FID mode to watch filesystems that do not support nfs
export, but do know how to encode non-decodeable file handles, with the
newly introduced AT_HANDLE_FID flag.
At the time that this commit was merged, there were no filesystems
in-tree with those traits.
Commit 16aac5ad1f ("ovl: support encoding non-decodable file handles"),
merged in v6.6-rc1, added this trait to overlayfs, thus allowing fanotify
watching of overlayfs with FAN_REPORT_FID mode.
In retrospect, allowing an fanotify filesystem/mount mark on such
filesystem in FAN_REPORT_FID mode will result in getting events with
file handles, without the ability to resolve the filesystem objects from
those file handles (i.e. no open_by_handle_at() support).
For v6.6, the safer option would be to allow this mode for inode marks
only, where the caller has the opportunity to use name_to_handle_at() at
the time of setting the mark. In the future we can revise this decision.
Fixes: a95aef69a7 ("fanotify: support reporting non-decodeable file handles")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20231018100000.2453965-2-amir73il@gmail.com>
This patches fixes commit 51d674a5e4 "NFSv4.1: use
EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that
commit was to mark EXCHANGE_ID to the DS with the appropriate
flag.
However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS
and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only
remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that
to the MDS.
Instead, just mark the pnfs path exclusively.
Fixes: 51d674a5e4 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that we check the layout pointer and validity after dereferencing
it in ff_layout_mirror_prepare_stats.
Fixes: 08e2e5bc6c ("pNFS/flexfiles: Clean up layoutstats")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We are not allowed to call pnfs_mark_matching_lsegs_return() without
also holding a reference to the layout header, since doing so could lead
to the reference count going to zero when we call
pnfs_layout_remove_lseg(). This again can lead to a hang when we get to
nfs4_evict_inode() and are unable to clear the layout pointer.
pnfs_layout_return_unused_byserver() is guilty of this behaviour, and
has been seen to trigger the refcount warning prior to a hang.
Fixes: b6d49ecd10 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit f6fca3917b "btrfs: store chunk size in space-info struct"
broke data chunk allocations on non-zoned multi-device filesystems when
using default chunk_size. Commit 5da431b71d "btrfs: fix the max chunk
size and stripe length calculation" partially fixed that, and this patch
completes the fix for that case.
After commit f6fca3917b and 5da431b71d, the sequence of events for
a data chunk allocation on a non-zoned filesystem is:
1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
unmodified. Before f6fca3917b, ctl->max_stripe_len value was
1 GiB for non-zoned data chunks and not configurable.
2. btrfs_create_chunk calls gather_device_info which consumes
and produces more fields of chunk_ctl.
3. gather_device_info multiplies ctl->max_stripe_len by
ctl->dev_stripes (which is 1 in all cases except dup)
and calls find_free_dev_extent with that number as num_bytes.
4. find_free_dev_extent locates the first dev_extent hole on
a device which is at least as large as num_bytes. With default
max_chunk_size from f6fca3917b, it finds the first hole which is
longer than 10 GiB, or the largest hole if that hole is shorter
than 10 GiB. This is different from the pre-f6fca3917b4d
behavior, where num_bytes is 1 GiB, and find_free_dev_extent
may choose a different hole.
5. gather_device_info repeats step 4 with all devices to find
the first or largest dev_extent hole that can be allocated on
each device.
6. gather_device_info sorts the device list by the hole size
on each device, using total unallocated space on each device to
break ties, then returns to btrfs_create_chunk with the list.
7. btrfs_create_chunk calls decide_stripe_size_regular.
8. decide_stripe_size_regular finds the largest stripe_len that
fits across the first nr_devs device dev_extent holes that were
found by gather_device_info (and satisfies other constraints
on stripe_len that are not relevant here).
9. decide_stripe_size_regular caps the length of the stripe it
computed at 1 GiB. This cap appeared in 5da431b71d to correct
one of the other regressions introduced in f6fca3917b.
10. btrfs_create_chunk creates a new chunk with the above
computed size and number of devices.
At step 4, gather_device_info() has found a location where stripe up to
10 GiB in length could be allocated on several devices, and selected
which devices should have a dev_extent allocated on them, but at step
9, only 1 GiB of the space that was found on each device can be used.
This mismatch causes new suboptimal chunk allocation cases that did not
occur in pre-f6fca3917b4d kernels.
Consider a filesystem using raid1 profile with 3 devices. After some
balances, device 1 has 10x 1 GiB unallocated space, while devices 2
and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
space, but distributed across different numbers of dev_extent holes.
For visualization, let's ignore all the chunks that were allocated before
this point, and focus on the remaining holes:
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
Device 2: [__________] (10 GiB contig unallocated)
Device 3: [__________] (10 GiB contig unallocated)
Before f6fca3917b, the allocator would fill these optimally by
allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
([13]), or 2 and 3 ([23]):
[after 0 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [__________] (10 GiB)
Device 3: [__________] (10 GiB)
[after 1 chunk allocation]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
Device 2: [12] [_________] (9 GiB)
Device 3: [__________] (10 GiB)
[after 2 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [12] [_________] (9 GiB)
Device 3: [13] [_________] (9 GiB)
[after 3 chunk allocations]
Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
Device 2: [12] [12] [________] (8 GiB)
Device 3: [13] [_________] (9 GiB)
[...]
[after 12 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 13 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 14 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)
[after 15 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)
This allocates all of the space with no waste. The sorting function used
by gather_device_info considers free space holes above 1 GiB in length
to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
long hole on each device, all the holes appear equal in the sort, and the
comparison falls back to sorting devices by total free space. This keeps
usable space on each device equal so they can all be filled completely.
After f6fca3917b, the allocator prefers the devices with larger holes
over the devices with more free space, so it makes bad allocation choices:
[after 1 chunk allocation]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [_________] (9 GiB)
Device 3: [23] [_________] (9 GiB)
[after 2 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [________] (8 GiB)
Device 3: [23] [23] [________] (8 GiB)
[after 3 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [_______] (7 GiB)
Device 3: [23] [23] [23] [_______] (7 GiB)
[...]
[after 9 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 10 chunk allocations]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 11 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)
No further allocations are possible, with 8 GiB wasted (4 GiB of data
space). The sort in gather_device_info now considers free space in
holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
At that point, with only 1 GiB unallocated on every device, the largest
hole length on each device is equal at 1 GiB, so the sort finally moves
to ordering the devices with the most free space, but by this time it
is too late to make use of the free space on device 1.
Note that it's possible to contrive a case where the pre-f6fca3917b4d
allocator fails the same way, but these cases generally have extensive
dev_extent fragmentation as a precondition (e.g. many holes of 768M
in length on one device, and few holes 1 GiB in length on the others).
With the regression in f6fca3917b, bad chunk allocation can occur even
under optimal conditions, when all dev_extent holes are exact multiples
of stripe_len in length, as in the example above.
Also note that post-f6fca3917b4d kernels do treat dev_extent holes
larger than 10 GiB as equal, so the bad behavior won't show up on a
freshly formatted filesystem; however, as the filesystem ages and fills
up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
can show up as a failure to balance after adding or removing devices,
or an unexpected shortfall in available space due to unequal allocation.
To fix the regression and make data chunk allocation work
again, set ctl->max_stripe_len back to the original SZ_1G, or
space_info->chunk_size if that's smaller (the latter can happen if the
user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).
While researching the background of the earlier commits, I found that an
identical fix was already proposed at:
https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/
The previous review missed one detail: ctl->max_stripe_len is used
before decide_stripe_size_regular() is called, when it is too late for
the changes in that function to have any effect. ctl->max_stripe_len is
not used directly by decide_stripe_size_regular(), but the parameter
does heavily influence the per-device free space data presented to
the function.
Fixes: f6fca3917b ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmUrsD8ACgkQEVvVyTe/
1WpCmg//XfJm6TdeFFuoEoYezytHXnEHnIM+kHMOqmtY1AoaW4884dvevKOCBsUF
BXXfnIsJc1GmKBDCOxaWFXTRN9iLHTLww8dZzJFX1vDYsqo96t7TbbssyOuOy9vp
K4uYX/pZzuggyBCUZvz/UdjZPEAwPVmsU/uXe/gkHQLS+wpOH0e7hOB7p91I0iZx
0JdleyoDyw2AtqWLscGrTYqozW+lNMl0smADWaGfLcn29rRkR10mhq5heYa+4b2C
TiG29rpIDgXyhz9w+Tq4f3oxJDrE1qlQGX7G8tJjISS9UDHxrZqFun0+nr15Y8Ge
3YstSaMlarAx+UGtuEX80QcYHlYDAWnCKwkRAD8wBtKLeO0pG6BSmCCVi34CAoN+
NBy5KQeoDV96vqoIc8lDmXWOgOkzOogJzzspOWT3H7jmonTjkYL/rYGuV1V6lpuk
Ngopc7HxjiZeGb9Zchr1KOT6xJzjYm74/Ph8I9ECPAamgO6UxlOcsBF2uc7OZIlL
Jzv5Hl+ITiXzxa/KQonx2nEtdO8INf5wxHjy4nlnnY0dcibGCA1wi/3/8/vqJBIf
tuRoQJ/eHruTf23dwPRFwo4JoOViSv3Up9GFyEodllBA6DSAHK96low8eeEZAw9e
MeyWy45dewgH1jvx3bb8Sd29CVxUXvMEl33hYngu09/cv7XulMw=
=9LTH
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs fixes from Amir Goldstein:
- Various fixes for regressions due to conversion to new mount
api in v6.5
- Disable a new mount option syntax (append lowerdir) that was
added in v6.5 because we plan to add a different lowerdir
append syntax in v6.7
* tag 'ovl-fixes-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: temporarily disable appending lowedirs
ovl: fix regression in showing lowerdir mount option
ovl: fix regression in parsing of mount options with escaped comma
fs: factor out vfs_parse_monolithic_sep() helper
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUrPnIACgkQiiy9cAdy
T1FTewv/ZgF5CX5M8EpLvUENcdzJQIh2JolP2PeSbqVjf/pI/LKrkGP6fHEo8+S2
Mnntxik3aNZ3hQ755SpGw+mT4hgr2umsDDrZxLVjvDDiLxqW9zlr55JdBS5xJvxN
enKJ8wDbP+Usn4Gb0TfY9xrPWgHyYSn9+dYoPYuh1Z1zqmhFfIRpGSBHbwUM6Ssa
vONpZUdnzdRnuHKyU3+xgU4Pr6KZLnriM/iGgjncCHJRCem0f27W50xsXkkpoVIg
GIhNRe//YL1dlqbpQpXY+4+KI/3d2JRZeVpnzCJ7ucuwyjq5KNJSnTI7Jsgqnf3V
ADe9m/HnknOG3lkQrzojxTNGLmXqlvsxUUNVtjGRccAHaOJDCsKA5Wv5L2jJahdP
ynuXz5iwtQqaHPkfIL5D48RYiMTVemmsHdP+cdnleXkcU8GpN7cd2Gnt591Wo8by
lcRS01pMlRfSh6SyKcDghEHbb2BDKyRSbIlvsy+85CBdKAqOpJXl/96p67UPcQjI
/K9g3cY6
=X4B/
-----END PGP SIGNATURE-----
Merge tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix for possible double free in RPC read
- Add additional check to clarify smb2_open path and quiet Coverity
- Fix incorrect error rsp in a compounding path
- Fix to properly fail open of file with pending delete on close
* tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix potential double free on smb2_read_pipe() error path
ksmbd: fix Null pointer dereferences in ksmbd_update_fstate()
ksmbd: fix wrong error response status by using set_smb2_rsp_status()
ksmbd: not allow to open file if delelete on close bit is set
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUrPK0ACgkQiiy9cAdy
T1FULQv+IFTddGvRcreJFFeq5pndSGNuN4gs2dUkc+wQaS80wH1TI/tmmOB8DNx7
amXhPv4bo9inbus0ZAZLlOY6Vr7bAhQTGjXLJEVwes+yYxpxbJVmNt9aPCA6yiaP
OAC5jKUEIdIgQkcwpSaUYkiCayB0i38R300HSFLFk2XqPOt5QRCIxrBm5Ssq7yN2
OckglUl0p+FY54kI4kL6sfXi7tRV6rqjzofIxJJuiUqU3IUJk+G1b/pNouWLhEpi
wQPmpdcRvV70QGinzdsuVWNQV8HEZ+zvQQkdiIqKPqEl1hMzndFYMfSNnvDcDN+W
xMmcnIW0ei/3qjYDZ3wjnMEH/978nBrQEtUL/dpAXOmVr+U72NKQE22IlWEICSo8
GZ7JJxcB20WHNHzSFH1+Y+7kgNTgwLd59uK27xqm1/8+qcvhf/MMnjA/ym9F4pyq
4IEHPy7keQ5apSdBhtItJxg5etjpjujQomk1yRxrIU1JAnF0LpL/BFX85aT19QxX
hbbqWeyd
=ZlF9
-----END PGP SIGNATURE-----
Merge tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix caching race with open_cached_dir and laundromat cleanup of
cached dirs (addresses a problem spotted with xfstest run with
directory leases enabled)
- reduce excessive resource usage of laundromat threads
* tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: prevent new fids from being removed by laundromat
smb: client: make laundromat a delayed worker
Kernel v6.5 converted overlayfs to new mount api.
As an added bonus, it also added a feature to allow appending lowerdirs
using lowerdir=:/lower2,lowerdir=::/data3 syntax.
This new syntax has raised some concerns regarding escaping of colons.
We decided to try and disable this syntax, which hasn't been in the wild
for so long and introduce it again in 6.7 using explicit mount options
lowerdir+=/lower2,datadir+=/data3.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegsr3A4YgF2YBevWa6n3=AcP7hNndG6EPMu3ncvV-AM71A@mail.gmail.com/
Fixes: b36a5780cb ("ovl: modify layer parameter parsing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
* Fix calculation of offset of AG's last block and its length.
* Update incore AG block count when shrinking an AG.
* Process free extents to busy list in FIFO order.
* Make XFS report its i_version as the STATX_CHANGE_COOKIE.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZSjJjQAKCRAH7y4RirJu
9HK3AQDLHHw5nlrVShiOtps543OOJ1uBA2PO/lxvvM3X6GPbbwEA8uYLqNPThpF2
+a5h8y9LV6uTEDsdpSoEPBmCgXKhYQg=
=RmDo
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Fix calculation of offset of AG's last block and its length
- Update incore AG block count when shrinking an AG
- Process free extents to busy list in FIFO order
- Make XFS report its i_version as the STATX_CHANGE_COOKIE
* tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: reinstate the old i_version counter as STATX_CHANGE_COOKIE
xfs: Remove duplicate include
xfs: correct calculation for agend and blockcount
xfs: process free extents to busy list in FIFO order
xfs: adjust the incore perag block_count when shrinking
Before commit b36a5780cb ("ovl: modify layer parameter parsing"),
spaces and commas in lowerdir mount option value used to be escaped using
seq_show_option().
In current upstream, when lowerdir value has a space, it is not escaped
in /proc/mounts, e.g.:
none /mnt overlay rw,relatime,lowerdir=l l,upperdir=u,workdir=w 0 0
which results in broken output of the mount utility:
none on /mnt type overlay (rw,relatime,lowerdir=l)
Store the original lowerdir mount options before unescaping and show
them using the same escaping used for seq_show_option() in addition to
escaping the colon separator character.
Fixes: b36a5780cb ("ovl: modify layer parameter parsing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
kernel_connect() which recently grown protection against someone using
BPF to rewrite the address. All but one marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmUpYoQTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHziwvmCACK13dkAaupcHyteYPloBgtJLNixR3X
6++nHCOXGtE7cK6n1snobFQgp/d5BqSKAeymyLqDjJOJVqG/5n8FR1gwcuY/Ogdj
Aju2Mkt7R/R/V6kvmCbqGSwiOvxZP1gBFkKcluRNQkFNP3boKw4vmJGq29Rabbyr
66NfZSETsR/H4JhAWvUyVmffrvxIx11THnvmrAnprGVKoK72HwOQFx0H4KLD2Hio
aN/yiiR15yDS6hL8dfilt+QGO6o+ZWgvRl1GOzqjwISgfeUUYVknMJXrEf+20kHs
kX3tWaLWVZsU1FyVFRO5HPYLAREjALXstFO4K1OHPERM7EipWNinIHoS
=wyYP
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Fixes for an overreaching WARN_ON, two error paths and a switch to
kernel_connect() which recently grown protection against someone using
BPF to rewrite the address.
All but one marked for stable"
* tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client:
ceph: fix type promotion bug on 32bit systems
libceph: use kernel_connect()
ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr()
ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.
A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.
These first part of the system call performs two increments that do not race.
do_syscall_64()
__do_sys_io_uring_enter()
io_submit_sqes()
io_openat_prep()
__io_openat_prep()
getname()
getname_flags() /* update 1 (increment) */
__audit_getname() /* update 2 (increment) */
The openat op is queued to an io_uring worker thread which starts the
opportunity for a race. The system call exit performs one decrement.
do_syscall_64()
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
__audit_syscall_exit()
audit_reset_context()
putname() /* update 3 (decrement) */
The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.
io_wqe_worker()
io_worker_handle_work()
io_wq_submit_work()
io_issue_sqe()
io_openat()
io_openat2()
do_filp_open()
path_openat()
__audit_inode() /* update 4 (increment) */
putname() /* update 5 (decrement) */
__audit_uring_exit()
audit_reset_context()
putname() /* update 6 (decrement) */
The fix is to change the refcnt member of struct audit_names
from int to atomic_t.
kernel BUG at fs/namei.c:262!
Call Trace:
...
? putname+0x68/0x70
audit_reset_context.part.0.constprop.0+0xe1/0x300
__audit_uring_exit+0xda/0x1c0
io_issue_sqe+0x1f3/0x450
? lock_timer_base+0x3b/0xd0
io_wq_submit_work+0x8d/0x2b0
? __try_to_del_timer_sync+0x67/0xa0
io_worker_handle_work+0x17c/0x2b0
io_wqe_worker+0x10a/0x350
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Fix new smatch warnings:
fs/smb/server/smb2pdu.c:6131 smb2_read_pipe() error: double free of 'rpc_resp'
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Coverity Scan report the following one. This report is a false alarm.
Because fp is never NULL when rc is zero. This patch add null check for fp
in ksmbd_update_fstate to make alarm silence.
*** CID 1568583: Null pointer dereferences (FORWARD_NULL)
/fs/smb/server/smb2pdu.c: 3408 in smb2_open()
3402 path_put(&path);
3403 path_put(&parent_path);
3404 }
3405 ksmbd_revert_fsids(work);
3406 err_out1:
3407 if (!rc) {
>>> CID 1568583: Null pointer dereferences (FORWARD_NULL)
>>> Passing null pointer "fp" to "ksmbd_update_fstate", which dereferences it.
3408 ksmbd_update_fstate(&work->sess->file_table, fp, FP_INITED);
3409 rc = ksmbd_iov_pin_rsp(work, (void *)rsp, iov_len);
3410 }
3411 if (rc) {
3412 if (rc == -EINVAL)
3413 rsp->hdr.Status = STATUS_INVALID_PARAMETER;
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Reported-by: Coverity Scan <scan-admin@coverity.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
set_smb2_rsp_status() after __process_request() sets the wrong error
status. This patch resets all iov vectors and sets the error status
on clean one.
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Cthon test fail with the following error.
check for proper open/unlink operation
nfsjunk files before unlink:
-rwxr-xr-x 1 root root 0 9월 25 11:03 ./nfs2y8Jm9
./nfs2y8Jm9 open; unlink ret = 0
nfsjunk files after unlink:
-rwxr-xr-x 1 root root 0 9월 25 11:03 ./nfs2y8Jm9
data compare ok
nfsjunk files after close:
ls: cannot access './nfs2y8Jm9': No such file or directory
special tests failed
Cthon expect to second unlink failure when file is already unlinked.
ksmbd can not allow to open file if flags of ksmbd inode is set with
S_DEL_ON_CLS flags.
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Ever since commit 91c7794713 ("ovl: allow filenames with comma"), the
following example was legit overlayfs mount options:
mount -t overlay overlay -o 'lowerdir=/tmp/a\,b/lower' /mnt
The conversion to new mount api moved to using the common helper
generic_parse_monolithic() and discarded the specialized ovl_next_opt()
option separator.
Bring back ovl_next_opt() and use vfs_parse_monolithic_sep() to fix the
regression.
Reported-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
Closes: https://lore.kernel.org/r/8da307fb-9318-cf78-8a27-ba5c5a0aef6d@alum.mit.edu/
Fixes: 1784fbc2ed ("ovl: port to new mount api")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Factor out vfs_parse_monolithic_sep() from generic_parse_monolithic(),
so filesystems could use it with a custom option separator callback.
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Check if @cfid->time is set in laundromat so we guarantee that only
fully cached fids will be selected for removal. While we're at it,
add missing locks to protect access of @cfid fields in order to avoid
races with open_cached_dir() and cfids_laundromat_worker(),
respectively.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
By having laundromat kthread processing cached directories on every
second turned out to be overkill, especially when having multiple SMB
mounts.
Relax it by using a delayed worker instead that gets scheduled on
every @dir_cache_timeout (default=30) seconds per tcon.
This also fixes the 1s delay when tearing down tcon.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The handling of STATX_CHANGE_COOKIE was moved into generic_fillattr in
commit 0d72b92883 (fs: pass the request_mask to generic_fillattr), but
we didn't account for the fact that xfs doesn't call generic_fillattr at
all.
Make XFS report its i_version as the STATX_CHANGE_COOKIE.
Fixes: 0d72b92883 (fs: pass the request_mask to generic_fillattr)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
./fs/xfs/scrub/xfile.c: xfs_format.h is included more than once.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6209
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
The agend should be "start + length - 1", then, blockcount should be
"end + 1 - start". Correct 2 calculation mistakes.
Also, rename "agend" to "range_agend" because it's not the end of the AG
per se; it's the end of the dead region within an AG's agblock space.
Fixes: 5cf32f63b0 ("xfs: fix the calculation for "end" and "length"")
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmUmbQMACgkQxWXV+ddt
WDtBshAAqOwMrqRwOKOze/LQ4Kl9A8p0l+XxYdt7nRSY7n15xpN6uLVsc0gTwO5n
HOquDe2ivrpdOXI6ArcujTTFHaBGX+mmubU/yi54MH0iwuCR32dYhj3j7mDUIf6F
GpTEjgxIdE4AMUw7e7Rzqbdcmq//+H+bBdm+2YkNNEBmPP06483GYthjKJ7zWdrn
pPksR9f611aHU4jZnKZJeHgZh4iVrIszIxkjeMD5NJ6KUb8LJmISLOOJzowkmugt
JH8bd1F/+/53MmpntWGnHnURI9J6UxBL0cNnYW26FjY21N3RGR2BumotW73hYaD7
6fwuxs4ZWlLqHUtIOaAVUUSfEVse7k/i7m4+sDB1JLh26alqUHunqCFV+3ROTnOY
jHwWW+qyQhxJnfgtHyDrwcybfW0V41hhmDIhoeezkSDtbnacNTMfwzXS2ELcp0KJ
/13TCruweFN0g4lBR8HfbKJCCzPayxCirtubx1nIMRysHfo10aDWz1MSvr3mkOyo
gwif/j9BMKN0+fg6l9eZNHWHfQ8qfL3dvSRBlvJcP5mnG5ZuVkxJUFH0m/UfdFbZ
sbeJHSP9wex5tJKmG3kJPAuZWwGLHCiMMCnsWoq+02KV8IXrw3Ji5z/8Hhsb51Ps
r7BGRO2A2rD9XLJtc9BCiwiV177/WknmTUtRpOyxHFfb37bKmHg=
=Wz/9
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A revert of recent mount option parsing fix, this breaks mounts with
security options.
The second patch is a flexible array annotation"
* tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size()
Revert "btrfs: reject unknown mount options early"
When we're adding extents to the busy discard list, add them to the tail
of the list so that we get FIFO order. For FITRIM commands, this means
that we send discard bios sorted in order from longest to shortest, like
we did before commit 89cfa89960.
For transactions that are freeing extents, this puts them in the
transaction's busy list in FIFO order as well, which shouldn't make any
noticeable difference.
Fixes: 89cfa89960 ("xfs: reduce AGF hold times during fstrim operations")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we reduce the number of blocks in an AG, we must update the incore
geometry values as well.
Fixes: 0800169e3e ("xfs: Pre-calculate per-AG agbno geometry")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Once a folio's private data has been cleared, it's possible for another
process to clear the folio->mapping (e.g. via invalidate_complete_folio2
or evict_mapping_folio), so it wouldn't be safe to call
nfs_page_to_inode() after that.
Fixes: 0c493b5cf1 ("NFS: Convert buffered writes to use folios")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The Linux NFS server strips the SUID and SGID from the file mode
on ALLOCATE op.
Modify _nfs42_proc_fallocate to add NFS_INO_REVAL_FORCED to
nfs_set_cache_invalid's argument to force update of the file
mode suid/sgid bit.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>