-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
5iv+EdRSNA==
=wVi8
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Device initialization memory leak fixes (Keith)
- More constants defined (Weiwen)
- Target debugfs support (Hannes)
- PCIe subsystem reset enhancements (Keith)
- Queue-depth multipath policy (Redhat and PureStorage)
- Implement get_unique_id (Christoph)
- Authentication error fixes (Gaosheng)
- MD updates via Song
- sync_action fix and refactoring (Yu Kuai)
- Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)
- Fix loop detach/open race (Gulam)
- Fix lower control limit for blk-throttle (Yu)
- Add module descriptions to various drivers (Jeff)
- Add support for atomic writes for block devices, and statx reporting
for same. Includes SCSI and NVMe (John, Prasad, Alan)
- Add IO priority information to block trace points (Dongliang)
- Various zone improvements and tweaks (Damien)
- mq-deadline tag reservation improvements (Bart)
- Ignore direct reclaim swap writes in writeback throttling (Baokun)
- Block integrity improvements and fixes (Anuj)
- Add basic support for rust based block drivers. Has a dummy null_blk
variant for now (Andreas)
- Series converting driver settings to queue limits, and cleanups and
fixes related to that (Christoph)
- Cleanup for poking too deeply into the bvec internals, in preparation
for DMA mapping API changes (Christoph)
- Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
Ming, Zhu, Damien, Christophe, Chaitanya)
* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
floppy: add missing MODULE_DESCRIPTION() macro
loop: add missing MODULE_DESCRIPTION() macro
ublk_drv: add missing MODULE_DESCRIPTION() macro
xen/blkback: add missing MODULE_DESCRIPTION() macro
block/rnbd: Constify struct kobj_type
block: take offset into account in blk_bvec_map_sg again
block: fix get_max_segment_size() warning
loop: Don't bother validating blocksize
virtio_blk: Don't bother validating blocksize
null_blk: Don't bother validating blocksize
block: Validate logical block size in blk_validate_limits()
virtio_blk: Fix default logical block size fallback
nvmet-auth: fix nvmet_auth hash error handling
nvme: implement ->get_unique_id
block: pass a phys_addr_t to get_max_segment_size
block: add a bvec_phys helper
blk-lib: check for kill signal in ioctl BLKZEROOUT
block: limit the Write Zeroes to manually writing zeroes fallback
block: refacto blkdev_issue_zeroout
block: move read-only and supported checks into (__)blkdev_issue_zeroout
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEHLQAKCRCRxhvAZXjc
ot3sAP9TBUM+vzUcQ5SVcUnSX+y3dhOGYnquORBbRc/Y6AzLMAEAu3TcsvdoaWfy
6ImUaju6iLqy9cCY3uDlNmJR16E4IgE=
=Bwpy
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
"This contains some minor work for the iomap subsystem:
- Add documentation on the design of iomap and how to port to it
- Optimize iomap_read_folio()
- Bring back the change to iomap_write_end() to no increase i_size.
This is accompanied by a change to xfs to reserve blocks for
truncating large realtime inodes to avoid exposing stale data when
iomap_write_end() stops increasing i_size"
* tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: don't increase i_size in iomap_write_end()
xfs: reserve blocks for truncating large realtime inode
Documentation: the design of iomap and how to port
iomap: Optimize iomap_read_folio
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEHIgAKCRCRxhvAZXjc
ovTvAQDvxpq1CIJz4arkf6lkI1VX1PcSfyV1+aIsXkrGF01tfwD+PekJH0xJ7RqU
ysuMo1uG3i1OO2xIdrdwCXJDng4QggE=
=LtRf
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
"This contains work to make it possible to derive namespace file
descriptors from pidfd file descriptors.
Right now it is already possible to use a pidfd with setns() to
atomically change multiple namespaces at the same time. In other
words, it is possible to switch to the namespace context of a process
using a pidfd. There is no need to first open namespace file
descriptors via procfs.
The work included here is an extension of these abilities by allowing
to open namespace file descriptors using a pidfd. This means it is now
possible to interact with namespaces without ever touching procfs.
To this end a new set of ioctls() on pidfds is introduced covering all
supported namespace types"
* tag 'vfs-6.11.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: allow retrieval of namespace file descriptors
nsfs: add open_namespace()
nsproxy: add helper to go from arbitrary namespace to ns_common
nsproxy: add a cleanup helper for nsproxy
file: add take_fd() cleanup helper
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEHFAAKCRCRxhvAZXjc
olb9AQDsA6PLSHsRIVGO3E+syvL+lXC21QdsbAkSgADbqbSC5wEA+nfG2adiWKXc
8CKGMrqXb3j75UfIRIHnM6D03wm0ywo=
=ybN0
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace-fs updates from Christian Brauner:
"This adds ioctls allowing to translate PIDs between PID namespaces.
The motivating use-case comes from LXCFS which is a tiny fuse
filesystem used to virtualize various aspects of procfs. LXCFS is run
on the host. The files and directories it creates can be bind-mounted
by e.g. a container at startup and mounted over the various procfs
files the container wishes to have virtualized.
When e.g. a read request for uptime is received, LXCFS will receive
the pid of the reader. In order to virtualize the corresponding read,
LXCFS needs to know the pid of the init process of the reader's pid
namespace.
In order to do this, LXCFS first needs to fork() two helper processes.
The first helper process setns() to the readers pid namespace. The
second helper process is needed to create a process that is a proper
member of the pid namespace.
The second helper process then creates a ucred message with ucred.pid
set to 1 and sends it back to LXCFS. The kernel will translate the
ucred.pid field to the corresponding pid number in LXCFS's pid
namespace. This way LXCFS can learn the init pid number of the
reader's pid namespace and can go on to virtualize.
Since these two forks() are costly LXCFS maintains an init pid cache
that caches a given pid for a fixed amount of time. The cache is
pruned during new read requests. However, even with the cache the hit
of the two forks() is singificant when a very large number of
containers are running.
So this adds a simple set of ioctls that let's a caller translate PIDs
from and into a given PID namespace. This significantly improves
performance with a very simple change.
To protect against races pidfds can be used to check whether the
process is still valid"
* tag 'vfs-6.11.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nsfs: add pid translation ioctls
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEHCgAKCRCRxhvAZXjc
om+gAQCC4nJqJqs9bJZIItRtZ71GnxZQO3HVohhIlNM2KKh0VgEA47JhD0O0Srfb
CleII4cQWqB32BfB/vMeff6hpNa7SA4=
=7ltk
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount query updates from Christian Brauner:
"This contains work to extend the abilities of listmount() and
statmount() and various fixes and cleanups.
Features:
- Allow iterating through mounts via listmount() from newest to
oldest. This makes it possible for mount(8) to keep iterating the
mount table in reverse order so it gets newest mounts first.
- Relax permissions on listmount() and statmount().
It's not necessary to have capabilities in the initial namespace:
it is sufficient to have capabilities in the owning namespace of
the mount namespace we're located in to list unreachable mounts in
that namespace.
- Extend both listmount() and statmount() to list and stat mounts in
foreign mount namespaces.
Currently the only way to iterate over mount entries in mount
namespaces that aren't in the caller's mount namespace is by
crawling through /proc in order to find /proc/<pid>/mountinfo for
the relevant mount namespace.
This is both very clumsy and hugely inefficient. So extend struct
mnt_id_req with a new member that allows to specify the mount
namespace id of the mount namespace we want to look at.
Luckily internally we already have most of the infrastructure for
this so we just need to expose it to userspace. Give userspace a
way to retrieve the id of a mount namespace via statmount() and
through a new nsfs ioctl() on mount namespace file descriptor.
This comes with appropriate selftests.
- Expose mount options through statmount().
Currently if userspace wants to get mount options for a mount and
with statmount(), they still have to open /proc/<pid>/mountinfo to
parse mount options. Simply the information through statmount()
directly.
Afterwards it's possible to only rely on statmount() and
listmount() to retrieve all and more information than
/proc/<pid>/mountinfo provides.
This comes with appropriate selftests.
Fixes:
- Avoid copying to userspace under the namespace semaphore in
listmount.
Cleanups:
- Simplify the error handling in listmount by relying on our newly
added cleanup infrastructure.
- Refuse invalid mount ids early for both listmount and statmount"
* tag 'vfs-6.11.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reject invalid last mount id early
fs: refuse mnt id requests with invalid ids early
fs: find rootfs mount of the mount namespace
fs: only copy to userspace on success in listmount()
sefltests: extend the statmount test for mount options
fs: use guard for namespace_sem in statmount()
fs: export mount options via statmount()
fs: rename show_mnt_opts -> show_vfsmnt_opts
selftests: add a test for the foreign mnt ns extensions
fs: add an ioctl to get the mnt ns id from nsfs
fs: Allow statmount() in foreign mount namespace
fs: Allow listmount() in foreign mount namespace
fs: export the mount ns id via statmount
fs: keep an index of current mount namespaces
fs: relax permissions for statmount()
listmount: allow listing in reverse order
fs: relax permissions for listmount()
fs: simplify error handling
fs: don't copy to userspace under namespace semaphore
path: add cleanup helper
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc
ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V
Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU=
=eDC8
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode / dentry updates from Christian Brauner:
"This contains smaller performance improvements to inodes and dentries:
inode:
- Add rcu based inode lookup variants.
They avoid one inode hash lock acquire in the common case thereby
significantly reducing contention. We already support RCU-based
operations but didn't take advantage of them during inode
insertion.
Callers of iget_locked() get the improvement without any code
changes. Callers that need a custom callback can switch to
iget5_locked_rcu() as e.g., did btrfs.
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB ram vm:
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
Long-term we should pick up the effort to introduce more
fine-grained locking and possibly improve on the currently used
hash implementation.
- Start zeroing i_state in inode_init_always() instead of doing it in
individual filesystems.
This allows us to remove an unneeded lock acquire in new_inode()
and not burden individual filesystems with this.
dcache:
- Move d_lockref out of the area used by RCU lookup to avoid
cacheline ping poing because the embedded name is sharing a
cacheline with d_lockref.
- Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
up with 128 bytes in total"
* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: fix dentry size
vfs: move d_lockref out of the area used by RCU lookup
bcachefs: remove now spurious i_state initialization
xfs: remove now spurious i_state initialization in xfs_inode_alloc
vfs: partially sanitize i_state zeroing on inode creation
xfs: preserve i_state around inode_init_always in xfs_reinit_inode
btrfs: use iget5_locked_rcu
vfs: add rcu-based find_inode variants for iget ops
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEGjAAKCRCRxhvAZXjc
okXfAP4tFUYszUsSqYdsgy9UvXw3Dr5zOIzQmN++NdjGkbU5fgEAs2ystqEfJgr3
v7XvGbu65CvL4/slNhBZOU4yekGx5Qc=
=C4QD
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount API updates from Christian Brauner:
- Add a generic helper to parse uid and gid mount options.
Currently we open-code the same logic in various filesystems which is
error prone, especially since the verification of uid and gid mount
options is a sensitive operation in the face of idmappings.
Add a generic helper and convert all filesystems over to it. Make
sure that filesystems that are mountable in unprivileged containers
verify that the specified uid and gid can be represented in the
owning namespace of the filesystem.
- Convert hostfs to the new mount api.
* tag 'vfs-6.11.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fuse: Convert to new uid/gid option parsing helpers
fuse: verify {g,u}id mount options correctly
fat: Convert to new uid/gid option parsing helpers
fat: Convert to new mount api
fat: move debug into fat_mount_options
vboxsf: Convert to new uid/gid option parsing helpers
tracefs: Convert to new uid/gid option parsing helpers
smb: client: Convert to new uid/gid option parsing helpers
tmpfs: Convert to new uid/gid option parsing helpers
ntfs3: Convert to new uid/gid option parsing helpers
isofs: Convert to new uid/gid option parsing helpers
hugetlbfs: Convert to new uid/gid option parsing helpers
ext4: Convert to new uid/gid option parsing helpers
exfat: Convert to new uid/gid option parsing helpers
efivarfs: Convert to new uid/gid option parsing helpers
debugfs: Convert to new uid/gid option parsing helpers
autofs: Convert to new uid/gid option parsing helpers
fs_parse: add uid & gid option option parsing helpers
hostfs: Add const qualifier to host_root in hostfs_fill_super()
hostfs: convert hostfs to use the new mount API
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEGnAAKCRCRxhvAZXjc
om4UAP4xnLigNBMHqoWTvfbBjxfoPcWTq53M9CM/T2t2iz4AsQD+NiUQLmNO/LDJ
8aZ0K5TI2+uWYUHbyLabmKWwAm33UQ0=
=ZeeN
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.casefold' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs casefolding updates from Christian Brauner:
"This contains some work to simplify the handling of casefolded names:
- Simplify the handling of casefolded names in f2fs and ext4 by
keeping the names as a qstr to avoiding unnecessary conversions
- Introduce a new generic_ci_match() libfs case-insensitive lookup
helper and use it in both f2fs and ext4 allowing to remove the
filesystem specific implementations
- Remove a bunch of ifdefs by making the unicode build checks part of
the code flow"
* tag 'vfs-6.11.casefold' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
f2fs: Move CONFIG_UNICODE defguards into the code flow
ext4: Move CONFIG_UNICODE defguards into the code flow
f2fs: Reuse generic_ci_match for ci comparisons
ext4: Reuse generic_ci_match for ci comparisons
libfs: Introduce case-insensitive string comparison helper
f2fs: Simplify the handling of cached casefolded names
ext4: Simplify the handling of cached casefolded names
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEGSgAKCRCRxhvAZXjc
opvwAQCBfq5sxn/P34MNheHAVJOkQlozaflLIRM/CRN60HXV3AEAiph0RJBszvDu
VhJ9VZ21zypvpS34enBfPKp1ZmyHPwI=
=hNqR
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.pg_error' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull PG_error removal updates from Christian Brauner:
"This contains work to remove almost all remaining users of PG_error
from filesystems and filesystem helper libraries. An additional patch
will be coming in via the jfs tree which tests the PG_error bit.
Afterwards nothing will be testing it anymore and it's safe to remove
all places which set or clear the PG_error bit.
The goal is to fully remove PG_error by the next merge window"
* tag 'vfs-6.11.pg_error' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
buffer: Remove calls to set and clear the folio error flag
iomap: Remove calls to set and clear folio error flag
vboxsf: Convert vboxsf_read_folio() to use a folio
ufs: Remove call to set the folio error flag
romfs: Convert romfs_read_folio() to use a folio
reiserfs: Remove call to folio_set_error()
orangefs: Remove calls to set/clear the error flag
nfs: Remove calls to folio_set_error
jffs2: Remove calls to set/clear the folio error flag
hostfs: Convert hostfs_read_folio() to use a folio
isofs: Convert rock_ridge_symlink_read_folio to use a folio
hpfs: Convert hpfs_symlink_read_folio to use a folio
efs: Convert efs_symlink_read_folio to use a folio
cramfs: Convert cramfs_read_folio to use a folio
coda: Convert coda_symlink_filler() to use folio_end_read()
befs: Convert befs_symlink_read_folio() to use folio_end_read()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEF0AAKCRCRxhvAZXjc
oq0TAQDjfTLN75RwKQ34RIFtRun2q+OMfBQtSegtaccqazghyAD/QfmPuZDxB5DL
rsI/5k5O4VupIKrEdIaqvNxmkmDsSAc=
=bf7E
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Support passing NULL along AT_EMPTY_PATH for statx().
NULL paths with any flag value other than AT_EMPTY_PATH go the
usual route and end up with -EFAULT to retain compatibility (Rust
is abusing calls of the sort to detect availability of statx)
This avoids path lookup code, lockref management, memory allocation
and in case of NULL path userspace memory access (which can be
quite expensive with SMAP on x86_64)
- Don't block i_writecount during exec. Remove the
deny_write_access() mechanism for executables
- Relax open_by_handle_at() permissions in specific cases where we
can prove that the caller had sufficient privileges to open a file
- Switch timespec64 fields in struct inode to discrete integers
freeing up 4 bytes
Fixes:
- Fix false positive circular locking warning in hfsplus
- Initialize hfs_inode_info after hfs_alloc_inode() in hfs
- Avoid accidental overflows in vfs_fallocate()
- Don't interrupt fallocate with EINTR in tmpfs to avoid constantly
restarting shmem_fallocate()
- Add missing quote in comment in fs/readdir
Cleanups:
- Don't assign and test in an if statement in mqueue. Move the
assignment out of the if statement
- Reflow the logic in may_create_in_sticky()
- Remove the usage of the deprecated ida_simple_xx() API from procfs
- Reject FSCONFIG_CMD_CREATE_EXCL requets that depend on the new
mount api early
- Rename variables in copy_tree() to make it easier to understand
- Replace WARN(down_read_trylock, ...) abuse with proper asserts in
various places in the VFS
- Get rid of user_path_at_empty() and drop the empty argument from
getname_flags()
- Check for error while copying and no path in one branch in
getname_flags()
- Avoid redundant smp_mb() for THP handling in do_dentry_open()
- Rename parent_ino to d_parent_ino and make it use RCU
- Remove unused header include in fs/readdir
- Export in_group_capable() helper and switch f2fs and fuse over to
it instead of open-coding the logic in both places"
* tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
ipc: mqueue: remove assignment from IS_ERR argument
vfs: rename parent_ino to d_parent_ino and make it use RCU
vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)
stat: use vfs_empty_path() helper
fs: new helper vfs_empty_path()
fs: reflow may_create_in_sticky()
vfs: remove redundant smp_mb for thp handling in do_dentry_open
fuse: Use in_group_or_capable() helper
f2fs: Use in_group_or_capable() helper
fs: Export in_group_or_capable()
vfs: reorder checks in may_create_in_sticky
hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode()
proc: Remove usage of the deprecated ida_simple_xx() API
hfsplus: fix to avoid false alarm of circular locking
Improve readability of copy_tree
vfs: shave a branch in getname_flags
vfs: retire user_path_at_empty and drop empty arg from getname_flags
vfs: stop using user_path_at_empty in do_readlinkat
tmpfs: don't interrupt fallocate with EINTR
fs: don't block i_writecount during exec
...
This is the last - for now - of the "look, we generated some
questionable code for basic pathname lookup operations" set of
branches.
This is mainly just re-organizing the name hashing code in
link_path_walk(), mostly by improving the calling conventions to
the inlined helper functions and moving some of the code around
to allow for more straightforward code generation.
The profiles - and the generated code - look much more palatable
to me now.
* link_path_walk:
vfs: link_path_walk: move more of the name hashing into hash_name()
vfs: link_path_walk: improve may_lookup() code generation
vfs: link_path_walk: do '.' and '..' detection while hashing
vfs: link_path_walk: clarify and improve name hashing interface
vfs: link_path_walk: simplify name hash flow
Merge runtime constants infrastructure with implementations for x86 and
arm64.
This is one of four branches that came out of me looking at profiles of
my kernel build filesystem load on my 128-core Altra arm64 system, where
pathname walking and the user copies (particularly strncpy_from_user()
for fetching the pathname from user space) is very hot.
This is a very specialized "instruction alternatives" model where the
dentry hash pointer and hash count will be constants for the lifetime of
the kernel, but the allocation are not static but done early during the
kernel boot. In order to avoid the pointer load and dynamic shift, we
just rewrite the constants in the instructions in place.
We can't use the "generic" alternative instructions infrastructure,
because different architectures do it very differently, and it's
actually simpler to just have very specific helpers, with a fallback to
the generic ("old") model of just using variables for architectures that
do not implement the runtime constant patching infrastructure.
Link: https://lore.kernel.org/all/CAHk-=widPe38fUNjUOmX11ByDckaeEo9tN4Eiyke9u1SAtu9sA@mail.gmail.com/
* runtime-constants:
arm64: add 'runtime constant' support
runtime constants: add x86 architecture support
runtime constants: add default dummy infrastructure
vfs: dcache: move hashlen_hash() from callers into d_hash()
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmaSqPwACgkQiiy9cAdy
T1EdMAv/Q+qBEmCUybFAjkJelkt+FecWWWZ3L26TXjyAGZBlf7cl590Rr5jXRLw1
xPDdUt7rE0Zxpg0pK8L5QRgDjc7BwiuAIEJfxdI/gAHbEueElLGdqvFp0G1HSBvY
3lgkG5zz9uZUBemFlrxZ2Wsd4MiHBPsaBx5+TEPPGkRhWzd3LRU7fi7PGa6PUD3U
BChQED88EhWB7BfxOqctAYfUgOxqzqiaOe5KAATsWcKpJ3sqgYCHLiVn5vZQ7tYD
69HijShCHC8ng7KeXkW3XJf1knsDHlHsROzNQgX+pUqEZWcDsjGpJNKGtIO3IfeD
9uOy3U+VuPwaVnVZnr5+bSqaiZbOehvGa+3T/JOwJnRfwVP6Kb97/YiEJdVvFwiI
K0CSop3+cgBouqo9S+4j2mjosN6oCQcfTGxBXzMCIZwdawvkNAVxKg/7RpuDuRWJ
3QVVOKzmVOYE6X1RTsnBevcgjCg/t6upfD+m99a8JnZlZislyzxj9qKUcs1XZ8WJ
02TCRc3V
=v/5z
-----END PGP SIGNATURE-----
Merge tag '6.10-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
"Small fix, also for stable"
* tag '6.10-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix setting SecurityFlags to true
If you try to set /proc/fs/cifs/SecurityFlags to 1 it
will set them to CIFSSEC_MUST_NTLMV2 which no longer is
relevant (the less secure ones like lanman have been removed
from cifs.ko) and is also missing some flags (like for
signing and encryption) and can even cause mount to fail,
so change this to set it to Kerberos in this case.
Also change the description of the SecurityFlags to remove mention
of flags which are no longer supported.
Cc: stable@vger.kernel.org
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaRcQgACgkQxWXV+ddt
WDvAGxAAknJAiREp/AmzhSwkhr+nSnqex0t+VVgsOaMTu0BEHO0xhoXc3l0QuSwS
u2AIqmOYyzr/UQVXCuatBqAE+5T4njtYAYIWwE825yquAtHNyuok9+Sjhfvxrwgs
HmNAN4Vvl2Fwds7xbWE8ug18QlssuRTIX8hk7ZtS6xo49g0tsbRX9KlzIPpsULD3
BOZa+2NJwC1PGVeNPf3p06rfiUkKfmFYgdDybe2zJ17uwsRz1CFSsaEEB35ys1f0
xYOS4epfcie03EGyZmYctuNxatUkk/J/1lTH4Z9JHwvPBvLK1U97SyJ11Wz2VQC/
8ar8gUDRYtjWdf6vn6AWBM4MseaYm9LDMlPhbSfvpDcWiclGTE64IOP4gKKr3mCh
WzlNSIR9I+tYgrhvcsCEzd7lvrSVHa7clwfooYgkEx0wl5lgbN0llAdtJWG3eeLn
3stxje2FqqXsFNj5N9SrPy7f7t6xF2i8vwk4qh6EpRuT4yuatb+nWzDm9EuTT/Bc
P+zM1KFp7Blk7Zw/Tpw0O9qjt1whStY2xrqcMzg539WVo45MmuFEFzmGBRwZsH55
QPGLIjXPpt728AgMdhBFEG0DtWaiA3AOI/C5nYOtLu92aZVBmbaX7/d/GpJv3Vvd
Ihvr9s1c49YvTZsIS0T0tkq/7LXZi/SToRJDjhP5HCrRGf7A30Y=
=gtsF
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fix a regression in extent map shrinker behaviour.
In the past weeks we got reports from users that there are huge
latency spikes or freezes. This was bisected to newly added shrinker
of extent maps (it was added to fix a build up of the structures in
memory).
I'm assuming that the freezes would happen to many users after release
so I'd like to get it merged now so it's in 6.10. Although the diff
size is not small the changes are relatively straightforward, the
reporters verified the fixes and we did testing on our side.
The fixes:
- adjust behaviour under memory pressure and check lock or scheduling
conditions, bail out if needed
- synchronize tracking of the scanning progress so inode ranges are
not skipped or work duplicated
- do a delayed iput when scanning a root so evicting an inode does
not slow things down in case of lots of dirty data, also fix
lockdep warning, a deadlock could happen when writing the dirty
data would need to start a transaction"
* tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid races when tracking progress for extent map shrinking
btrfs: stop extent map shrinker if reschedule is needed
btrfs: use delayed iput during extent map shrinking
- revert the SLAB_ACCOUNT patch, something crazy is going on in memcg
and someone forgot to test
- minor fixes: missing rcu_read_lock(), scheduling while atomic (in an
emergency shutdown path)
- two lockdep fixes; these could have gone earlier, but were left to
bake awhile
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmaRRssACgkQE6szbY3K
bnZpVw/+KBili018zGFIZjaYg89tNjyPhfdjtTSnTBm+989Naw4GjdF8nR81sVfC
LPW3hbUE/uZ9d9++kmwWXe+c6QrCk8skbc9z/6Bs7LAeidranZWmLNg157TfqOkF
utWhXb8oo4X0wivb6V+slnk7G4fU+p+uYfd68BFMGhNMX62ldjkVqKwpaT2bsQ03
5igS8b2ss90Q+XLkpzHNDYl/yq8kFxUbaeDc4IKFV0aN5pq93pBOGDtZCi1dHeN+
O9rHYyXoSd26+C/gjCscD/ZIfmEq7IgWRZP/2157fbVzSx9VzWZtb2FIuVf0cx/F
z01+M8fZctJVAIFijvG33P5KwAiwyDzGZMN5S/U4CICvMXu/iwJrAPUOjJoII8Dl
09ex4X0cZZjvFsA+dDMfd2Vc8U6dgpo4+7j3/rZkH/9REJypnhv/o+89Dl+Lx9P6
UdQsqphQjz0ud4Qd5TOT+x/7n8JFlRLnzeIXf8U1qKKlkKCuhxBnDEtVUc/s7gBc
8ekLQSHgpy14fCpq0wgOYx4OFQrY4KQ+Ocpt3i9RdzX1+Nti4v1REgVvBmi4UHra
5itcQFGMEq0xQ0H9eZcwqqoUHuQuzCtaP12CbECHDjWfZDWKDaYHkG2jo0SK6P9e
8ISZakmdWsu0sqP4nuexKvcb4K1ov0JtidMlGOeB3KYrQ7LHQjA=
=vpux
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-07-12' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs fixes from Kent Overstreet:
- revert the SLAB_ACCOUNT patch, something crazy is going on in memcg
and someone forgot to test
- minor fixes: missing rcu_read_lock(), scheduling while atomic (in an
emergency shutdown path)
- two lockdep fixes; these could have gone earlier, but were left to
bake awhile
* tag 'bcachefs-2024-07-12' of https://evilpiepirate.org/git/bcachefs:
bcachefs: bch2_gc_btree() should not use btree_root_lock
bcachefs: Set PF_MEMALLOC_NOFS when trans->locked
bcachefs; Use trans_unlock_long() when waiting on allocator
Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT"
bcachefs: fix scheduling while atomic in break_cycle()
bcachefs: Fix RCU splat
btree_root_lock is for the root keys in btree_root, not the pointers to
the nodes themselves; this fixes a lock ordering issue between
btree_root_lock and btree node locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This reverts commit 86d81ec5f5.
This wasn't tested with memcg enabled, it immediately hits a null ptr
deref in list_lru_add().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZo9dYAAKCRCRxhvAZXjc
omYQAP4wELNW5StzljRReC6s/Kzu6IANJQlfFpuGnPIl23iRmwD+Pq433xQqSy5f
uonMBEdxqbOrJM7A6KeHKCyuAKYpNg0=
=zg3n
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"cachefiles:
- Export an existing and add a new cachefile helper to be used in
filesystems to fix reference count bugs
- Use the newly added fscache_ty_get_volume() helper to get a
reference count on an fscache_volume to handle volumes that are
about to be removed cleanly
- After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN
wait for all ongoing cookie lookups to complete and for the object
count to reach zero
- Propagate errors from vfs_getxattr() to avoid an infinite loop in
cachefiles_check_volume_xattr() because it keeps seeing ESTALE
- Don't send new requests when an object is dropped by raising
CACHEFILES_ONDEMAND_OJBSTATE_DROPPING
- Cancel all requests for an object that is about to be dropped
- Wait for the ondemand_boject_worker to finish before dropping a
cachefiles object to prevent use-after-free
- Use cyclic allocation for message ids to better handle id recycling
- Add missing lock protection when iterating through the xarray when
polling
netfs:
- Use standard logging helpers for debug logging
VFS:
- Fix potential use-after-free in file locks during
trace_posix_lock_inode(). The tracepoint could fire while another
task raced it and freed the lock that was requested to be traced
- Only increment the nr_dentry_negative counter for dentries that are
present on the superblock LRU. Currently, DCACHE_LRU_LIST list is
used to detect this case. However, the flag is also raised in
combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru
is used. So checking only DCACHE_LRU_LIST will lead to wrong
nr_dentry_negative count. Fix the check to not count dentries that
are on a shrink related list
Misc:
- hfsplus: fix an uninitialized value issue in copy_name
- minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even
though we switched it to kmap_local_page() a while ago"
* tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
minixfs: Fix minixfs_rename with HIGHMEM
hfsplus: fix uninit-value in copy_name
vfs: don't mod negative dentry count when on shrinker list
filelock: fix potential use-after-free in posix_lock_inode
cachefiles: add missing lock protection when polling
cachefiles: cyclic allocation of msg_id to avoid reuse
cachefiles: wait for ondemand_object_worker to finish when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: stop sending new request when dropping object
cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()
netfs: Switch debug logging to pr_debug()
We store the progress (root and inode numbers) of the extent map shrinker
in fs_info without any synchronization but we can have multiple tasks
calling into the shrinker during memory allocations when there's enough
memory pressure for example.
This can result in a task A reading fs_info->extent_map_shrinker_last_ino
after another task B updates it, and task A reading
fs_info->extent_map_shrinker_last_root before task B updates it, making
task A see an odd state that isn't necessarily harmful but may make it
skip certain inode ranges or do more work than necessary by going over
the same inodes again. These unprotected accesses would also trigger
warnings from tools like KCSAN.
So add a lock to protect access to these progress fields.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent map shrinker can be called in a variety of contexts where we
are under memory pressure, and of them is when a task is trying to
allocate memory. For this reason the shrinker is typically called with a
value of struct shrink_control::nr_to_scan that is much smaller than what
we return in the nr_cached_objects callback of struct super_operations
(fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does
not take a long time and cause high latencies. However we can still take
a lot of time in the shrinker even for a limited amount of nr_to_scan:
1) When traversing the red black tree that tracks open inodes in a root,
as for example with millions of open inodes we get a deep tree which
takes time searching for an inode;
2) Iterating over the extent map tree, which is a red black tree, of an
inode when doing the rb_next() calls and when removing an extent map
from the tree, since often that requires rebalancing the red black
tree;
3) When trying to write lock an inode's extent map tree we may wait for a
significant amount of time, because there's either another task about
to do IO and searching for an extent map in the tree or inserting an
extent map in the tree, and we can have thousands or even millions of
extent maps for an inode. Furthermore, there can be concurrent calls
to the shrinker so the lock might be busy simply because there is
already another task shrinking extent maps for the same inode;
4) We often reschedule if we need to, which further increases latency.
So improve on this by stopping the extent map shrinking code whenever we
need to reschedule and make it skip an inode if we can't immediately lock
its extent map tree.
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No identifiable theme here - all are singleton patches, 19 are for MM.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZo7tTQAKCRDdBJ7gKXxA
jvhZAP977PnAwQH5khIS3xJxZrqx/+Tho7UPZzQPvHJPRpHorAD/TZfDazGtlPMD
uLPEVslh18rks/w+kddLrnlBnkpUMwY=
=vhts
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-07-10-13-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"21 hotfixes, 15 of which are cc:stable.
No identifiable theme here - all are singleton patches, 19 are for MM"
* tag 'mm-hotfixes-stable-2024-07-10-13-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()
filemap: replace pte_offset_map() with pte_offset_map_nolock()
arch/xtensa: always_inline get_current() and current_thread_info()
sched.h: always_inline alloc_tag_{save|restore} to fix modpost warnings
MAINTAINERS: mailmap: update Lorenzo Stoakes's email address
mm: fix crashes from deferred split racing folio migration
lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 compat
mm: gup: stop abusing try_grab_folio
nilfs2: fix kernel bug on rename operation of broken directory
mm/hugetlb_vmemmap: fix race with speculative PFN walkers
cachestat: do not flush stats in recency check
mm/shmem: disable PMD-sized page cache if needed
mm/filemap: skip to create PMD-sized page cache if needed
mm/readahead: limit page cache size in page_cache_ra_order()
mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray
mm/damon/core: merge regions aggressively when max_nr_regions is unmet
Fix userfaultfd_api to return EINVAL as expected
mm: vmalloc: check if a hash-index is in cpu_possible_mask
mm: prevent derefencing NULL ptr in pfn_section_valid()
...
- Switch some asserts to WARN()
- Fix a few "transaction not locked" asserts in the data read retry
paths and backpointers gc
- Fix a race that would cause the journal to get stuck on a flush commit
- Add missing fsck checks for the fragmentation LRU
- The usual assorted ssorted syzbot fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmaOuRwACgkQE6szbY3K
bnaCHhAAi9VRqws+zx3fSpe2OMwWqAEWA84QgIFJccy+I86d7dXkqG389gFqJwMG
9S3BUHP1WooJmpsTRhK5cNtxZuKKOajXlxUYz3onsF7O/U3dHFY5GU7yIIjXS/0o
q7+iryWAJ4MmlOrAJhgPMH/WlhbSVsjANUN0n/NhlOWHccFGHmpdMTb6aYzb+lfL
iZOONKmEOR65gLzZYlO323OB2Tv00iEbOZAtxk68BLZYX+WON/j1T1A8gK4G0XSX
8wcYpXNxGGkCufjBfAbXf4mcp/WygQq0Wj3bdVMFkZ+AwSJDcfGeK1H7f6tJ9e4n
lqfWL4tgWIckS+41sA96B5cYry9TMDdhu3IeFaAm0ZrF55JT1JySGE1GNA+mo6xA
mkMAqhG7rwYh6nSJfWX0Ie+zJ9TFbmi05ZbI7jaTuQjnJ5uvPpTuRfBDi+qSWmoi
+IBDAi9hZgCUNEsLRGDm7RDQo0dpbFo6jpArn1RHK4MO/HkTrqcKpTqiGnfwFAU4
PFxwq5G9+d38+M6YMX0tXdfQ+fdxroA6aIBJSsIpF18tPRBOBlQsM2GFP34uHbyk
L6HOzed2QpM5ExBmViX79F+obuDQ/gzXQszYvDKL4QTFNbx43gPWRDrGm8EQen6y
12EScamXbUWBSWnOqxscmeUsTdTKxLfw/F43JbE2fE7jSxc5tss=
=VGT8
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-07-10' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- Switch some asserts to WARN()
- Fix a few "transaction not locked" asserts in the data read retry
paths and backpointers gc
- Fix a race that would cause the journal to get stuck on a flush
commit
- Add missing fsck checks for the fragmentation LRU
- The usual assorted ssorted syzbot fixes
* tag 'bcachefs-2024-07-10' of https://evilpiepirate.org/git/bcachefs: (22 commits)
bcachefs: Add missing bch2_trans_begin()
bcachefs: Fix missing error check in journal_entry_btree_keys_validate()
bcachefs: Warn on attempting a move with no replicas
bcachefs: bch2_data_update_to_text()
bcachefs: Log mount failure error code
bcachefs: Fix undefined behaviour in eytzinger1_first()
bcachefs: Mark bch_inode_info as SLAB_ACCOUNT
bcachefs: Fix bch2_inode_insert() race path for tmpfiles
closures: fix closure_sync + closure debugging
bcachefs: Fix journal getting stuck on a flush commit
bcachefs: io clock: run timer fns under clock lock
bcachefs: Repair fragmentation_lru in alloc_write_key()
bcachefs: add check for missing fragmentation in check_alloc_to_lru_ref()
bcachefs: bch2_btree_write_buffer_maybe_flush()
bcachefs: Add missing printbuf_tabstops_reset() calls
bcachefs: Fix loop restart in bch2_btree_transactions_read()
bcachefs: Fix bch2_read_retry_nodecode()
bcachefs: Don't use the new_fs() bucket alloc path on an initialized fs
bcachefs: Fix shift greater than integer size
bcachefs: Change bch2_fs_journal_stop() BUG_ON() to warning
...
After commit 230e9fc286 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9 ("kmemcg:
account for certain kmem allocations to memcg")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
minixfs now uses kmap_local_page(), so we can't call kunmap() to
undo it. This one call was missed as part of the commit this fixes.
Fixes: 6628f69ee6 (minixfs: Use dir_put_page() in minix_unlink() and minix_rename())
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240709195841.1986374-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmaJwEYACgkQiiy9cAdy
T1EPaQwAufRbLgmhf0mXUhRukYFIWwAyPOvMEov9vr6uWAmIaqxb2ggmgxwolulS
oEheMyoE+nDRzUFnPv+QY/ihV66Eqq2A83oSW/JVc+WAhiyLG7hWKWdHr2IxEG87
IJA9oJVWoYBQVpINozibwN0qONr8AU6B0jIGZ7+MzU3e09ARLf6OltfXWjLZT68K
xK5fqcZErF+wawnk26u/FRmd81vD3zhRAIqGFIt7E62ngedTsWvqqn7Dx5MDI28a
KkgO8hudyhULGZk8qI/pN/8+vBFJlMdTWaWN9410ucpoQ+5G4M0quOsqzn5DxbWw
0lnBAgDvR1jwyU4cUj4Dgb0TnG/ABiuVQebz82LeIoisItSPenNyKc5FRfry/OFN
PJFvWoUvYGFXUtSkdmLwLeWppTVvpL8vxyk+OPx3URwheqCiaQHN/l3xSBqLIldw
4uPL+grt9zeKOvMvsBFfN+2eiUeC3foZkg4RKucs5aSPJtHra4w6zhvfsuJosNsW
XgIRM19F
=eUKV
-----END PGP SIGNATURE-----
Merge tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix access flags to address fuse incompatibility
- fix device type returned by get filesystem info
* tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: discard write access to the directory open
ksmbd: return FILE_DEVICE_DISK instead of super magic
This avoids having to return the length of the component entirely by
just doing all of the name processing in hash_name(). We can just
return the end of the path component, and a flag for the DOT and DOTDOT
cases.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of having separate calls to 'inode_permission()' depending on
whether we're in RCU lookup or not, just share the first call.
Note that the initial "conditional" on LOOKUP_RCU really turns into just
a "convert the LOOKUP_RCU bit in the nameidata into the MAY_NOT_BLOCK
bit in the argument", which is just a trivial bitwise and and shift
operation.
So the initial conditional goes away entirely, and then the likely case
is that it will succeed independently of us being in RCU lookup or not,
and the possible "we may need to fall out of RCU and redo it all" fixups
that are needed afterwards all go in the unlikely path.
[ This also marks 'nd' restrict, because that means that the compiler
can know that there is no other alias, and can cache the LOOKUP_RCU
value over the call to inode_permission(). ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The method we used was predicated on the assumption that the mount
immediately following the root mount of the mount namespace would be the
rootfs mount of the namespace. That's not always the case though. For
example:
ID PARENT ID
408 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/hosts /etc/hosts rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
409 414 0:61 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=64000k,uid=1000,gid=1000,inode64
410 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/.containerenv /run/.containerenv rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
411 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/hostname /etc/hostname rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
412 363 0:65 / / rw,relatime - overlay overlay rw,lowerdir=/home/user1/.local/share/containers/storage/overlay/l/JS65SUCGTPCP2EEBHLRP4UCFI5:/home/user1/.local/share/containers/storage/overlay/l/DLW22KVDWUNI4242D6SDJ5GKCL [...]
413 412 0:68 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
414 412 0:69 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,uid=1000,gid=1000,inode64
415 412 0:70 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
416 414 0:71 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=100004,mode=620,ptmxmode=666
417 414 0:67 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
418 415 0:27 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
419 414 0:6 /null /dev/null rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
420 414 0:6 /zero /dev/zero rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
422 414 0:6 /full /dev/full rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
423 414 0:6 /tty /dev/tty rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
430 414 0:6 /random /dev/random rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
431 414 0:6 /urandom /dev/urandom rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
433 413 0:72 / /proc/acpi ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
440 413 0:6 /null /proc/kcore ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
441 413 0:6 /null /proc/keys ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
442 413 0:6 /null /proc/timer_list ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
443 413 0:73 / /proc/scsi ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
444 415 0:74 / /sys/firmware ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
445 415 0:75 / /sys/dev/block ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
446 413 0:68 /bus /proc/bus ro,nosuid,nodev,noexec,relatime - proc proc rw
447 413 0:68 /fs /proc/fs ro,nosuid,nodev,noexec,relatime - proc proc rw
448 413 0:68 /irq /proc/irq ro,nosuid,nodev,noexec,relatime - proc proc rw
449 413 0:68 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
450 413 0:68 /sysrq-trigger /proc/sysrq-trigger ro,nosuid,nodev,noexec,relatime - proc proc rw
364 414 0:71 /0 /dev/console rw,relatime - devpts devpts rw,gid=100004,mode=620,ptmxmode=666
In this mount table the root mount of the mount namespace is the mount
with id 363 (It isn't visible because it's literally just what the
rootfs mount is mounted upon and usually it's just a copy of the real
rootfs).
The rootfs mount that's mounted on the root mount of the mount namespace
is the mount with id 412. But the mount namespace contains mounts that
were created before the rootfs mount and thus have earlier mount ids. So
the first call to listmnt_next() would return the mount with the mount
id 408 and not the rootfs mount.
So we need to find the actual rootfs mount mounted on the root mount of
the mount namespace. This logic is also present in mntns_install() where
vfs_path_lookup() is used. We can't use this though as we're holding the
namespace semaphore. We could look at the children of the root mount of
the mount namespace directly but that also seems a bit out of place
while we have the rbtree. So let's just iterate through the rbtree
starting from the root mount of the mount namespace and find the mount
whose parent is the root mount of the mount namespace. That mount will
usually appear very early in the rbtree and afaik there can only be one.
IOW, it would be very strange if we ended up with a root mount of a
mount namespace that has shadow mounts.
Fixes: 0a3deb1185 ("fs: Allow listmount() in foreign mount namespace") # mainline only
Signed-off-by: Christian Brauner <brauner@kernel.org>
The nr_dentry_negative counter is intended to only account negative
dentries that are present on the superblock LRU. Therefore, the LRU
add, remove and isolate helpers modify the counter based on whether
the dentry is negative, but the shrinker list related helpers do not
modify the counter, and the paths that change a dentry between
positive and negative only do so if DCACHE_LRU_LIST is set.
The problem with this is that a dentry on a shrinker list still has
DCACHE_LRU_LIST set to indicate ->d_lru is in use. The additional
DCACHE_SHRINK_LIST flag denotes whether the dentry is on LRU or a
shrink related list. Therefore if a relevant operation (i.e. unlink)
occurs while a dentry is present on a shrinker list, and the
associated codepath only checks for DCACHE_LRU_LIST, then it is
technically possible to modify the negative dentry count for a
dentry that is off the LRU. Since the shrinker list related helpers
do not modify the negative dentry count (because non-LRU dentries
should not be included in the count) when the dentry is ultimately
removed from the shrinker list, this can cause the negative dentry
count to become permanently inaccurate.
This problem can be reproduced via a heavy file create/unlink vs.
drop_caches workload. On an 80xcpu system, I start 80 tasks each
running a 1k file create/delete loop, and one task spinning on
drop_caches. After 10 minutes or so of runtime, the idle/clean cache
negative dentry count increases from somewhere in the range of 5-10
entries to several hundred (and increasingly grows beyond
nr_dentry_unused).
Tweak the logic in the paths that turn a dentry negative or positive
to filter out the case where the dentry is present on a shrink
related list. This allows the above workload to maintain an accurate
negative dentry count.
Fixes: af0c9af1b3 ("fs/dcache: Track & report number of negative dentries")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20240703121301.247680-1-bfoster@redhat.com
Acked-by: Ian Kent <ikent@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Light Hsieh reported a KASAN UAF warning in trace_posix_lock_inode().
The request pointer had been changed earlier to point to a lock entry
that was added to the inode's list. However, before the tracepoint could
fire, another task raced in and freed that lock.
Fix this by moving the tracepoint inside the spinlock, which should
ensure that this doesn't happen.
Fixes: 74f6f59126 ("locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock")
Link: https://lore.kernel.org/linux-fsdevel/724ffb0a2962e912ea62bb0515deadf39c325112.camel@kernel.org/
Reported-by: Light Hsieh (謝明燈) <Light.Hsieh@mediatek.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240702-filelock-6-10-v1-1-96e766aadc98@kernel.org
Reviewed-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
libaokun@huaweicloud.com <libaokun@huaweicloud.com> says:
This is the third version of this patch series, in which another patch set
is subsumed into this one to avoid confusing the two patch sets.
(https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)
We've been testing ondemand mode for cachefiles since January, and we're
almost done. We hit a lot of issues during the testing period, and this
patch series fixes some of the issues. The patches have passed internal
testing without regression.
The following is a brief overview of the patches, see the patches for
more details.
Patch 1-2: Add fscache_try_get_volume() helper function to avoid
fscache_volume use-after-free on cache withdrawal.
Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
concurrency causing cachefiles_volume use-after-free.
Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
endless loops.
Patch 5-7: A read request waiting for reopen could be closed maliciously
before the reopen worker is executing or waiting to be scheduled. So
ondemand_object_worker() may be called after the info and object and even
the cache have been freed and trigger use-after-free. So use
cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
reopen worker or wait for it to finish. Since it makes no sense to wait
for the daemon to complete the reopen request, to avoid this pointless
operation blocking cancel_work_sync(), Patch 1 avoids request generation
by the DROPPING state when the request has not been sent, and Patch 2
flushes the requests of the current object before cancel_work_sync().
Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
the daemon to cause hung.
Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing
use-after-free. This issue was triggered frequently in our tests, and we
found that anolis 5.10 had fixed it. So to avoid failing the test, this
patch is pushed upstream as well.
Baokun Li (7):
netfs, fscache: export fscache_put_volume() and add
fscache_try_get_volume()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: propagate errors from vfs_getxattr() to avoid infinite
loop
cachefiles: stop sending new request when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: cyclic allocation of msg_id to avoid reuse
Hou Tao (1):
cachefiles: wait for ondemand_object_worker to finish when dropping
object
Jingbo Xu (1):
cachefiles: add missing lock protection when polling
fs/cachefiles/cache.c | 45 ++++++++++++++++++++++++++++-
fs/cachefiles/daemon.c | 4 +--
fs/cachefiles/internal.h | 3 ++
fs/cachefiles/ondemand.c | 52 ++++++++++++++++++++++++++++++----
fs/cachefiles/volume.c | 1 -
fs/cachefiles/xattr.c | 5 +++-
fs/netfs/fscache_volume.c | 14 +++++++++
fs/netfs/internal.h | 2 --
include/linux/fscache-cache.h | 6 ++++
include/trace/events/fscache.h | 4 +++
10 files changed, 123 insertions(+), 13 deletions(-)
Link: https://lore.kernel.org/r/20240628062930.2467993-1-libaokun@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>