linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-09-21 20:22:13 +08:00

Author	SHA1	Message	Date
Yan Zhen	698e7d1680	proc: Fix typo in the comment The deference here confuses me. Maybe here want to say that because show_fd_locks() does not dereference the files pointer, using the stale value of the files pointer is safe. Correctly spelled comments make it easier for the reader to understand the code. replace 'deferences' with 'dereferences' in the comment & replace 'inialized' with 'initialized' in the comment. Signed-off-by: Yan Zhen <yanzhen@vivo.com> Link: https://lore.kernel.org/r/20240909063353.2246419-1-yanzhen@vivo.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-09 09:51:16 +02:00
Kienan Stewart	33d8525dc1	fs/pipe: Correct imprecise wording in comment The comment inaccurately describes what pipefs is - that is, a file system. Signed-off-by: Kienan Stewart <kstewart@efficios.com> Link: https://lore.kernel.org/r/20240904-pipe-correct_imprecise_wording-v1-1-2b07843472c2@efficios.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-05 11:39:20 +02:00
Christian Brauner	7063c229a8	Merge patch series "fhandle: expose u64 mount id to name_to_handle_at(2)" Aleksa Sarai <cyphar@cyphar.com> says: Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call, turning err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, AT_HANDLE_MNT_ID_UNIQUE); into int fd = openat(-EBADF, "/foo/bar/baz", O_PATH \| O_CLOEXEC); err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); mntid = statxbuf.stx_mnt_id; close(fd); Also, this series adds a patch to clarify how AT_* flag allocation should work going forwards. * patches from https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-0-10c2c4c16708@cyphar.com: fhandle: expose u64 mount id to name_to_handle_at(2) uapi: explain how per-syscall AT_* flags should be allocated Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-0-10c2c4c16708@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-05 11:39:20 +02:00
Aleksa Sarai	4356d575ef	fhandle: expose u64 mount id to name_to_handle_at(2) Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call, turning err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, AT_HANDLE_MNT_ID_UNIQUE); into int fd = openat(-EBADF, "/foo/bar/baz", O_PATH \| O_CLOEXEC); err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); mntid = statxbuf.stx_mnt_id; close(fd); Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-2-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-05 11:39:17 +02:00
Aleksa Sarai	b4fef22c2f	uapi: explain how per-syscall AT_* flags should be allocated Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned out policy, we ended up with the following situations: * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't. * New syscalls dealt with this in several ways: - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their separate own flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentually. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an at(2) syscall). - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from THREE* separate flag spaces: open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE\|O_CLOEXEC\|AT_RECURSIVE); (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.) It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_-like flags for both the source and target so separate flags are needed anyway (though it seems possible that renameat2(2) could grow _EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused). But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-05 11:37:45 +02:00
Michal Hocko	5c40e050e6	fs: drop GFP_NOFAIL mode from alloc_page_buffers There is only one called of alloc_page_buffers and it doesn't require __GFP_NOFAIL so drop this allocation mode. Signed-off-by: Michal Hocko <mhocko@suse.com> Link: https://lore.kernel.org/r/20240829130640.1397970-1-mhocko@kernel.org Acked-by: Song Liu <song@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 14:54:03 +02:00
Julian Sun	459ca85ae1	writeback: Refine the show_inode_state() macro definition Currently, the show_inode_state() macro only prints part of the state of inode->i_state. Let’s improve it to display more of its state. Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Link: https://lore.kernel.org/r/20240828081359.62429-1-sunjunchao2870@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:41 +02:00
Li Zhijian	7f7b850689	fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name It's observed that a crash occurs during hot-remove a memory device, in which user is accessing the hugetlb. See calltrace as following: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 14045 at arch/x86/mm/fault.c:1278 do_user_addr_fault+0x2a0/0x790 Modules linked in: kmem device_dax cxl_mem cxl_pmem cxl_port cxl_pci dax_hmem dax_pmem nd_pmem cxl_acpi nd_btt cxl_core crc32c_intel nvme virtiofs fuse nvme_core nfit libnvdimm dm_multipath scsi_dh_rdac scsi_dh_emc s mirror dm_region_hash dm_log dm_mod CPU: 1 PID: 14045 Comm: daxctl Not tainted 6.10.0-rc2-lizhijian+ #492 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:do_user_addr_fault+0x2a0/0x790 Code: 48 8b 00 a8 04 0f 84 b5 fe ff ff e9 1c ff ff ff 4c 89 e9 4c 89 e2 be 01 00 00 00 bf 02 00 00 00 e8 b5 ef 24 00 e9 42 fe ff ff <0f> 0b 48 83 c4 08 4c 89 ea 48 89 ee 4c 89 e7 5b 5d 41 5c 41 5d 41 RSP: 0000:ffffc90000a575f0 EFLAGS: 00010046 RAX: ffff88800c303600 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000001000 RSI: ffffffff82504162 RDI: ffffffff824b2c36 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000a57658 R13: 0000000000001000 R14: ffff88800bc2e040 R15: 0000000000000000 FS: 00007f51cb57d880(0000) GS:ffff88807fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000001000 CR3: 00000000072e2004 CR4: 00000000001706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn+0x8d/0x190 ? do_user_addr_fault+0x2a0/0x790 ? report_bug+0x1c3/0x1d0 ? handle_bug+0x3c/0x70 ? exc_invalid_op+0x14/0x70 ? asm_exc_invalid_op+0x16/0x20 ? do_user_addr_fault+0x2a0/0x790 ? exc_page_fault+0x31/0x200 exc_page_fault+0x68/0x200 <...snip...> BUG: unable to handle page fault for address: 0000000000001000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 800000000ad92067 P4D 800000000ad92067 PUD 7677067 PMD 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI ---[ end trace 0000000000000000 ]--- BUG: unable to handle page fault for address: 0000000000001000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 800000000ad92067 P4D 800000000ad92067 PUD 7677067 PMD 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 14045 Comm: daxctl Kdump: loaded Tainted: G W 6.10.0-rc2-lizhijian+ #492 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:dentry_name+0x1f4/0x440 <...snip...> ? dentry_name+0x2fa/0x440 vsnprintf+0x1f3/0x4f0 vprintk_store+0x23a/0x540 vprintk_emit+0x6d/0x330 _printk+0x58/0x80 dump_mapping+0x10b/0x1a0 ? __pfx_free_object_rcu+0x10/0x10 __dump_page+0x26b/0x3e0 ? vprintk_emit+0xe0/0x330 ? _printk+0x58/0x80 ? dump_page+0x17/0x50 dump_page+0x17/0x50 do_migrate_range+0x2f7/0x7f0 ? do_migrate_range+0x42/0x7f0 ? offline_pages+0x2f4/0x8c0 offline_pages+0x60a/0x8c0 memory_subsys_offline+0x9f/0x1c0 ? lockdep_hardirqs_on+0x77/0x100 ? _raw_spin_unlock_irqrestore+0x38/0x60 device_offline+0xe3/0x110 state_store+0x6e/0xc0 kernfs_fop_write_iter+0x143/0x200 vfs_write+0x39f/0x560 ksys_write+0x65/0xf0 do_syscall_64+0x62/0x130 Previously, some sanity check have been done in dump_mapping() before the print facility parsing '%pd' though, it's still possible to run into an invalid dentry.d_name.name. Since dump_mapping() only needs to dump the filename only, retrieve it by itself in a safer way to prevent an unnecessary crash. Note that either retrieving the filename with '%pd' or strncpy_from_kernel_nofault(), the filename could be unreliable. Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Link: https://lore.kernel.org/r/20240826055503.1522320-1-lizhijian@fujitsu.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:41 +02:00
Yu Jiaoliang	75e4c6bcb8	mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation Let the kememdup_array() take care about multiplication and possible overflows. v2:Add a new modification for reverse array. Signed-off-by: Yu Jiaoliang <yujiaoliang@vivo.com> Link: https://lore.kernel.org/r/20240823015542.3006262-1-yujiaoliang@vivo.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:41 +02:00
Baokun Li	3c58a9575e	netfs: Delete subtree of 'fs/netfs' when netfs module exits In netfs_init() or fscache_proc_init(), we create dentry under 'fs/netfs', but in netfs_exit(), we only delete the proc entry of 'fs/netfs' without deleting its subtree. This triggers the following WARNING: ================================================================== remove_proc_entry: removing non-empty directory 'fs/netfs', leaking at least 'requests' WARNING: CPU: 4 PID: 566 at fs/proc/generic.c:717 remove_proc_entry+0x160/0x1c0 Modules linked in: netfs(-) CPU: 4 UID: 0 PID: 566 Comm: rmmod Not tainted 6.11.0-rc3 #860 RIP: 0010:remove_proc_entry+0x160/0x1c0 Call Trace: <TASK> netfs_exit+0x12/0x620 [netfs] __do_sys_delete_module.isra.0+0x14c/0x2e0 do_syscall_64+0x4b/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== Therefore use remove_proc_subtree() instead of remove_proc_entry() to fix the above problem. Fixes: `7eb5b3e3a0` ("netfs, fscache: Move /proc/fs/fscache to /proc/fs/netfs and put in a symlink") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240826113404.3214786-1-libaokun@huaweicloud.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:41 +02:00
Hongbo Li	73ce1c9fce	fs: use LIST_HEAD() to simplify code list_head can be initialized automatically with LIST_HEAD() instead of calling INIT_LIST_HEAD(). Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20240821065456.2294216-1-lihongbo22@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:40 +02:00
Christian Brauner	41b734352c	Merge patch series "fs: add i_state helpers" Christian Brauner <brauner@kernel.org> says: I've recently looked for some free space in struct inode again because of some exec kerfuffle we had and while my idea didn't turn into anything I noticed that we often waste bytes when using wait bit operations. So I set out to switch that to another mechanism that would allow us to free up bytes. So this is an attempt to turn i_state from an unsigned long into an u32 using the individual bytes of i_state as addresses for the wait var event mechanism (Thanks to Linus for that idea.). This survives LTP, xfstests on various filesystems, and will-it-scale. * patches from https://lore.kernel.org/r/20240823-work-i_state-v3-1-5cd5fd207a57@kernel.org: inode: make i_state a u32 inode: port __I_LRU_ISOLATING to var event inode: port __I_NEW to var event inode: port __I_SYNC to var event fs: reorder i_state bits fs: add i_state helpers Link: https://lore.kernel.org/r/20240823-work-i_state-v3-1-5cd5fd207a57@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:40 +02:00
Christian Brauner	2b111edbe0	inode: make i_state a u32 Now that we use the wait var event mechanism make i_state a u32 and free up 4 bytes. This means we currently have two 4 byte holes in struct inode which we can pack. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-6-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:40 +02:00
Christian Brauner	f469e6e6f5	inode: port __I_LRU_ISOLATING to var event Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-5-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:40 +02:00
Julian Sun	88b1afbf0f	vfs: fix race between evice_inodes() and find_inode()&iput() Hi, all Recently I noticed a bug[1] in btrfs, after digged it into and I believe it'a race in vfs. Let's assume there's a inode (ie ino 261) with i_count 1 is called by iput(), and there's a concurrent thread calling generic_shutdown_super(). cpu0: cpu1: iput() // i_count is 1 ->spin_lock(inode) ->dec i_count to 0 ->iput_final() generic_shutdown_super() ->__inode_add_lru() ->evict_inodes() // cause some reason[2] ->if (atomic_read(inode->i_count)) continue; // return before // inode 261 passed the above check // list_lru_add_obj() // and then schedule out ->spin_unlock() // note here: the inode 261 // was still at sb list and hash list, // and I_FREEING\|I_WILL_FREE was not been set btrfs_iget() // after some function calls ->find_inode() // found the above inode 261 ->spin_lock(inode) // check I_FREEING\|I_WILL_FREE // and passed ->__iget() ->spin_unlock(inode) // schedule back ->spin_lock(inode) // check (I_NEW\|I_FREEING\|I_WILL_FREE) flags, // passed and set I_FREEING iput() ->spin_unlock(inode) ->spin_lock(inode) ->evict() // dec i_count to 0 ->iput_final() ->spin_unlock() ->evict() Now, we have two threads simultaneously evicting the same inode, which may trigger the BUG(inode->i_state & I_CLEAR) statement both within clear_inode() and iput(). To fix the bug, recheck the inode->i_count after holding i_lock. Because in the most scenarios, the first check is valid, and the overhead of spin_lock() can be reduced. If there is any misunderstanding, please let me know, thanks. [1]: https://lore.kernel.org/linux-btrfs/000000000000eabe1d0619c48986@google.com/ [2]: The reason might be 1. SB_ACTIVE was removed or 2. mapping_shrinkable() return false when I reproduced the bug. Reported-by: syzbot+67ba3c42bcbb4665d3ad@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=67ba3c42bcbb4665d3ad CC: stable@vger.kernel.org Fixes: `63997e98a3` ("split invalidate_inodes()") Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Link: https://lore.kernel.org/r/20240823130730.658881-1-sunjunchao2870@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:39 +02:00
Christian Brauner	0fe340a98b	inode: port __I_NEW to var event Port the __I_NEW mechanism to use the new var event mechanism. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-4-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:39 +02:00
Christian Brauner	532980cb1b	inode: port __I_SYNC to var event Port the __I_SYNC mechanism to use the new var event mechanism. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-3-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:39 +02:00
Christian Brauner	2ed634c96e	fs: reorder i_state bits so that we can use the first bits to derive unique addresses from i_state. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-2-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:39 +02:00
Christian Brauner	da18ecbf0f	fs: add i_state helpers The i_state member is an unsigned long so that it can be used with the wait bit infrastructure which expects unsigned long. This wastes 4 bytes which we're unlikely to ever use. Switch to using the var event wait mechanism using the address of the bit. Thanks to Linus for the address idea. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-1-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:39 +02:00
Eric Biggers	0ac3396ea4	MAINTAINERS: add the VFS git tree The VFS git tree is missing from MAINTAINERS. Add it. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20240820195109.38906-1-ebiggers@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:38 +02:00
Christian Brauner	71ff58ce34	fs: s/__u32/u32/ for s_fsnotify_mask The underscore variants are for uapi whereas the non-underscore variants are for in-kernel consumers. Link: https://lore.kernel.org/r/20240822-anwerben-nutzung-1cd6c82a565f@brauner Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:38 +02:00
Christian Brauner	029c3f27fe	fs: remove unused path_put_init() This helper has been unused for a while now. Link: https://lore.kernel.org/r/20240822-bewuchs-werktag-46672b3c0606@brauner Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:38 +02:00
Mateusz Guzik	57510c58b5	vfs: drop one lock trip in evict() Most commonly neither I_LRU_ISOLATING nor I_SYNC are set, but the stock kernel takes a back-to-back relock trip to check for them. It probably can be avoided altogether, but for now massage things back to just one lock acquire. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240813143626.1573445-1-mjguzik@gmail.com Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:38 +02:00
Marc Aurèle La France	3a987b88a4	debugfs show actual source in /proc/mounts After its conversion to the new mount API, debugfs displays "none" in /proc/mounts instead of the actual source. Fix this by recognising its "source" mount option. Signed-off-by: Marc Aurèle La France <tsi@tuyoix.net> Link: https://lore.kernel.org/r/e439fae2-01da-234b-75b9-2a7951671e27@tuyoix.net Fixes: `a20971c187` ("vfs: Convert debugfs to use the new mount API") Cc: stable@vger.kernel.org # 6.10.x: `49abee5991`: debugfs: Convert to new uid/gid option parsing helpers Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:37 +02:00
Christian Brauner	1c48d44146	inode: remove __I_DIO_WAKEUP Afaict, we can just rely on inode->i_dio_count for waiting instead of this awkward indirection through __I_DIO_WAKEUP. This survives LTP dio and xfstests dio tests. Link: https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:37 +02:00
Hongbo Li	3717bbcb59	doc: correcting the idmapping mount example In step 2, we obtain the kernel id `k1000`. So in next step (step 3), we should translate the `k1000` not `k21000`. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20240816063611.1961910-1-lihongbo22@huawei.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:37 +02:00
Hongbo Li	1aeb6defd1	fs: Use in_group_or_capable() helper to simplify the code Since in_group_or_capable has been exported, we can use it to simplify the code when check group and capable. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20240816063849.1989856-1-lihongbo22@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:37 +02:00
Mateusz Guzik	b381fbbccb	vfs: elide smp_mb in iversion handling in the common case According to bpftrace on these routines most calls result in cmpxchg, which already provides the same guarantee. In inode_maybe_inc_iversion elision is possible because even if the wrong value was read due to now missing smp_mb fence, the issue is going to correct itself after cmpxchg. If it appears cmpxchg wont be issued, the fence + reload are there bringing back previous behavior. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240815083310.3865-1-mjguzik@gmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:37 +02:00
Ian Kent	433f9d76a0	autofs: add per dentry expire timeout Add ability to set per-dentry mount expire timeout to autofs. There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount). To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Now I have a request to add the "nounmount" option so I need to add the per-dentry expire handling to the kernel implementation to do this. The implementation uses the trailing path component to identify the mount (and is also used as the autofs map key) which is passed in the autofs_dev_ioctl structure path field. The expire timeout is passed in autofs_dev_ioctl timeout field (well, of the timeout union). If the passed in timeout is equal to -1 the per-dentry timeout and flag are cleared providing for the "unmount" option. If the timeout is greater than or equal to 0 the timeout is set to the value and the flag is also set. If the dentry timeout is 0 the dentry will not expire by timeout which enables the implementation of the "nounmount" option for the specific mount. When the dentry timeout is greater than zero it allows for the implementation of the "utimeout=<seconds>" option. Signed-off-by: Ian Kent <raven@themaw.net> Link: https://lore.kernel.org/r/20240814090231.963520-1-raven@themaw.net Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:36 +02:00
Mateusz Guzik	122381a469	vfs: use RCU in ilookup A soft lockup in ilookup was reported when stress-testing a 512-way system [1] (see [2] for full context) and it was verified that not taking the lock shifts issues back to mm. [1] https://lore.kernel.org/linux-mm/56865e57-c250-44da-9713-cf1404595bcc@amd.com/ [2] https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/ Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240715071324.265879-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:36 +02:00
Christian Brauner	641bb4394f	fs: move FMODE_UNSIGNED_OFFSET to fop_flags This is another flag that is statically set and doesn't need to use up an FMODE_* bit. Move it to ->fop_flags and free up another FMODE_* bit. (1) mem_open() used from proc_mem_operations (2) adi_open() used from adi_fops (3) drm_open_helper(): (3.1) accel_open() used from DRM_ACCEL_FOPS (3.2) drm_open() used from (3.2.1) amdgpu_driver_kms_fops (3.2.2) psb_gem_fops (3.2.3) i915_driver_fops (3.2.4) nouveau_driver_fops (3.2.5) panthor_drm_driver_fops (3.2.6) radeon_driver_kms_fops (3.2.7) tegra_drm_fops (3.2.8) vmwgfx_driver_fops (3.2.9) xe_driver_fops (3.2.10) DRM_GEM_FOPS (3.2.11) DEFINE_DRM_GEM_DMA_FOPS (4) struct memdev sets fmode flags based on type of device opened. For devices using struct mem_fops unsigned offset is used. Mark all these file operations as FOP_UNSIGNED_OFFSET and add asserts into the open helper to ensure that the flag is always set. Link: https://lore.kernel.org/r/20240809-work-fop_unsigned-v1-1-658e054d893e@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:36 +02:00
Mateusz Guzik	8447d848e1	vfs: only read fops once in fops_get/put In do_dentry_open() the usage is: f->f_op = fops_get(inode->i_fop); In generated asm the compiler emits 2 reads from inode->i_fop instead of just one. This popped up due to false-sharing where loads from that offset end up bouncing a cacheline during parallel open. While this is going to be fixed, the spurious load does not need to be there. This makes do_dentry_open() go down from 1177 to 1154 bytes. fops_put() is patched to maintain some consistency. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240810064753.1211441-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:36 +02:00
Thorsten Blum	193b72792f	fs/select: Annotate struct poll_list with __counted_by() Add the __counted_by compiler attribute to the flexible array member entries to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Link: https://lore.kernel.org/r/20240808150023.72578-2-thorsten.blum@toblux.com Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:35 +02:00
Christian Brauner	0f93bb54a3	fs: rearrange general fastpath check now that O_CREAT uses it If we find a positive dentry we can now simply try and open it. All prelimiary checks are already done with or without O_CREAT. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:35 +02:00
Christian Brauner	d459c52ab3	fs: remove audit dummy context check Now that we audit later during lookup_open() we can remove the audit dummy context check. This simplifies things a lot. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:35 +02:00
Christian Brauner	4770d96a6d	fs: pull up trailing slashes check for O_CREAT Perform the check for trailing slashes right in the fastpath check and don't bother with any additional work. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:35 +02:00
Christian Brauner	c65d41c5a5	fs: move audit parent inode During O_CREAT we unconditionally audit the parent inode. This makes it difficult to support a fastpath for O_CREAT when the file already exists because we have to drop out of RCU lookup needlessly. We worked around this by checking whether audit was actually active but that's also suboptimal. Instead, move the audit of the parent inode down into lookup_open() at a point where it's mostly certain that the file needs to be created. This also reduced the inconsistency that currently exists: while audit on the parent is done independent of whether or no the file already existed an audit on the file is only performed if it has been created. By moving the audit down a bit we emit the audit a little later but it will allow us to simplify the fastpath for O_CREAT significantly. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:35 +02:00
Jeff Layton	e747e15156	fs: try an opportunistic lookup for O_CREAT opens too Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. I assume this was done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists. This patch rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether, and if auditing isn't enabled, it can stay in rcuwalk mode for the last step_into. One notable exception that is hopefully temporary: if we're doing an rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the dentry in that case is more expensive than taking the i_rwsem for now. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240807-openfast-v3-1-040d132d2559@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:34 +02:00
Martin Karsten	b9ca079dd6	eventpoll: Annotate data-race of busy_poll_usecs A struct eventpoll's busy_poll_usecs field can be modified via a user ioctl at any time. All reads of this field should be annotated with READ_ONCE. Fixes: `85455c795c` ("eventpoll: support busy poll per epoll instance") Cc: stable@vger.kernel.org Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://lore.kernel.org/r/20240806123301.167557-1-jdamato@fastly.com Reviewed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:34 +02:00
Joe Damato	4f98f380f4	eventpoll: Don't re-zero eventpoll fields Remove redundant and unnecessary code. ep_alloc uses kzalloc to create struct eventpoll, so there is no need to set fields to defaults of 0. This was accidentally introduced in commit `85455c795c` ("eventpoll: support busy poll per epoll instance") and expanded on in follow-up commits. Signed-off-by: Joe Damato <jdamato@fastly.com> Link: https://lore.kernel.org/r/20240807105231.179158-1-jdamato@fastly.com Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:34 +02:00
Xiaxi Shen	28c7658b2c	Fix spelling and gramatical errors Fixed 3 typos in design.rst Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com> Link: https://lore.kernel.org/r/20240807070536.14536-1-shenxiaxi26@gmail.com Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:34 +02:00
Mateusz Guzik	087adb4f0f	vfs: dodge smp_mb in break_lease and break_deleg in the common case These inlines show up in the fast path (e.g., in do_dentry_open()) and induce said full barrier regarding i_flctx access when in most cases the pointer is NULL. The pointer can be safely checked before issuing the barrier, dodging it in most cases as a result. It is plausible the consume fence would be sufficient, but I don't want to go audit all callers regarding what they before calling here. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240806172846.886570-1-mjguzik@gmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:33 +02:00
Joel Savitz	215ab0d8af	file: remove outdated comment after close_fd() Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: linux-fsdevel@vger.kernel.org The comment on EXPORT_SYMBOL(close_fd) was added in commit `2ca2a09d62` ("fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()"), before commit `8760c909f5` ("file: Rename __close_fd to close_fd and remove the files parameter") gave the function its current name, however commit `1572bfdf21` ("file: Replace ksys_close with close_fd") removes the referenced caller entirely, obsoleting this comment. Signed-off-by: Joel Savitz <jsavitz@redhat.com> Link: https://lore.kernel.org/r/20240803025455.239276-1-jsavitz@redhat.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:33 +02:00
Yuesong Li	c5ae8e5e5a	fs/namespace.c: Fix typo in comment replace 'permanetly' with 'permanently' in the comment & replace 'propogated' with 'propagated' in the comment Signed-off-by: Yuesong Li <liyuesong@vivo.com> Link: https://lore.kernel.org/r/20240806034710.2807788-1-liyuesong@vivo.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:33 +02:00
Mateusz Guzik	0d196e7589	exec: don't WARN for racy path_noexec check Both i_mode and noexec checks wrapped in WARN_ON stem from an artifact of the previous implementation. They used to legitimately check for the condition, but that got moved up in two commits: `633fb6ac39` ("exec: move S_ISREG() check earlier") `0fd338b2d2` ("exec: move path_noexec() check earlier") Instead of being removed said checks are WARN_ON'ed instead, which has some debug value. However, the spurious path_noexec check is racy, resulting in unwarranted warnings should someone race with setting the noexec flag. One can note there is more to perm-checking whether execve is allowed and none of the conditions are guaranteed to still hold after they were tested for. Additionally this does not validate whether the code path did any perm checking to begin with -- it will pass if the inode happens to be regular. Keep the redundant path_noexec() check even though it's mindless nonsense checking for guarantee that isn't given so drop the WARN. Reword the commentary and do small tidy ups while here. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240805131721.765484-1-mjguzik@gmail.com [brauner: keep redundant path_noexec() check] Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:33 +02:00
Jeff Layton	46460c1d42	fs: add a kerneldoc header over lookup_fast The lookup_fast helper in fs/namei.c has some subtlety in how dentries are returned. Document them. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240802-openfast-v1-2-a1cff2a33063@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:33 +02:00
Jeff Layton	45fab40d34	fs: remove comment about d_rcu_to_refcount This function no longer exists. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240802-openfast-v1-1-a1cff2a33063@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:32 +02:00
Yue Haibing	2e91f69afa	fs: mounts: Remove unused declaration mnt_cursor_del() Commit `2eea9ce431` ("mounts: keep list of mounts in an rbtree") removed the implementation but leave declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20240803115000.589872-1-yuehaibing@huawei.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 08:22:32 +02:00
Wang Long	c01a5d89e5	percpu-rwsem: remove the unused parameter 'read' In the function percpu_rwsem_release, the parameter `read` is unused, so remove it. Signed-off-by: Wang Long <w@laoqinren.net> Link: https://lore.kernel.org/r/20240802091901.2546797-1-w@laoqinren.net Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-19 13:45:03 +02:00
Aleksa Sarai	66e5cfee62	coda: use param->file for FSCONFIG_SET_FD While the old code did support FSCONFIG_SET_FD, there's no need to re-get the file the fs_context infrastructure already grabbed for us. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240731-fsconfig-fsparam_fd-fixes-v2-2-e7c472224417@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-19 13:45:03 +02:00

1 2 3 4 5 ...

1295561 Commits