32 bits won't overflow any time soon, but size_t is the correct type for
counting objects in memory.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix the following compilation error:
```
fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’:
fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement
508 | unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices);
```
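For reference, a minimal sketch of the C rule behind this error (not the actual bcachefs fix; the function and variable names below are made up): before C23, a label must be followed by a statement, and a declaration on its own is not one, so an empty statement (or braces) after the label avoids the error.
```
/*
 * Not the actual bcachefs change; names are made up.  Before C23 a label
 * must be followed by a statement, and a declaration is not a statement.
 */
unsigned int example_nr_devices(unsigned int dev_idx, unsigned int sb_nr_devices)
{
        if (sb_nr_devices > dev_idx)
                return sb_nr_devices;
        goto grow;
grow:
        ;       /* empty statement (or braces) keeps the compiler happy */
        unsigned int nr_devices = dev_idx + 1 > sb_nr_devices
                                ? dev_idx + 1 : sb_nr_devices;
        return nr_devices;
}
```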
Fixes: a7d364a133 ("bcachefs: bch2_sb_member_alloc()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds mount options for specifying recovery passes to run, or
exclude; the immediate need for this is that backpointers fsck is having
trouble completing, so we need a way to skip it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When freeing in a shrinker callback, we need to notify memory reclaim,
so it knows forward progress has been made.
Normally this is done in e.g. slab code, but we're not freeing through
slab - or rather we are, but these allocations are big, and use the
kmalloc_large() path.
This is really a bug in the slub code, but we're working around it here
for now.
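A rough sketch of the shape this takes, assuming a hypothetical cache of large buffers (none of the names below are bcachefs code): the shrinker's .scan_objects callback frees buffers and returns how many it freed, and because these buffers come from the kmalloc_large() path, the reclaim accounting described above has to happen in the callback itself.
```
#include <linux/list.h>
#include <linux/shrinker.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Hypothetical cache of large buffers; none of this is bcachefs code. */
struct big_buf {
        struct list_head list;
        /* large payload follows; allocations this size use kmalloc_large() */
};

static LIST_HEAD(big_buf_freeable);
static DEFINE_SPINLOCK(big_buf_lock);
static unsigned long big_buf_nr;

static unsigned long big_buf_count(struct shrinker *sh, struct shrink_control *sc)
{
        return big_buf_nr ?: SHRINK_EMPTY;
}

static unsigned long big_buf_scan(struct shrinker *sh, struct shrink_control *sc)
{
        unsigned long freed = 0;

        spin_lock(&big_buf_lock);
        while (freed < sc->nr_to_scan && !list_empty(&big_buf_freeable)) {
                struct big_buf *b = list_first_entry(&big_buf_freeable,
                                                     struct big_buf, list);

                list_del(&b->list);
                big_buf_nr--;
                spin_unlock(&big_buf_lock);
                /*
                 * These buffers came from the kmalloc_large() path, so slab
                 * does not report the freed pages to reclaim for us; the
                 * workaround described above adds that accounting here.
                 */
                kfree(b);
                freed++;
                spin_lock(&big_buf_lock);
        }
        spin_unlock(&big_buf_lock);

        /* Returning the count freed is how reclaim sees forward progress. */
        return freed;
}
```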
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed for overlayfs, which is used by container managers.
Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this was an oversight: rebalance is moving data to a specific device, so
we don't want it falling back to the full filesystem
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't
allocate from the full filesystem - but we don't want spurious
allocation failures due to open buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
like the previous patch, kill use of bare arrays; the encryption code
likes to work in big batches, so this is a small performance
improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Quite a lot of nilfs2 work this time around.
Notable patch series in this pull request are:
"mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64() to
provide (much) more accurate results. The current implementation was
causing Uwe some issues in the PWM drivers.
"xz: Updates to license, filters, and compression options" from Lasse
Collin. Miscellaneous maintenance and minor feature work to the xz
decompressor.
"Fix some GDB command error and add some GDB commands" from Kuan-Ying Lee.
Fixes and enhancements to the gdb scripts.
"treewide: add missing MODULE_DESCRIPTION() macros" from Jeff Johnson.
Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of warnings about this.
"nilfs2: add support for some common ioctls" from Ryusuke Konishi. Adds
various commonly-available ioctls to nilfs2.
"This series fixes a number of formatting issues in kernel doc comments"
from Ryusuke Konishi does that.
"nilfs2: prevent unexpected ENOENT propagation" from Ryusuke Konishi. Fix
issues where -ENOENT was being unintentionally and inappropriately
returned to userspace.
"nilfs2: assorted cleanups" from Huang Xiaojia.
"nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
Konishi fixes some issues which can occur on corrupted nilfs2 filesystems.
"scripts/decode_stacktrace.sh: improve error reporting and usability" from
Luca Ceresoli does those things.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu7dpAAKCRDdBJ7gKXxA
jsPqAPwMDEZyKlfSw7QioEHNHDkmkbP7VYCYR0CbUnppbztwpAD8D37aVbWQ+UzM
3nnOq3W2Pc2o/20zqi8Upf1mnvUrygQ=
=/NWE
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Many singleton patches - please see the various changelogs for
details.
Quite a lot of nilfs2 work this time around.
Notable patch series in this pull request are:
- "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64()
to provide (much) more accurate results. The current implementation
was causing Uwe some issues in the PWM drivers.
- "xz: Updates to license, filters, and compression options" from
Lasse Collin. Miscellaneous maintenance and minor feature work to
the xz decompressor.
- "Fix some GDB command error and add some GDB commands" from
Kuan-Ying Lee. Fixes and enhancements to the gdb scripts.
- "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff
Johnson. Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of
warnings about this.
- "nilfs2: add support for some common ioctls" from Ryusuke Konishi.
Adds various commonly-available ioctls to nilfs2.
- "This series fixes a number of formatting issues in kernel doc
comments" from Ryusuke Konishi does that.
- "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke
Konishi. Fix issues where -ENOENT was being unintentionally and
inappropriately returned to userspace.
- "nilfs2: assorted cleanups" from Huang Xiaojia.
- "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
Konishi fixes some issues which can occur on corrupted nilfs2
filesystems.
- "scripts/decode_stacktrace.sh: improve error reporting and
usability" from Luca Ceresoli does those things"
* tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (103 commits)
list: test: increase coverage of list_test_list_replace*()
list: test: fix tests for list_cut_position()
proc: use __auto_type more
treewide: correct the typo 'retun'
ocfs2: cleanup return value and mlog in ocfs2_global_read_info()
nilfs2: remove duplicate 'unlikely()' usage
nilfs2: fix potential oob read in nilfs_btree_check_delete()
nilfs2: determine empty node blocks as corrupted
nilfs2: fix potential null-ptr-deref in nilfs_btree_insert()
user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation
tools/mm: rm thp_swap_allocator_test when make clean
squashfs: fix percpu address space issues in decompressor_multi_percpu.c
lib: glob.c: added null check for character class
nilfs2: refactor nilfs_segctor_thread()
nilfs2: use kthread_create and kthread_stop for the log writer thread
nilfs2: remove sc_timer_task
nilfs2: do not repair reserved inode bitmap in nilfs_new_inode()
nilfs2: eliminate the shared counter and spinlock for i_generation
nilfs2: separate inode type information from i_state field
nilfs2: use the BITS_PER_LONG macro
...
Along with the usual shower of singleton patches, notable patch series in
this pull request are:
"Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
"Some cleanups for shmem" from Baolin Wang. No functional changes - mode
code reuse, better function naming, logic simplifications.
"mm: some small page fault cleanups" from Josef Bacik. No functional
changes - code cleanups only.
"Various memory tiering fixes" from Zi Yan. A small fix and a little
cleanup.
"mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
"Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This
is a feature, it adds new fields to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at all
used 16k. Useful for some system tuning things, but particularly useful
for "the dynamic kernel stack project".
"kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
Teaches kmemleak to detect leakage of percpu memory.
"mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
"mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work
correctly by design rather than by accident.
"mm: remove arch_make_page_accessible()" from David Hildenbrand. Some
folio conversions which make arch_make_page_accessible() unneeded.
"mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel.
Cleans up and fixes our handling of the resetting of the cgroup/process
peak-memory-use detector.
"Make core VMA operations internal and testable" from Lorenzo Stoakes.
Rationalization and encapsulation of the VMA manipulation APIs. With a
view to better enable testing of the VMA functions, even from a
userspace-only harness.
"mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in
the zswap global shrinker, resulting in improved performance.
"mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in
some missing info in /proc/zoneinfo.
"mm: replace follow_page() by folio_walk" from David Hildenbrand. Code
cleanups and rationalizations (conversion to folio_walk()) resulting in
the removal of follow_page().
"improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some
tuning to improve zswap's dynamic shrinker. Significant reductions in
swapin and improvements in performance are shown.
"mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
Improvements to the new unaccepted memory feature.
"mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX
PUDs. This was missing, although nobody seems to have noticed yet.
"Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
Cleanups and modest performance improvements for the maple tree library
code.
"memcg: further decouple v1 code from v2" from Shakeel Butt. Move more
cgroup v1 remnants away from the v2 memcg code.
"memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds
various warnings telling users that memcg v1 features are deprecated.
"mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
Greatly improves the success rate of the mTHP swap allocation.
"mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate
per-arch implementations of numa_memblk code into generic code.
"mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
"support large folio swap-out and swap-in for shmem" from Baolin Wang.
With this series we no longer split shmem large folios into single-page
folios when swapping out shmem.
"mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance
improvements and code reductions for gigantic folios.
"support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
"mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
"Increase the number of bits available in page_type" from Matthew Wilcox.
Increases the number of bits available in page_type!
"Simplify the page flags a little" from Matthew Wilcox. Many legacy page
flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
"mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An
optimization which permits us to avoid writing/reading zero-filled zswap
pages to backing store.
"Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window
which occurs when a MAP_FIXED operation is occurring during an unrelated
vma tree walk.
"mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the
vma_merge() functionality, making it cleaner, more testable and better
tested.
"misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor
fixups of DAMON selftests and kunit tests.
"mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code
cleanups and folio conversions.
"Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups
for shmem controls and stats.
"mm: count the number of anonymous THPs per size" from Barry Song. Expose
additional anon THP stats to userspace for improved tuning.
"mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
conversions and removal of now-unused page-based APIs.
"replace per-quota region priorities histogram buffer with per-context
one" from SeongJae Park. DAMON histogram rationalization.
"Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
Park. DAMON documentation updates.
"mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
related doc and warn" from Jason Wang: fixes usage of page allocator
__GFP_NOFAIL and GFP_ATOMIC flags.
"mm: split underused THPs" from Yu Zhao. Improve THP=always policy - this
was overprovisioning THPs in sparsely accessed memory areas.
"zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add
support for zram run-time compression algorithm tuning.
"mm: Care about shadow stack guard gap when getting an unmapped area" from
Mark Brown. Fix up the various arch_get_unmapped_area() implementations
to better respect guard areas.
"Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of
mem_cgroup_iter() and various code cleanups.
"mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
"resource: Fix region_intersects() vs add_memory_driver_managed()" from
Huang Ying. Fix a bug in region_intersects() for systems with CXL memory.
"mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a
couple more code paths to correctly recover from the encountering of
poisoned memory.
"mm: enable large folios swap-in support" from Barry Song. Support the
swapin of mTHP memory into appropriately-sized folios, rather than into
single-page folios.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
=s0T+
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Along with the usual shower of singleton patches, notable patch series
in this pull request are:
- "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
- "Some cleanups for shmem" from Baolin Wang. No functional changes -
more code reuse, better function naming, logic simplifications.
- "mm: some small page fault cleanups" from Josef Bacik. No
functional changes - code cleanups only.
- "Various memory tiering fixes" from Zi Yan. A small fix and a
little cleanup.
- "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
- "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
Butt. This is a feature, it adds new fields to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at
all used 16k. Useful for some system tuning things, but
particularly useful for "the dynamic kernel stack project".
- "kmemleak: support for percpu memory leak detect" from Pavel
Tikhomirov. Teaches kmemleak to detect leakage of percpu memory.
- "mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
- "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
David Hildenbrand. Improves PTE/PMD splitlock detection, makes
powerpc/8xx work correctly by design rather than by accident.
- "mm: remove arch_make_page_accessible()" from David Hildenbrand.
Some folio conversions which make arch_make_page_accessible()
unneeded.
- "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
Finkel. Cleans up and fixes our handling of the resetting of the
cgroup/process peak-memory-use detector.
- "Make core VMA operations internal and testable" from Lorenzo
Stoakes. Rationalization and encapsulation of the VMA manipulation
APIs. With a view to better enable testing of the VMA functions,
even from a userspace-only harness.
- "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
issues in the zswap global shrinker, resulting in improved
performance.
- "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
in some missing info in /proc/zoneinfo.
- "mm: replace follow_page() by folio_walk" from David Hildenbrand.
Code cleanups and rationalizations (conversion to folio_walk())
resulting in the removal of follow_page().
- "improving dynamic zswap shrinker protection scheme" from Nhat
Pham. Some tuning to improve zswap's dynamic shrinker. Significant
reductions in swapin and improvements in performance are shown.
- "mm: Fix several issues with unaccepted memory" from Kirill
Shutemov. Improvements to the new unaccepted memory feature.
- "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
DAX PUDs. This was missing, although nobody seems to have noticed
yet.
- "Introduce a store type enum for the Maple tree" from Sidhartha
Kumar. Cleanups and modest performance improvements for the maple
tree library code.
- "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
more cgroup v1 remnants away from the v2 memcg code.
- "memcg: initiate deprecation of v1 features" from Shakeel Butt.
Adds various warnings telling users that memcg v1 features are
deprecated.
- "mm: swap: mTHP swap allocator base on swap cluster order" from
Chris Li. Greatly improves the success rate of the mTHP swap
allocation.
- "mm: introduce numa_memblks" from Mike Rapoport. Moves various
disparate per-arch implementations of numa_memblk code into generic
code.
- "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
- "support large folio swap-out and swap-in for shmem" from Baolin
Wang. With this series we no longer split shmem large folios into
single-page folios when swapping out shmem.
- "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
performance improvements and code reductions for gigantic folios.
- "support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
- "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
- "Increase the number of bits available in page_type" from Matthew
Wilcox. Increases the number of bits available in page_type!
- "Simplify the page flags a little" from Matthew Wilcox. Many legacy
page flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
- "mm: store zero pages to be swapped out in a bitmap" from Usama
Arif. An optimization which permits us to avoid writing/reading
zero-filled zswap pages to backing store.
- "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
window which occurs when a MAP_FIXED operation is occurring during
an unrelated vma tree walk.
- "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
the vma_merge() functionality, making it cleaner, more testable and
better tested.
- "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
Minor fixups of DAMON selftests and kunit tests.
- "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
Code cleanups and folio conversions.
- "Shmem mTHP controls and stats improvements" from Ryan Roberts.
Cleanups for shmem controls and stats.
- "mm: count the number of anonymous THPs per size" from Barry Song.
Expose additional anon THP stats to userspace for improved tuning.
- "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
folio conversions and removal of now-unused page-based APIs.
- "replace per-quota region priorities histogram buffer with
per-context one" from SeongJae Park. DAMON histogram
rationalization.
- "Docs/damon: update GitHub repo URLs and maintainer-profile" from
SeongJae Park. DAMON documentation updates.
- "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
improve related doc and warn" from Jason Wang: fixes usage of page
allocator __GFP_NOFAIL and GFP_ATOMIC flags.
- "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
This was overprovisioning THPs in sparsely accessed memory areas.
- "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
Add support for zram run-time compression algorithm tuning.
- "mm: Care about shadow stack guard gap when getting an unmapped
area" from Mark Brown. Fix up the various arch_get_unmapped_area()
implementations to better respect guard areas.
- "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
of mem_cgroup_iter() and various code cleanups.
- "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
- "resource: Fix region_intersects() vs add_memory_driver_managed()"
from Huang Ying. Fix a bug in region_intersects() for systems with
CXL memory.
- "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
a couple more code paths to correctly recover from the encountering
of poisoned memory.
- "mm: enable large folios swap-in support" from Barry Song. Support
the swapin of mTHP memory into appropriately-sized folios, rather
than into single-page folios"
* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
zram: free secondary algorithms names
uprobes: turn xol_area->pages[2] into xol_area->page
uprobes: introduce the global struct vm_special_mapping xol_mapping
Revert "uprobes: use vm_special_mapping close() functionality"
mm: support large folios swap-in for sync io devices
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
set_memory: add __must_check to generic stubs
mm/vma: return the exact errno in vms_gather_munmap_vmas()
memcg: cleanup with !CONFIG_MEMCG_V1
mm/show_mem.c: report alloc tags in human readable units
mm: support poison recovery from copy_present_page()
mm: support poison recovery from do_cow_fault()
resource, kunit: add test case for region_intersects()
resource: make alloc_free_mem_region() works for iomem_resource
mm: z3fold: deprecate CONFIG_Z3FOLD
vfio/pci: implement huge_fault support
mm/arm64: support large pfn mappings
mm/x86: support large pfn mappings
...
Replace the deprecated one-element arrays with flexible-array members
in the structs copychunk_ioctl_req and smb2_ea_info_req.
There are no binary differences after this conversion.
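The general shape of this conversion, with illustrative field names rather than the exact ksmbd definitions:
```
#include <linux/types.h>

/* Field names are illustrative, not the exact ksmbd definitions. */
struct example_chunk {
        __le64 SourceOffset;
        __le64 TargetOffset;
        __le32 Length;
        __le32 Reserved;
} __packed;

/* Before: deprecated one-element array used as a variable-length trailer. */
struct example_req_old {
        __le32 ChunkCount;
        struct example_chunk Chunks[1];
} __packed;

/* After: flexible-array member; the trailing chunks start at the same
 * offset, but there is no fake first element to reason about. */
struct example_req_new {
        __le32 ChunkCount;
        struct example_chunk Chunks[];
} __packed;
```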
Link: https://github.com/KSPP/linux/issues/79
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Lots of cleanups and bug fixes this cycle, primarily in the block
allocation, extent management, fast commit, and journalling.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmbsGRcACgkQ8vlZVpUN
gaP+pwgAop3LUpOFQ9dPRTR3+37AJI8adfabfLIDkEkoVA7lyYY/6Q8pcQ0rklq3
wE1WxrJ7MaE1GaFCwRIDIL6TP+uYRK0pPjqbFBxGakhDc+WXrTcALOWWofb7J7PL
FLwP264lRRfKfpMHdK8bx6YHnEN8425PR+ZNXGVPsw+wjo72mmnq54w+ct1iOKiw
dKfIrwwCGKlBsNdYHS/XsSx7MMK8e7nsKoSq0UtpJ4PqF11/asOtlYYODc4hd27U
E3I3UDKuntmz+meAscDejOJqQk5FT184HIt/Y5JfetKU2zpUFj9IKqXDzMjijdaj
vGn9RkTXfJdxMPm1ouF2R6KIRJollg==
=V7+A
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Lots of cleanups and bug fixes this cycle, primarily in the block
allocation, extent management, fast commit, and journalling"
* tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (93 commits)
ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C
ext4: check stripe size compatibility on remount as well
ext4: fix i_data_sem unlock order in ext4_ind_migrate()
ext4: remove the special buffer dirty handling in do_journal_get_write_access
ext4: fix a potential assertion failure due to improperly dirtied buffer
ext4: hoist ext4_block_write_begin and replace the __block_write_begin
ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers
ext4: dax: keep orphan list before truncate overflow allocated blocks
ext4: fix error message when rejecting the default hash
ext4: save unnecessary indentation in ext4_ext_create_new_leaf()
ext4: make some fast commit functions reuse extents path
ext4: refactor ext4_swap_extents() to reuse extents path
ext4: get rid of ppath in convert_initialized_extent()
ext4: get rid of ppath in ext4_ext_handle_unwritten_extents()
ext4: get rid of ppath in ext4_ext_convert_to_initialized()
ext4: get rid of ppath in ext4_convert_unwritten_extents_endio()
ext4: get rid of ppath in ext4_split_convert_extents()
ext4: get rid of ppath in ext4_split_extent()
ext4: get rid of ppath in ext4_force_split_extent_at()
ext4: get rid of ppath in ext4_split_extent_at()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvwAKCRCRxhvAZXjc
ohg3APwJWQnqFlBddcRl4yrPJ/cgcYSYAOdHb+E+blomSwdxcwEAmwsnLPNQOtw2
rxKvQfZqhVT437bl7RpPPZrHGxwTng8=
=6v1r
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs blocksize updates from Christian Brauner:
"This contains the vfs infrastructure as well as the xfs bits to enable
support for block sizes (bs) larger than page sizes (ps) plus a few
fixes to related infrastructure.
There have been efforts over the last 16 years to enable Large
Block Sizes (LBS), that is block sizes in filesystems where bs > page
size. Through these efforts we have learned that one of the main
blockers to supporting bs > ps in filesystems has been a way to
allocate pages that are at least the filesystem block size on the page
cache where bs > ps.
Thanks to various previous efforts it is possible to support bs > ps
in XFS with only a few changes in XFS itself. Most changes are to the
page cache to support minimum order folio support for the target block
size on the filesystem.
A motivation for Large Block Sizes today is to support high-capacity
(many terabytes) QLC SSDs where the internal Indirection
Unit (IU) is typically greater than 4k to help reduce DRAM and so in
turn cost and space. In practice this then allows different
architectures to use a base page size of 4k while still enabling
support for block sizes aligned to the larger IUs by relying on high
order folios on the page cache when needed.
It also allows taking advantage of the drive's support for atomics
larger than 4k with buffered IO support in Linux. As described this
year at LSFMM, supporting large atomics greater than 4k enables
databases to remove the need to rely on their own journaling, so they
can disable double buffered writes, which is a feature different cloud
providers are already enabling through custom storage solutions"
* tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
Documentation: iomap: fix a typo
iomap: remove the iomap_file_buffered_write_punch_delalloc return value
iomap: pass the iomap to the punch callback
iomap: pass flags to iomap_file_buffered_write_punch_delalloc
iomap: improve shared block detection in iomap_unshare_iter
iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
docs:filesystems: fix spelling and grammar mistakes in iomap design page
filemap: fix htmldoc warning for mapping_align_index()
iomap: make zero range flush conditional on unwritten mappings
iomap: fix handling of dirty folios over unwritten extents
iomap: add a private argument for iomap_file_buffered_write
iomap: remove set_memor_ro() on zero page
xfs: enable block size larger than page size support
xfs: make the calculation generic in xfs_sb_validate_fsb_count()
xfs: expose block size in stat
xfs: use kvmalloc for xattr buffers
iomap: fix iomap_dio_zero() for fs bs > system page size
filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
mm: split a folio in minimum folio order chunks
readahead: allocate folios with mapping_min_order in readahead
...
The pair of bloom filters used by delegation_blocked() was intended to
block delegations on given filehandles for between 30 and 60 seconds. A
new filehandle would be recorded in the "new" bit set. That would then
be switched to the "old" bit set between 0 and 30 seconds later, and it
would remain as the "old" bit set for 30 seconds.
Unfortunately the code intended to clear the old bit set once it reached
30 seconds old, preparing it to be the next new bit set, instead cleared
the *new* bit set before switching it to be the old bit set. This means
that the "old" bit set is always empty and delegations are blocked
between 0 and 30 seconds.
This patch updates bd->new before clearing the set with that index,
instead of afterwards.
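A hedged sketch of the rotation described above, with names modelled on the description rather than the exact nfsd code:
```
#include <linux/bitmap.h>

#define EXAMPLE_BLOOM_BITS 256  /* illustrative size */

struct bloom_pair {
        int           new;      /* index of the "new" bit set: 0 or 1 */
        unsigned long set[2][BITS_TO_LONGS(EXAMPLE_BLOOM_BITS)];
};

static void rotate_bloom_sets(struct bloom_pair *bd)
{
        /*
         * Buggy order: clearing set[bd->new] first wipes the filehandles
         * recorded over the last 30 seconds, so the set that becomes "old"
         * is always empty:
         *
         *      bitmap_zero(bd->set[bd->new], EXAMPLE_BLOOM_BITS);
         *      bd->new = 1 - bd->new;
         */

        /* Fixed order: update bd->new first, then clear the set with that
         * index, so the previous 30 seconds of entries survive as "old". */
        bd->new = 1 - bd->new;
        bitmap_zero(bd->set[bd->new], EXAMPLE_BLOOM_BITS);
}
```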
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 6282cd5655 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
At this point in compound processing, currentfh refers to the parent of
the file, not the file itself. Get the correct dentry from the delegation
stateid instead.
Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The code in nfsd4_deleg_getattr_conflict() is convoluted and buggy.
With this patch we:
- properly handle non-nfsd leases. We must not assume flc_owner is a
delegation unless fl_lmops == &nfsd_lease_mng_ops
- move the main code out of the for loop
- have a single exit which calls nfs4_put_stid()
(and other exits which don't need to call that)
[ jlayton: refactored on top of Neil's other patch: nfsd: fix
nfsd4_deleg_getattr_conflict in presence of third party lease ]
Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This patch is intended to go on top of "nfsd: return -EINVAL when
namelen is 0" from Li Lingfeng. Li's patch checks for 0, but we should
be enforcing an upper bound as well.
Note that if nfsdcld somehow gets an id > NFS4_OPAQUE_LIMIT in its
database, it'll truncate it to NFS4_OPAQUE_LIMIT when it does the
downcall anyway.
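A hedged sketch of the combined check (the helper name is made up, and the fallback NFS4_OPAQUE_LIMIT value below is an assumption, not a statement of nfsd's actual definition):
```
#include <linux/errno.h>
#include <linux/types.h>

#ifndef NFS4_OPAQUE_LIMIT
#define NFS4_OPAQUE_LIMIT 1024  /* assumption; nfsd's real value is in its headers */
#endif

static int example_check_client_id_len(size_t namelen)
{
        if (namelen == 0 || namelen > NFS4_OPAQUE_LIMIT)
                return -EINVAL;
        return 0;
}
```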
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add an nfsd_copy_async_done to record the timestamp, the final
status code, and the callback stateid of an async copy.
Rename the nfsd_copy_do_async tracepoint to match that naming
convention to make it easier to enable both of these with a
single glob.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Make it easier to grep for s2s COPY stateids in trace logs: Use the
same display format in nfsd_copy_class as is used to display other
stateids.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Nothing appears to limit the number of concurrent async COPY
operations that clients can start. In addition, AFAICT each async
COPY can copy an unlimited number of 4MB chunks, so can run for a
long time. Thus IMO async COPY can become a DoS vector.
Add a restriction mechanism that bounds the number of concurrent
background COPY operations. Start simple and try to be fair -- this
patch implements a per-namespace limit.
An async COPY request that occurs while this limit is exceeded gets
NFS4ERR_DELAY. The requesting client can choose to send the request
again after a delay or fall back to a traditional read/write style
copy.
If there is need to make the mechanism more sophisticated, we can
visit that in future patches.
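A minimal sketch of such a cap, assuming a hypothetical per-namespace counter (the structure, helper names, and limit below are illustrative, not the actual nfsd code):
```
#include <linux/atomic.h>
#include <linux/types.h>

struct example_nfsd_net {
        atomic_t pending_async_copies;
};

#define EXAMPLE_MAX_ASYNC_COPIES 8      /* hypothetical per-namespace bound */

/* Returns false when the cap is hit; the caller answers NFS4ERR_DELAY. */
static bool example_async_copy_begin(struct example_nfsd_net *nn)
{
        if (atomic_inc_return(&nn->pending_async_copies) >
            EXAMPLE_MAX_ASYNC_COPIES) {
                atomic_dec(&nn->pending_async_copies);
                return false;
        }
        return true;
}

static void example_async_copy_end(struct example_nfsd_net *nn)
{
        atomic_dec(&nn->pending_async_copies);
}
```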
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently, when NFSD handles an asynchronous COPY, it returns a
zero write verifier, relying on the subsequent CB_OFFLOAD callback
to pass the write verifier and a stable_how4 value to the client.
However, if the CB_OFFLOAD never arrives at the client (for example,
if a network partition occurs just as the server sends the
CB_OFFLOAD operation), the client will never receive this verifier.
Thus, if the client sends a follow-up COMMIT, there is no way for
the client to assess the COMMIT result.
The usual recovery for a missing CB_OFFLOAD is for the client to
send an OFFLOAD_STATUS operation, but that operation does not carry
a write verifier in its result. Neither does it carry a stable_how4
value, so the client /must/ send a COMMIT in this case -- which will
always fail because currently there's still no write verifier in the
COPY result.
Thus the server needs to return a normal write verifier in its COPY
result even if the COPY operation is to be performed asynchronously.
If the server recognizes the callback stateid in subsequent
OFFLOAD_STATUS operations, then obviously it has not restarted, and
the write verifier the client received in the COPY result is still
valid and can be used to assess a COMMIT of the copied data, if one
is needed.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
wake_up_var() needs a barrier after the important change is made in the
var and before wake_up_var() is called, else it is possible that a wake
up won't be sent when it should.
In each case here the var is changed in an "atomic" manner, so
smp_mb__after_atomic() is sufficient.
In one case the important change (removing the lease) is performed
*after* the wake_up, which is backwards. The code survives in part
because the wait_var_event is given a timeout.
This patch adds the required barriers and calls destroy_delegation()
*before* waking any threads waiting for the delegation to be destroyed.
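A hedged sketch of the required pattern, using made-up flag names: the waiter checks a condition inside wait_var_event(), and the waker must order its change before calling wake_up_var().
```
#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/wait_bit.h>

static unsigned long example_flags;
#define EXAMPLE_BUSY 0

static void example_wait_for_done(void)
{
        /* Sleeps until the bit is clear and a matching wake up arrives. */
        wait_var_event(&example_flags,
                       !test_bit(EXAMPLE_BUSY, &example_flags));
}

static void example_mark_done(void)
{
        clear_bit(EXAMPLE_BUSY, &example_flags);
        /*
         * The cleared bit must be visible before the wake up is sent;
         * the atomic bitop on its own does not provide that ordering.
         */
        smp_mb__after_atomic();
        wake_up_var(&example_flags);
}
```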
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd has two places that open-code clear_and_wake_up_bit(). One has
the required memory barriers. The other does not.
Change both to use clear_and_wake_up_bit() so we have the barriers
without the noise.
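A small sketch of the two forms, with an illustrative bit name rather than a real nfsd flag:
```
#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/wait_bit.h>

static unsigned long example_state;
#define EXAMPLE_PENDING 0

static void example_finish_open_coded(void)
{
        /* The open-coded form is only correct with the explicit barrier: */
        clear_bit(EXAMPLE_PENDING, &example_state);
        smp_mb__after_atomic();
        wake_up_bit(&example_state, EXAMPLE_PENDING);
}

static void example_finish_with_helper(void)
{
        /* Same effect, with the barrier supplied by the helper: */
        clear_and_wake_up_bit(EXAMPLE_PENDING, &example_state);
}
```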
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add the __counted_by compiler attribute to the flexible array member
volumes to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Use struct_size() instead of manually calculating the number of bytes to
allocate for a pnfs_block_deviceaddr with a single volume.
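A sketch of the pattern, using an illustrative structure rather than the exact pnfs_block_deviceaddr layout:
```
#include <linux/overflow.h>
#include <linux/slab.h>
#include <linux/types.h>

struct example_volume {
        u64 start;
        u64 len;
};

struct example_deviceaddr {
        u32                     nr_volumes;
        struct example_volume   volumes[] __counted_by(nr_volumes);
};

static struct example_deviceaddr *example_alloc_single_volume(gfp_t gfp)
{
        struct example_deviceaddr *da;

        /* struct_size() replaces hand-rolled sizeof() + multiplication. */
        da = kzalloc(struct_size(da, volumes, 1), gfp);
        if (da)
                da->nr_volumes = 1;     /* keeps __counted_by bounds accurate */
        return da;
}
```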
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If not enough buffer space is available, but idmap_lookup has triggered
lookup_fn (which calls cache_get) and returned successfully, then we
miss the cache_put call that pairs with cache_get.
Fixes: ddd1ea5636 ("nfsd4: use xdr_reserve_space in attribute encoding")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add some tracepoints in the callback client RPC operations. Also
add a tracepoint to nfsd4_cb_getattr_done.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Keep track of the "main" opcode for the callback, and display it in the
tracepoint. This makes it simpler to discern what's happening when there
is more than one callback in flight.
The one special case is the CB_NULL RPC. That's not a CB_COMPOUND
opcode, so designate the value 0 for that.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently, you get the warning and stack trace, but nothing is printed
about the relevant error codes. Add that in.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fix spelling errors in comments of nfsd4_release_lockowner and
nfs4_set_delegation.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Commit 427f5f83a3 ("NFSD: Ensure nf_inode is never dereferenced") passes
inode directly to nfsd_file_mark_find_or_create instead of getting it from
nf, so there is no need to pass nf.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD().
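For illustration, the two equivalent initialisation styles:
```
#include <linux/list.h>

/* Initialised at definition time: */
static LIST_HEAD(example_list);

/* ...instead of defining the head and initialising it at runtime: */
static struct list_head example_list_runtime;

static void example_setup(void)
{
        INIT_LIST_HEAD(&example_list_runtime);
}
```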
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Commit 5826e09bf3 ("NFSD: OP_CB_RECALL_ANY should recall both read and
write delegations") added a new assignment statement to add
RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY. So the
old one should be removed.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
According to RFC 8881, all minor versions of NFSv4 support PUTPUBFH.
Replace the XDR decoder for PUTPUBFH with a "noop" since we no
longer want the minorversion check, and PUTPUBFH has no arguments to
decode. (Ideally nfsd4_decode_noop should really be called
nfsd4_decode_void).
PUTPUBFH should now behave just like PUTROOTFH.
Reported-by: Cedric Blancher <cedric.blancher@gmail.com>
Fixes: e1a90ebd8b ("NFSD: Combine decode operations for v4 and v4.1")
Cc: Dan Shelton <dan.f.shelton@gmail.com>
Cc: Roland Mainz <roland.mainz@nrubsig.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The 'callback address' in client_info_show is output without quotes
causing yaml parsers to fail on processing IPv6 addresses.
Adding quotes to 'callback address' also matches that used by
the 'address' field.
Signed-off-by: Mark Grimes <mark.grimes@ixsystems.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If an NFS operation expects a particular sort of object (file, dir, link,
etc) but gets a file handle for a different sort of object, it must
return an error. The actual error varies among NFS versions in non-trivial
ways.
For v2 and v3 there are ISDIR and NOTDIR errors and, for NFSv4 only,
INVAL is suitable.
For v4.0 there is also NFS4ERR_SYMLINK which should be used if a SYMLINK
was found when not expected. This takes precedence over NOTDIR.
For v4.1+ there is also NFS4ERR_WRONG_TYPE which should be used in
preference to EINVAL when none of the specific error codes apply.
When nfsd_mode_check() finds a symlink where it expected a directory it
needs to return an error code that can be converted to NOTDIR for v2 or
v3 but will be SYMLINK for v4. It must be different from the error
code returned when it finds a symlink but expects a regular file - that
must be converted to EINVAL or SYMLINK.
So we introduce an internal error code nfserr_symlink_not_dir which each
version converts as appropriate.
nfsd_check_obj_isreg() is similar to nfsd_mode_check() except that it is
only used by NFSv4 and only for OPEN. NFSERR_INVAL is never a suitable
error if the object is the wrong type. For v4.0 we use nfserr_symlink
for non-dirs even if not a symlink. For v4.1 we have nfserr_wrong_type.
We handle this difference in-place in nfsd_check_obj_isreg() as there is
nothing to be gained by delaying the choice to nfsd4_map_status().
As a result of these changes, nfsd_mode_check() doesn't need an rqstp
arg any more.
Note that NFSv4 operations are actually performed in the xdr code(!!!)
so the only place that we can map the status code successfully is in
nfsd4_encode_operation().
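A hedged sketch of the per-version mapping idea only; the function body and the internal value are placeholders, not the exact nfsd implementation (NFSERR_NOTDIR really is 20, but nfserr_symlink_not_dir's internal value is whatever this series picks):
```
#include <asm/byteorder.h>
#include <linux/types.h>

#define example_nfserr_notdir          cpu_to_be32(20)    /* NFSERR_NOTDIR */
#define example_nfserr_symlink_not_dir cpu_to_be32(10091) /* placeholder value */

static __be32 example_nfsd3_map_status(__be32 status)
{
        if (status == example_nfserr_symlink_not_dir)
                return example_nfserr_notdir;   /* v2/v3 have no SYMLINK error */
        return status;  /* NFSv4 keeps (or maps to) the more specific code */
}
```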
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Rather than using ad hoc values for internal errors (30000, 11000, ...)
use 'enum' to sequentially allocate numbers starting from the first
known available number - now visible as NFS4ERR_FIRST_FREE.
The goal is values that are distinct from all be32 error codes. To get
those we must first select integers that are not already used, then
convert them with cpu_to_be32().
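A sketch of the allocation scheme with made-up names and a placeholder starting value (not the real NFS4ERR_FIRST_FREE): pick distinct host-order integers sequentially, then convert them once so they can live alongside real be32 status codes.
```
#include <asm/byteorder.h>
#include <linux/types.h>

enum {
        EXAMPLE_NFS4ERR_FIRST_FREE = 10100,     /* placeholder, not the real value */
        EXAMPLE_NFSERR_REPLAY_ME = EXAMPLE_NFS4ERR_FIRST_FREE, /* was an ad hoc 11xxx */
        EXAMPLE_NFSERR_DROPIT,                  /* was an ad hoc 30000 */
};

#define example_nfserr_replay_me cpu_to_be32(EXAMPLE_NFSERR_REPLAY_ME)
#define example_nfserr_dropit    cpu_to_be32(EXAMPLE_NFSERR_DROPIT)
```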
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There is code scattered around nfsd which chooses an error status based
on the particular version of nfs being used. It is cleaner to have the
version specific choices in version specific code.
With this patch common code returns the most specific error code
possible and the version specific code maps that if necessary.
Both v2 (nfsproc.c) and v3 (nfs3proc.c) now have a "map_status()"
function which is called to map the resp->status before each non-trivial
nfsd_proc_* or nfsd3_proc_* function returns.
NFS4ERR_SYMLINK and NFS4ERR_WRONG_TYPE introduce extra complications and
are left for a later patch.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This further centralizes version number checks.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
With this patch the only places that test ->rq_vers against a specific
version are nfsd_v4client() and nfsd_set_fh_dentry().
The latter sets some flags in the svc_fh, which now includes:
fh_64bit_cookies
fh_use_wgather
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd_breaker_owns_lease() currently open-codes the same test that
nfsd_v4client() performs.
With this patch we use nfsd_v4client() instead.
Also as i_am_nfsd() is only used in combination with kthread_data(),
replace it with nfsd_current_rqst() which combines the two and returns a
valid svc_rqst, or NULL.
The test for NULL is moved into nfsd_v4client() for code clarity.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd_permission(), exp_rdonly(), nfsd_setuser(), and nfsexp_flags()
only ever need the cred out of rqstp, so pass it explicitly instead of
the whole rqstp.
This makes the interfaces cleaner.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Rather than passing the whole rqst, pass the pieces that are actually
needed. This makes the inputs to rqst_exp_find() more obvious.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Move the stateid handling to nfsd4_copy_notify.
If nfs4_preprocess_stateid_op did not produce an output stateid, error out.
Copy notify specifically does not permit the use of special stateids,
so enforce that outside generic stateid pre-processing.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If an svc thread needs to perform some initialisation that might fail,
it has no good way to handle the failure.
Before the thread can exit it must call svc_exit_thread(), but that
requires the service mutex to be held. The thread cannot simply take
the mutex as that could deadlock if there is a concurrent attempt to
shut down all threads (which is unlikely, but not impossible).
nfsd currently calls svc_exit_thread() unprotected in the unlikely event
that unshare_fs_struct() fails.
We can clean this up by introducing svc_thread_init_status() by which an
svc thread can report whether initialisation has succeeded. If it has,
it continues normally into the action loop. If it has not,
svc_thread_init_status() immediately aborts the thread.
svc_start_kthread() waits for either of these to happen, and calls
svc_exit_thread() (under the mutex) if the thread aborted.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
sp_nrthreads is only ever accessed under the service mutex
(nlmsvc_mutex, nfs_callback_mutex, or nfsd_mutex),
so there is no need for it to be an atomic_t.
The fact that all code using it is single-threaded means that we can
simplify svc_pool_victim and remove the temporary elevation of
sp_nrthreads.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Instead of using kmalloc to allocate an array for storing active version
info, just declare an array of the max size - it is only 5 or so.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's
last major contribution to the kernel:
"SCHED_DEADLINE servers can help fixing starvation issues of low priority
tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that."
(Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes,
Youssef Esmat, Huang Shijie)
- Preparatory changes for sched_ext integration:
- Use set_next_task(.first) where required
- Fix up set_next_task() implementations
- Clean up DL server vs. core sched
- Split up put_prev_task_balance()
- Rework pick_next_task()
- Combine the last put_prev_task() and the first set_next_task()
- Rework dl_server
- Add put_prev_task(.next)
(Peter Zijlstra, with a fix by Tejun Heo)
- Complete the EEVDF transition and refine EEVDF scheduling:
- Implement delayed dequeue
- Allow shorter slices to wakeup-preempt
- Use sched_attr::sched_runtime to set request/slice suggestion
- Document the new feature flags
- Remove unused and duplicate-functionality fields
- Simplify & unify pick_next_task_fair()
- Misc debuggability enhancements
(Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann,
Valentin Schneider and Chuyi Zhou)
- Initialize the vruntime of a new task when it is first enqueued,
resulting in significant decrease in latency of newly woken tasks.
(Zhang Qiao)
- Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
(K Prateek Nayak, Peter Zijlstra)
- Clean up and clarify the usage of rt_task()
(Qais Yousef)
- Preempt SCHED_IDLE entities in strict cgroup hierarchies
(Tianchen Ding)
- Clarify the documentation of time units for deadline scheduler
parameters. (Christian Loehle)
- Remove the HZ_BW chicken-bit feature flag introduced a year ago,
the original change seems to be working fine.
(Phil Auld)
- Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
Peilin He, Qais Yousef and Vincent Guittot)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmbr8qcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gdbw/+Mj3zWfYP+dtUkfgrR2FClPAJoo1/9Dz0
LYD8XgYHu8rEJ0Aq+VbdkgYGUt9utvzUFPIxvWFDcldQl57KwhF4hp9Ir+PqJyYC
NolQ1q8ddo1hnslxnEg6SgHVzQq/4FqMM0nDNUkQETCx6zTyFFeRf+q7o/2c2m5B
uI9dSU1Wrx7XrXm2D3kB8+xP+ZRy+qhbFN5Pfuz96mhelfklylgKMfPzgAiCT/7T
JTbQhQ2HdcCNgiLoSrWsHBDy2UYpouP4zb4jyd+lDQzhSUJrj3u4Xy4vVmuTKq+y
sTgWlgKB+MTuh9UuJ4UYzSnMqg161UlMvtXeH84ABmAqDNGHRPtOKrrlcLtJ3D4x
m1SPhNnsvpjOu2pH0XLIS8al3VUesWND5S+rucHRYSq6Nvhivf4MTvRJlicXXurL
Mt2APnIlhGJuKBNWnmyZovVdtO0ZUUPlaZWfr3rCS4txAVo+HwWhsm3uhtTycQqN
gazsCiuGh6Jds90ZqA/BvdLWG+DY8J0xLlV3ex4pCXuQ/HFrabVWTyThJsULhrZ2
5mTdWIsocPctNMO9/RHMy7vJI7G7ljgHEquWVn5kiGGzXhK6VwVwKAMpfgXGw+YA
yVP6/M7a7g2yEzj69gXkcDa8k/kedMVquJ/G/8YhZM7u7sPqsMjpmaGsqsJRfnpT
ChngAzap+kA=
=TEC6
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot
de Oliveira's last major contribution to the kernel:
"SCHED_DEADLINE servers can help fixing starvation issues of low
priority tasks (e.g., SCHED_OTHER) when higher priority tasks
monopolize CPU cycles. Today we have RT Throttling; DEADLINE
servers should be able to replace and improve that."
(Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef
Esmat, Huang Shijie)
- Preparatory changes for sched_ext integration:
- Use set_next_task(.first) where required
- Fix up set_next_task() implementations
- Clean up DL server vs. core sched
- Split up put_prev_task_balance()
- Rework pick_next_task()
- Combine the last put_prev_task() and the first set_next_task()
- Rework dl_server
- Add put_prev_task(.next)
(Peter Zijlstra, with a fix by Tejun Heo)
- Complete the EEVDF transition and refine EEVDF scheduling:
- Implement delayed dequeue
- Allow shorter slices to wakeup-preempt
- Use sched_attr::sched_runtime to set request/slice suggestion
- Document the new feature flags
- Remove unused and duplicate-functionality fields
- Simplify & unify pick_next_task_fair()
- Misc debuggability enhancements
(Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin
Schneider and Chuyi Zhou)
- Initialize the vruntime of a new task when it is first enqueued,
resulting in significant decrease in latency of newly woken tasks
(Zhang Qiao)
- Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
(K Prateek Nayak, Peter Zijlstra)
- Clean up and clarify the usage of rt_task()
(Qais Yousef)
- Preempt SCHED_IDLE entities in strict cgroup hierarchies
(Tianchen Ding)
- Clarify the documentation of time units for deadline scheduler
parameters (Christian Loehle)
- Remove the HZ_BW chicken-bit feature flag introduced a year ago,
the original change seems to be working fine (Phil Auld)
- Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
Peilin He, Qais Yousef and Vincent Guittot)
* tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched/cpufreq: Use NSEC_PER_MSEC for deadline task
cpufreq/cppc: Use NSEC_PER_MSEC for deadline task
sched/deadline: Clarify nanoseconds in uapi
sched/deadline: Convert schedtool example to chrt
sched/debug: Fix the runnable tasks output
sched: Fix sched_delayed vs sched_core
kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
kthread: Fix task state in kthread worker if being frozen
sched/pelt: Use rq_clock_task() for hw_pressure
sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
sched: Add put_prev_task(.next)
sched: Rework dl_server
sched: Combine the last put_prev_task() and the first set_next_task()
sched: Rework pick_next_task()
sched: Split up put_prev_task_balance()
sched: Clean up DL server vs core sched
sched: Fixup set_next_task() implementations
sched: Use set_next_task(.first) where required
sched/fair: Properly deactivate sched_delayed task upon class change
...
Four fixes for longstanding ocfs2 issues and the remainder address random
MM things.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZuvTyAAKCRDdBJ7gKXxA
jrc4AP95yg/8E50PLyVKFIsot3Vaodq908cz2vvS0n0915NO7AD+Psy11a2aR1E5
L/ZND8Zyv06qmz73WJ7BUIx0CHCniw4=
=eX0p
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"12 hotfixes, 11 of which are cc:stable.
Four fixes for longstanding ocfs2 issues and the remainder address
random MM things"
* tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/madvise: process_madvise() drop capability check if same mm
mm/huge_memory: ensure huge_zero_folio won't have large_rmappable flag set
mm/hugetlb.c: fix UAF of vma in hugetlb fault pathway
mm: change vmf_anon_prepare() to __vmf_anon_prepare()
resource: fix region_intersects() vs add_memory_driver_managed()
zsmalloc: use unique zsmalloc caches names
mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu read lock
mm: vmscan.c: fix OOM on swap stress test
ocfs2: cancel dqi_sync_work before freeing oinfo
ocfs2: fix possible null-ptr-deref in ocfs2_set_buffer_uptodate
ocfs2: remove unreasonable unlock in ocfs2_read_blocks
ocfs2: fix null-ptr-deref when journal load failed.
string:
- add mem_is_zero()
core:
- support more device numbers
- use XArray for minor ids
- add backlight constants
- Split dma fence array creation into alloc and arm
fbdev:
- remove usage of old fbdev hooks
kms:
- Add might_fault() to drm_modeset_lock priming
- Add dynamic per-crtc vblank configuration support
dma-buf:
- docs cleanup
buddy:
- Add start address support for trim function
printk:
- pass description to kmsg_dump
scheduler:
- Remove full_recover from drm_sched_start
ttm:
- Make LRU walk restartable after dropping locks
- Allow direct reclaim to allocate local memory
panic:
- add display QR code (in rust)
displayport:
- mst: GUID improvements
bridge:
- Silence error message on -EPROBE_DEFER
- analogix: Clean up
- bridge-connector: Fix double free
- lt6505: Disable interrupt when powered off
- tc358767: Make default DP port preemphasis configurable
- lt9611uxc: require DRM_BRIDGE_ATTACH_NO_CONNECTOR
- anx7625: simplify OF array handling
- dw-hdmi: simplify clock handling
- lontium-lt8912b: fix mode validation
- nwl-dsi: fix mode vsync/hsync polarity
xe:
- Enable LunarLake and Battlemage support
- Introducing Xe2 ccs modifiers for integrated and discrete graphics
- rename xe perf to xe observation
- use wb caching on DGFX for system memory
- add fence timeouts
- Lunar Lake graphics/media/display workarounds
- Battlemage workarounds
- Battlemage GSC support
- GSC and HuC fw updates for LL/BM
- use dma_fence_chain_free
- refactor hw engine lookup and mmio access
- enable priority mem read for Xe2
- Add first GuC BMG fw
- fix dma-resv lock
- Fix DGFX display suspend/resume
- Use xe_managed for kernel BOs
- Use reserved copy engine for user binds on faulting devices
- Allow mixing dma-fence jobs and long-running faulting jobs
- fix media TLB invalidation
- fix rpm in TTM swapout path
- track resources and VF state by PF
i915:
- Type-C programming fix for MTL+
- FBC cleanup
- Calc vblank delay more accurately
- On DP MST, Enable LT fallback for UHBR<->non-UHBR rates
- Fix DP LTTPR detection
- limit relocations to INT_MAX
- fix long hangs in buddy allocator on DG2/A380
amdgpu:
- Per-queue reset support
- SDMA devcoredump support
- DCN 4.0.1 updates
- GFX12/VCN4/JPEG4 updates
- Convert vbios embedded EDID to drm_edid
- GFX9.3/9.4 devcoredump support
- process isolation framework for GFX 9.4.3/4
- take IOMMU mappings into account for P2P DMA
amdkfd:
- CRIU fixes
- HMM fix
- Enable process isolation support for GFX 9.4.3/4
- Allow users to target recommended SDMA engines
- KFD support for targeting queues on recommended SDMA engines
radeon:
- remove .load and drm_dev_alloc
- Fix vbios embedded EDID size handling
- Convert vbios embedded EDID to drm_edid
- Use GEM references instead of TTM
- r100 cp init cleanup
- Fix potential overflows in evergreen CS offset tracking
msm:
- DPU:
- implement DP/PHY mapping on SC8180X
- Enable writeback on SM8150, SC8180X, SM6125, SM6350
- DP:
- Enable widebus on all relevant chipsets
- MSM8998 HDMI support
- GPU:
- A642L speedbin support
- A615/A306/A621 support
- A7xx devcoredump support
ast:
- astdp: Support AST2600 with VGA
- Clean up HPD
- Fix timeout loop for DP link training
- reorganize output code by type (VGA, DP, etc)
- convert to struct drm_edid
- fix BMC handling for all outputs
exynos:
- drop stale MAINTAINERS pattern
- constify struct
loongson:
- use GEM refcount over TTM
mgag200:
- Improve BMC handling
- Support VBLANK interrupts
- transparently support BMC outputs
nouveau:
- Refactor and clean up internals
- Use GEM refcount over TTM's
gm12u320:
- convert to struct drm_edid
gma500:
- update i2c terms
lcdif:
- pixel clock fix
host1x:
- fix syncpoint IRQ during resume
- use iommu_paging_domain_alloc()
imx:
- ipuv3: convert to struct drm_edid
omapdrm:
- improve error handling
- use common helper for_each_endpoint_of_node()
panel:
- add support for BOE TV101WUM-LL2 plus DT bindings
- novatek-nt35950: improve error handling
- nv3051d: improve error handling
- panel-edp: add support for BOE NE140WUM-N6G; revert support for
SDC ATNA45AF01
- visionox-vtdr6130: improve error handling; use
devm_regulator_bulk_get_const()
- boe-th101mb31ig002: Support for starry-er88577 MIPI-DSI panel plus
DT; Fix porch parameter
- edp: Support AUO B116XTN02.3, AUO B116XAN06.1, AUO B116XAT04.1,
BOE NV140WUM-N41, BOE NV133WUM-N63, BOE NV116WHM-A4D, CMN N116BCA-EA2,
CMN N116BCP-EA2, CSW MNB601LS1-4
- himax-hx8394: Support Microchip AC40T08A MIPI Display panel plus DT
- ilitek-ili9806e: Support Densitron DMT028VGHMCMI-1D TFT plus DT
- jd9365da: Support Melfas lmfbx101117480 MIPI-DSI panel plus DT; Refactor
for code sharing
- panel-edp: fix name for HKC MB116AN01
- jd9365da: fix "exit sleep" commands
- jdi-fhd-r63452: simplify error handling with DSI multi-style
helpers
- mantix-mlaf057we51: simplify error handling with DSI multi-style
helpers
- simple:
support Innolux G070ACE-LH3 plus DT bindings
support On Tat Industrial Company KD50G21-40NT-A1 plus DT bindings
- st7701:
decouple DSI and DRM code
add SPI support
support Anbernic RG28XX plus DT bindings
mediatek:
- support alpha blending
- remove cl in struct cmdq_pkt
- ovl adaptor fix
- add power domain binding for mediatek DPI controller
renesas:
- rz-du: add support for RZ/G2UL plus DT bindings
rockchip:
- Improve DP sink-capability reporting
- dw_hdmi: Support 4k@60Hz
- vop: Support RGB display on Rockchip RK3066; Support 4096px width
sti:
- convert to struct drm_edid
stm:
- Avoid UAF with managed plane and CRTC helpers
- Fix module owner
- Fix error handling in probe
- Depend on COMMON_CLK
- ltdc: Fix transparency after disabling plane; Remove unused interrupt
tegra:
- gr3d: improve PM domain handling
- convert to struct drm_edid
- Call drm_atomic_helper_shutdown()
vc4:
- fix PM during detect
- replace DRM_ERROR() with drm_error()
- v3d: simplify clock retrieval
v3d:
- Clean up perfmon
virtio:
- add DRM capset
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmbq43gACgkQDHTzWXnE
hr4+lg/+O/r41E7ioitcM0DWeWem0dTlvQr41pJ8jujHvw+bXNdg0BMGWtsTyTLA
eOft2AwofsFjg+O7l8IFXOT37mQLdIdfjb3+w5brI198InL3OWC3QV8ZSwY9VGET
n8crO9jFoxNmHZnFniBZbtI6egTyl6H+2ey3E0MTnKiPUKZQvsK/4+x532yVLPob
UUOze5wcjyGZc7LJEIZPohPVneCb9ki7sabDQqh4cxIQ0Eg+nqPpWjYM4XVd+lTS
8QmssbR49LrJ7z9m90qVE+8TjYUCn+ChDPMs61KZAAnc8k++nK41btjGZ23mDKPb
YEguahCYthWJ4U8K18iXBPnLPxZv5+harQ8OIWAUYqdIOWSXHozvuJ2Z84eHV13a
9mQ5vIymXang8G1nEXwX/vml9uhVhBCeWu3qfdse2jfaTWYUb1YzhqUoFvqI0R0K
8wT03MyNdx965CSqAhpH5Jd559ueZmpd+jsHOfhAS+1gxfD6NgoPXv7lpnMUmGWX
SnaeC9RLD4cgy7j2Swo7TEqQHrvK5XhZSwX94kU6RPmFE5RRKqWgFVQmwuikDMId
UpNqDnPT5NL2UX4TNG4V4coyTXvKgVcSB9TA7j8NSLfwdGHhiz73pkYosaZXKyxe
u6qKMwMONfZiT20nhD7RhH0AFnnKosAcO14dhn0TKFZPY6Ce9O8=
=7jR+
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2024-09-19' of https://gitlab.freedesktop.org/drm/kernel
Pull drm updates from Dave Airlie:
"This adds a couple of patches outside the drm core, all should be
acked appropriately; the string and pstore ones are the main ones that
come to mind.
Otherwise it's the usual drivers: xe is getting enabled by default on
some new hardware, we've changed the device number handling to allow
more devices, and we added some optional rust code to create QR codes
in the panic handler, an idea first suggested I think 10 years ago :-)
string:
- add mem_is_zero()
core:
- support more device numbers
- use XArray for minor ids
- add backlight constants
- Split dma fence array creation into alloc and arm
fbdev:
- remove usage of old fbdev hooks
kms:
- Add might_fault() to drm_modeset_lock priming
- Add dynamic per-crtc vblank configuration support
dma-buf:
- docs cleanup
buddy:
- Add start address support for trim function
printk:
- pass description to kmsg_dump
scheduler:
- Remove full_recover from drm_sched_start
ttm:
- Make LRU walk restartable after dropping locks
- Allow direct reclaim to allocate local memory
panic:
- add display QR code (in rust)
displayport:
- mst: GUID improvements
bridge:
- Silence error message on -EPROBE_DEFER
- analogix: Clean up
- bridge-connector: Fix double free
- lt6505: Disable interrupt when powered off
- tc358767: Make default DP port preemphasis configurable
- lt9611uxc: require DRM_BRIDGE_ATTACH_NO_CONNECTOR
- anx7625: simplify OF array handling
- dw-hdmi: simplify clock handling
- lontium-lt8912b: fix mode validation
- nwl-dsi: fix mode vsync/hsync polarity
xe:
- Enable LunarLake and Battlemage support
- Introducing Xe2 ccs modifiers for integrated and discrete graphics
- rename xe perf to xe observation
- use wb caching on DGFX for system memory
- add fence timeouts
- Lunar Lake graphics/media/display workarounds
- Battlemage workarounds
- Battlemage GSC support
- GSC and HuC fw updates for LL/BM
- use dma_fence_chain_free
- refactor hw engine lookup and mmio access
- enable priority mem read for Xe2
- Add first GuC BMG fw
- fix dma-resv lock
- Fix DGFX display suspend/resume
- Use xe_managed for kernel BOs
- Use reserved copy engine for user binds on faulting devices
- Allow mixing dma-fence jobs and long-running faulting jobs
- fix media TLB invalidation
- fix rpm in TTM swapout path
- track resources and VF state by PF
i915:
- Type-C programming fix for MTL+
- FBC cleanup
- Calc vblank delay more accurately
- On DP MST, Enable LT fallback for UHBR<->non-UHBR rates
- Fix DP LTTPR detection
- limit relocations to INT_MAX
- fix long hangs in buddy allocator on DG2/A380
amdgpu:
- Per-queue reset support
- SDMA devcoredump support
- DCN 4.0.1 updates
- GFX12/VCN4/JPEG4 updates
- Convert vbios embedded EDID to drm_edid
- GFX9.3/9.4 devcoredump support
- process isolation framework for GFX 9.4.3/4
- take IOMMU mappings into account for P2P DMA
amdkfd:
- CRIU fixes
- HMM fix
- Enable process isolation support for GFX 9.4.3/4
- Allow users to target recommended SDMA engines
- KFD support for targeting queues on recommended SDMA engines
radeon:
- remove .load and drm_dev_alloc
- Fix vbios embedded EDID size handling
- Convert vbios embedded EDID to drm_edid
- Use GEM references instead of TTM
- r100 cp init cleanup
- Fix potential overflows in evergreen CS offset tracking
msm:
- DPU:
- implement DP/PHY mapping on SC8180X
- Enable writeback on SM8150, SC8180X, SM6125, SM6350
- DP:
- Enable widebus on all relevant chipsets
- MSM8998 HDMI support
- GPU:
- A642L speedbin support
- A615/A306/A621 support
- A7xx devcoredump support
ast:
- astdp: Support AST2600 with VGA
- Clean up HPD
- Fix timeout loop for DP link training
- reorganize output code by type (VGA, DP, etc)
- convert to struct drm_edid
- fix BMC handling for all outputs
exynos:
- drop stale MAINTAINERS pattern
- constify struct
loongson:
- use GEM refcount over TTM
mgag200:
- Improve BMC handling
- Support VBLANK interrupts
- transparently support BMC outputs
nouveau:
- Refactor and clean up internals
- Use GEM refcount over TTM's
gm12u320:
- convert to struct drm_edid
gma500:
- update i2c terms
lcdif:
- pixel clock fix
host1x:
- fix syncpoint IRQ during resume
- use iommu_paging_domain_alloc()
imx:
- ipuv3: convert to struct drm_edid
omapdrm:
- improve error handling
- use common helper for_each_endpoint_of_node()
panel:
- add support for BOE TV101WUM-LL2 plus DT bindings
- novatek-nt35950: improve error handling
- nv3051d: improve error handling
- panel-edp:
- add support for BOE NE140WUM-N6G
- revert support for SDC ATNA45AF01
- visionox-vtdr6130:
- improve error handling
- use devm_regulator_bulk_get_const()
- boe-th101mb31ig002:
- Support for starry-er88577 MIPI-DSI panel plus DT
- Fix porch parameter
- edp: Support AUO B116XTN02.3, AUO B116XAN06.1, AUO B116XAT04.1, BOE
NV140WUM-N41, BOE NV133WUM-N63, BOE NV116WHM-A4D, CMN N116BCA-EA2,
CMN N116BCP-EA2, CSW MNB601LS1-4
- himax-hx8394: Support Microchip AC40T08A MIPI Display panel plus DT
- ilitek-ili9806e: Support Densitron DMT028VGHMCMI-1D TFT plus DT
- jd9365da:
- Support Melfas lmfbx101117480 MIPI-DSI panel plus DT
- Refactor for code sharing
- panel-edp: fix name for HKC MB116AN01
- jd9365da: fix "exit sleep" commands
- jdi-fhd-r63452: simplify error handling with DSI multi-style
helpers
- mantix-mlaf057we51: simplify error handling with DSI multi-style
helpers
- simple:
- support Innolux G070ACE-LH3 plus DT bindings
- support On Tat Industrial Company KD50G21-40NT-A1 plus DT
bindings
- st7701:
- decouple DSI and DRM code
- add SPI support
- support Anbernic RG28XX plus DT bindings
mediatek:
- support alpha blending
- remove cl in struct cmdq_pkt
- ovl adaptor fix
- add power domain binding for mediatek DPI controller
renesas:
- rz-du: add support for RZ/G2UL plus DT bindings
rockchip:
- Improve DP sink-capability reporting
- dw_hdmi: Support 4k@60Hz
- vop:
- Support RGB display on Rockchip RK3066
- Support 4096px width
sti:
- convert to struct drm_edid
stm:
- Avoid UAF with managed plane and CRTC helpers
- Fix module owner
- Fix error handling in probe
- Depend on COMMON_CLK
- ltdc:
- Fix transparency after disabling plane
- Remove unused interrupt
tegra:
- gr3d: improve PM domain handling
- convert to struct drm_edid
- Call drm_atomic_helper_shutdown()
vc4:
- fix PM during detect
- replace DRM_ERROR() with drm_error()
- v3d: simplify clock retrieval
v3d:
- Clean up perfmon
virtio:
- add DRM capset"
* tag 'drm-next-2024-09-19' of https://gitlab.freedesktop.org/drm/kernel: (1326 commits)
drm/xe: Fix missing conversion to xe_display_pm_runtime_resume
drm/xe/xe2hpg: Add Wa_15016589081
drm/xe: Don't keep stale pointer to bo->ggtt_node
drm/xe: fix missing 'xe_vm_put'
drm/xe: fix build warning with CONFIG_PM=n
drm/xe: Suppress missing outer rpm protection warning
drm/xe: prevent potential UAF in pf_provision_vf_ggtt()
drm/amd/display: Add all planes on CRTC to state for overlay cursor
drm/i915/bios: fix printk format width
drm/i915/display: Fix BMG CCS modifiers
drm/amdgpu: get rid of bogus includes of fdtable.h
drm/amdkfd: CRIU fixes
drm/amdgpu: fix a race in kfd_mem_export_dmabuf()
drm: new helper: drm_gem_prime_handle_to_dmabuf()
drm/amdgpu/atomfirmware: Silence UBSAN warning
drm/amdgpu: Fix kdoc entry in 'amdgpu_vm_cpu_prepare'
drm/amd/amdgpu: apply command submission parser for JPEG v1
drm/amd/amdgpu: apply command submission parser for JPEG v2+
drm/amd/pm: fix the pp_dpm_pcie issue on smu v14.0.2/3
drm/amd/pm: update the features set on smu v14.0.2/3
...
Only f_path is used from backing files registered with
FUSE_DEV_IOC_BACKING_OPEN, so it makes sense to allow O_PATH descriptors.
O_PATH files have an empty f_op, so don't check read_iter/write_iter.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
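As a rough illustration of the change above, the hedged sketch below shows a FUSE server registering an O_PATH descriptor as a passthrough backing file. The function name, the fuse_dev_fd variable and the error handling are assumptions; struct fuse_backing_map and FUSE_DEV_IOC_BACKING_OPEN are used as I understand them from <linux/fuse.h>.
```
/*
 * Hedged sketch: register an O_PATH descriptor as a FUSE passthrough
 * backing file. 'fuse_dev_fd' is assumed to be the server's open
 * /dev/fuse descriptor.
 */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fuse.h>
#include <unistd.h>

static int register_backing_o_path(int fuse_dev_fd, const char *path)
{
        /* O_PATH gives a descriptor with an empty f_op; only f_path is
         * needed by the backing-file code, so it is now accepted. */
        int fd = open(path, O_PATH | O_CLOEXEC);
        if (fd < 0)
                return -1;

        struct fuse_backing_map map = { .fd = fd };
        /* On success this returns a backing id usable in passthrough replies. */
        int backing_id = ioctl(fuse_dev_fd, FUSE_DEV_IOC_BACKING_OPEN, &map);

        close(fd);      /* the kernel keeps its own reference once registered */
        return backing_id;
}
```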
- Convert parisc to the generic clockevents framework
- Fix syscall and mm for 64-bit userspace
- Fix stack start when ADDR_NO_RANDOMIZE personality is set
- Fix mmap(MAP_STACK) to map upward growing expandable memory on parisc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCZumrLQAKCRD3ErUQojoP
X0ifAQCLI1G7+watiiVDXf3DRR3o6G73CUO5NlEkkY94T0/megEA1Pn92Prqt4fd
QikvJ/jKXPgcDHplqVWqm4EYG2ByIQA=
=ytxZ
-----END PGP SIGNATURE-----
Merge tag 'parisc-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture updates from Helge Deller:
- On parisc we now use the generic clockevent framework for timekeeping
- Although there is no 64-bit glibc/userspace for parisc yet, for
testing purposes one can run statically linked 64-bit binaries. This
patchset contains two patches which fix 64-bit userspace which has
been broken since kernel 4.19
- Fix the userspace stack position and size when the ADDR_NO_RANDOMIZE
personality is enabled
- On other architectures mmap(MAP_GROWSDOWN | MAP_STACK) creates a
downward-growing stack. On parisc mmap(MAP_STACK) is now sufficient
to create an upward-growing stack
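For reference, a minimal userspace illustration of the mmap(MAP_STACK) behavior described above; the mapping size and the anonymous private mapping are just example choices.
```
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /*
         * On most architectures MAP_GROWSDOWN | MAP_STACK yields a
         * downward-growing stack; on parisc, MAP_STACK alone is now enough
         * and the mapping expands upwards.
         */
        void *stack = mmap(NULL, 64 * 1024, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (stack == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        printf("stack mapping at %p\n", stack);
        return 0;
}
```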
* tag 'parisc-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Allow mmap(MAP_STACK) memory to automatically expand upwards
parisc: Use PRIV_USER instead of hardcoded value
parisc: Fix itlb miss handler for 64-bit programs
parisc: Fix 64-bit userspace syscall path
parisc: Fix stack start for ADDR_NO_RANDOMIZE personality
parisc: Convert to generic clockevents
parisc: pdc_stable: Constify struct kobj_type
- Remove some unnecessary hold/unhold rsb refcounting
in cases where an existing refcount is known to exist.
- Remove some unnecessary checking for zero nodeids, which
should never exist, and add some warning if they do.
- Make the slow freeing of structs in release_lockspace()
async, run from a workqueue.
- Prior rcu freeing allows some further struct lookups to
run without a lock.
- Use blocking kernel_connect on sockets to avoid EINPROGRESS.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEcGkeEvkvjdvlR90nOBtzx/yAaaoFAmbpqCEACgkQOBtzx/yA
aap1FBAAk6qmMoGHLl4pmJtDtPNZlsyCE7/SIdGjeMuszX5ENMl3sdoJ7qxX8ism
r9v7YqN7K3lLuFb8xnuLw1i5VAp8jkNgTEQ2z4gOKRpHILCsQkJyFHF3CxpqbiOz
DbyIgOOqo00kgU/sI3NwbuwSr18DZhY8QylIvqk32K2TjhdfxhffocVsfroDJWAH
NdcH71rAVcPVe+Ls1XW6Bqn+QpDbcaki4V524MImBVVRGHxD1FjFjZ1ppohcqDK0
Xdrxeh3iVz0ovoyxkpHEvdW0+t3y4jLo4bR9iQnj7MVA5eNX4uwTdd1deSg/ueY2
6m7WMIZy6hA1Tj6TdAf+dXp+MPm+LhHaXv7wz1yJU6JdTUryvctynLgXBTVhNEcM
y5TVuLC4ad1NgFsPeGAG7vH1puXoVCxoKT8pPGQhbBoUbILJALGR8ApF/kc+CZms
AUYB2oQoaZXtaubrF3WoUQ4eakypo3z0MVzNzLsO+AYJAOiqQkLJaa6dVLu1qg7n
6UjBypNyJjevoWJIUqI0lYlXD7x3HFz3q6LYjhThITDnNo57KzqgqJ2p9KHtEp8/
WPNm7zPHzH088wk19uEbJP67FySPa34XUqebfOv32je5rHvrF7Ux3TkHI9Dm+by9
PwI3LtmXQyNbH9nwFdqFda7SuDTnlskTCEr3bZx+Dact6Zwk9Lc=
=25XB
-----END PGP SIGNATURE-----
Merge tag 'dlm-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
- Remove some unnecessary hold/unhold rsb refcounting in cases where an
existing refcount is known to exist
- Remove some unnecessary checking for zero nodeids, which should never
exist, and add some warning if they do
- Make the slow freeing of structs in release_lockspace() async, run
from a workqueue
- Prior rcu freeing allows some further struct lookups to run without a
lock
- Use blocking kernel_connect on sockets to avoid EINPROGRESS
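A hedged sketch of the "blocking connect" point above, not the actual dlm lowcomms code: with flags set to 0, kernel_connect() completes synchronously instead of returning -EINPROGRESS as a non-blocking connect would.
```
#include <linux/net.h>
#include <linux/socket.h>

/* Hedged sketch, not the real dlm code: do a synchronous connect. */
static int connect_sketch(struct socket *sock,
                          struct sockaddr_storage *addr, int addr_len)
{
        /* flags == 0: block until connected, no -EINPROGRESS handling */
        return kernel_connect(sock, (struct sockaddr *)addr, addr_len, 0);
}
```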
* tag 'dlm-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: add missing -ENOMEM if alloc_workqueue() fails
dlm: do synchronized socket connect call
dlm: move lkb xarray lookup out of lock
dlm: move dlm_search_rsb_tree() out of lock
dlm: use RSB_HASHED to avoid lookup twice
dlm: async freeing of lockspace resources
dlm: drop kobject release callback handling
dlm: warn about invalid nodeid comparsions
dlm: never return invalid nodeid by dlm_our_nodeid()
dlm: remove unnecessary refcounts
dlm: cleanup memory allocation helpers
* Introduce new ioctls to exchange contents of two files.
The first ioctl does the preparation work to exchange the contents of two
files while the second ioctl performs the actual exchange if the target
file has not been changed since a given sampling point.
* Fixes
- Fix bugs associated with calculating the maximum range of realtime
extents to scan for free space.
- Copy keys instead of records when resizing the incore BMBT root block.
- Do not report FITRIMming more bytes than possibly exist in the
filesystem.
- Modify xfs_fs.h to prevent C++ compilation errors.
- Do not over eagerly free post-EOF speculative preallocation.
- Ensure st_blocks never goes to zero during COW writes
* Cleanups/refactors
- Use Xarray to hold per-AG data instead of a Radix tree.
- Cleanup the following functionality,
- Realtime bitmap.
- Inode allocator.
- Quota.
- Inode rooted btree code.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZtmy/gAKCRAH7y4RirJu
9H3GAP9CnoiZu+U/QmNL5T15fgNGs+BQDrUNbmbn3bNlmIZviQEAi3p+50OlT0nP
lcQ/89NJ6uDFNBiphpkGajlp5vn2BQ0=
=7wy/
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.12-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Chandan Babu:
"New code:
- Introduce new ioctls to exchange contents of two files.
The first ioctl does the preparation work to exchange the contents
of two files while the second ioctl performs the actual exchange if
the target file has not been changed since a given sampling point.
Fixes:
- Fix bugs associated with calculating the maximum range of realtime
extents to scan for free space.
- Copy keys instead of records when resizing the incore BMBT root
block.
- Do not report FITRIMming more bytes than possibly exist in the
filesystem.
- Modify xfs_fs.h to prevent C++ compilation errors.
- Do not over eagerly free post-EOF speculative preallocation.
- Ensure st_blocks never goes to zero during COW writes
Cleanups/refactors:
- Use Xarray to hold per-AG data instead of a Radix tree.
- Cleanups to:
- realtime bitmap
- inode allocator
- quota
- inode rooted btree code"
* tag 'xfs-6.12-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (61 commits)
xfs: ensure st_blocks never goes to zero during COW writes
xfs: use xas_for_each_marked in xfs_reclaim_inodes_count
xfs: convert perag lookup to xarray
xfs: simplify tagged perag iteration
xfs: move the tagged perag lookup helpers to xfs_icache.c
xfs: use kfree_rcu_mightsleep to free the perag structures
xfs: use LIST_HEAD() to simplify code
xfs: Remove duplicate xfs_trans_priv.h header
xfs: remove unnecessary check
xfs: Use xfs set and clear mp state helpers
xfs: reclaim speculative preallocations for append only files
xfs: simplify extent lookup in xfs_can_free_eofblocks
xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks
xfs: only free posteof blocks on first close
xfs: don't free post-EOF blocks on read close
xfs: skip all of xfs_file_release when shut down
xfs: don't bother returning errors from xfs_file_release
xfs: refactor f_op->release handling
xfs: remove the i_mode check in xfs_release
xfs: standardize the btree maxrecs function parameters
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmbpAwkACgkQiiy9cAdy
T1FJhgv+PX+IIGyNNW0I3f3ZzIWqc1DCwxXHCa3gvr7TKimJ71AGbEdzFZZzl3AJ
CdxSLf2NQ6tBUxl65QuMC7XykqQXKvNnQEDPoQcHfFgTtYJi+zng1dDvvXSfFbWW
m2Hql1w6MNFeKlFBavbA6MI94MnZqE5J/yCtWqw3LvEn4l2JwYrAzS5Lw9qjtcER
DmlOsrEFgpsFhhpnyPZXJxaWKZIDG2OuG61LWkqyhvLOTtuFuc9cEsTWPdeRYAT6
KKh5z58wqG2JG0IkVjG1foBclv0zcZgUzqOr2/tzbabYye991kLnUitaTwd+u8xS
pTbVIw1E91sFEqVsr2IpnLUq68MKaahlNfHkNJD0dqaMKfGOujqtNRFw82Yki4w5
aTosgECyUiGKgwuE8HLtwlJaE4EizVdrqQiP2cUOrtuWPvOvnY7vjWKC8kmSM0Z/
u0ov6JdirVlnFE3dlS0i6ywKaolsrrPYUTbv4ihjQiGHtm+VjonH8VYsdg8sUV0e
5/+cyqaF
=B6Et
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French:
- cleanups (moving duplicated code, removing unused code etc)
- fixes relating to "sfu" mount options (for better handling special
file types)
- SMB3.1.1 compression fixes/improvements
* tag 'v6.12-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: (24 commits)
smb: client: fix compression heuristic functions
cifs: Update SFU comments about fifos and sockets
cifs: Add support for creating SFU symlinks
smb: use LIST_HEAD() to simplify code
cifs: Recognize SFU socket type
cifs: Show debug message when SFU Fifo type was detected
cifs: Put explicit zero byte into SFU block/char types
cifs: Add support for reading SFU symlink location
cifs: Fix recognizing SFU symlinks
smb: client: compress: fix an "illegal accesses" issue
smb: client: compress: fix a potential issue of freeing an invalid pointer
smb: client: compress: LZ77 code improvements cleanup
smb: client: insert compression check/call on write requests
smb3: mark compression as CONFIG_EXPERIMENTAL and fix missing compression operation
cifs: Remove obsoleted declaration for cifs_dir_open
smb: client: Use min() macro
cifs: convert to use ERR_CAST()
smb: add comment to STATUS_MCA_OCCURED
smb: move SMB2 Status code to common header file
smb: move some duplicate definitions to common/smbacl.h
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmbpBKIACgkQiiy9cAdy
T1H/TQv+NEjnpJMuqTYRMRdU6prcDoESszQD/hmMCRExGs9rupZxpGioW/Su7URN
m7WGJlbjWKGB5z5MaP5ur24hoiRUT5nYEEKkTyJ4OmMbRDMnpUsLxvOieVXUMsR6
eZ+o/zHdblda54OA48+J7v+0L79xk7wesYbyWagFRzb+GOaIZe1y5BMYDwBWe8ac
KJ0TfZxFmFpbwLN88hCejrFXSK/c6vi9uxgKyB1xTgBPKQTjMeF1caSGDxMF+SrW
gNDP72/ZqoANyBxJUFdPGGEhv3aftRVku3CaLuZcTKdpHcxn9GXEK63gR9oDOEhW
ZCqhifPMm0bTeKF3eCvy8WmkxWWB4KBy8IBIUm7HnmJLo87ctUxDDT5v9XAmCo1R
zz9AAY7QY/IDKUFzais1AWu4lQNd1vQM/O635ahMH7YgIKUHnDhpHQAXnuCO6dk8
iIB6Ghb4cPuztQuy2LOiJ3AIco7O3F7VRJtV0rz/QHv0P1M9yswWyu+LfLZhYkMd
VciukaZ+
=pUoF
-----END PGP SIGNATURE-----
Merge tag '6.12-rc-ksmbd-server-fixes-part1' of git://git.samba.org/ksmbd
Pull smb server updates from Steve French:
"Four ksmbd server fixes, three for stable:
- Fix an issue where the directory can't be deleted if the share is
on a file system that does not provide dot and dotdot entries
- Fix file creation failure if the parent name of pathname is case
sensitive
- Fix write failure with FILE_APPEND_DATA flags
- Add reference count to connection struct to protect UAF of oplocks
on multichannel"
* tag '6.12-rc-ksmbd-server-fixes-part1' of git://git.samba.org/ksmbd:
ksmbd: handle caseless file creation
ksmbd: make __dir_empty() compatible with POSIX
ksmbd: add refcnt to ksmbd_conn struct
ksmbd: allow write with FILE_APPEND_DATA
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAmboijUACgkQNqiEXrVA
jGRgChAArKwng92U1d8WOcv1Jv/Vr9vXXu5bFqnOxTrjqlRT4Jh8k2wRInsISU/0
mJ8ICbQU9XYkoY8zvqO6ZJmqqOSMYipcivQJEug6/RAEKWYRK8pt39O7QfEcdgKr
RSjU+guBaAsnW1WFmKKSMQaDCsCkQp+UwryR/PQTQz54ADRtEJxLMQvU+u2Izj6E
3mfOsvrAZEJ3L5CRqZolcSlo3Ft20trd/shzzySx0lysBmJTU6qgxihzwASxlUNi
2y7jAX9wkHfKRwuj0uQROxtSBbpgNpi5qLs20DZA9WYPL9BqW9KupdRY1WAsIq9X
2vGDNr1ZtVWUu0fg227cDoNAxaMD1v9+lH9IdVixNUdMiG7VlEu7HFa12v8FKhaG
4ddomAtKJ508sVxAnWwOCruUMg+HevrpBo5s0CH01cjftfQ7yAWv0sfzTHyl39xD
OEpNO4gcRkt+XdyXPdQ8ZuuD+zT8bxIW8iBqwPgu7nQgEB2+pVSrRUKfsms0swtf
/16h/3rB4TAlIadTMg0nvwpr2Fd5+WHt+BRhpb2tbTWmjHykRXL5TB8I1hgRP6/e
msAjDlvpw74Ev6W4LwuzQxCztaaXRVOvPNFqWHk+0pmjOUlrtacQhhthN7jM1wax
0AWnP9FN5zXoZCy+LUqVpOgj3+/0LHWIFmY3zYoHh1Au1d01nZI=
=yxeO
-----END PGP SIGNATURE-----
Merge tag 'jfs-6.12' of github.com:kleikamp/linux-shaggy
Pull jfs updates from David Kleikamp:
"A few fixes for jfs"
* tag 'jfs-6.12' of github.com:kleikamp/linux-shaggy:
jfs: Fix uninit-value access of new_ea in ea_buffer
jfs: check if leafidx greater than num leaves per dmap tree
jfs: Fix uaf in dbFreeBits
jfs: fix out-of-bounds in dbNextAG() and diAlloc()
jfs: UBSAN: shift-out-of-bounds in dbFindBits
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmbodTUACgkQEVvVyTe/
1Wp5Jw/+LN99lGoKXKZ7zB8gta4hMm2O+wd3//IdfTi6u59uff0QvVNbczC6SRlJ
sW6czIgjKpR1xt6Orp0Rof0VZK09PMqq3Iw7Tt3xEkJtjKxppfMCQHeoi03CH9kF
gOHPW+Ad4jlCk+TYFEuHk5lQIgVtJ11JDtGjr5tNa03MbrLIa0pFpTdz65bg9uTn
JdPqA+fiJ/mVoXlFQEZkQPeT8CgMfTbKtQDGfcmKzrNo0ZyQdW30u5lP3MGROptn
X/QX5n77UrE9Du+UHPSaVDRUd8iA/ZcZe/7RAlKXWElHpKvAiqv84Sz0Mc1fczHx
nQHaUfQixGs/vakkPXxVqKn/Xvnj4GDwUqyM9mG8kN1x4QpdyFRqM66zZxtWsSi5
/PL0zwAY950a5fY4VsyRsSf6Zl/SlDdB+PVVajk6BKqMxWgvxyuVbFRUqfesHsfA
V5GHil7F03sPzO12SvAJg8HuresCiafLdAPJgGXuGoeprmeSP9eYx0EqcUzLpHd7
ORUF4qEylYFp8vZBJdBnZpSGRUv/mmjDDowsKeKyfgKyVR99lLKLCtqMJ0e0eGaE
mpGNg+eq//snJldvBUHrMYQvyVm18/bbOzT0ZdGPp2bMUxLjSbiab7D3S9GKRxRt
8GRg/bJfZw5tz2i1bxbe5IWfNij5pDilLeAQUsoO8pUwu/SrCy8=
=wnbj
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs updates from Amir Goldstein:
- Increase robustness of overlayfs to crashes in the case of underlying
filesystems that do not guarantee metadata ordering to persistent
storage (problem was reported with ubifs).
- Deny mount inside container with features that require root
privileges to work properly, instead of failing operations later.
- Some clarifications to overlayfs documentation.
* tag 'ovl-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: fail if trusted xattrs are needed but caller lacks permission
overlayfs.rst: update metacopy section in overlayfs documentation
ovl: fsync after metadata copy-up
ovl: don't set the superblock's errseq_t manually
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmboHyUACgkQSfxwEqXe
A66wGQ/8DRIjBllwf1YuTWi4T6OcfoYxK6C9bXO6QPP5gzdTyFE9pvDuuPyad6+F
FR086ydTHeodemz1dFiQCL9etcUaxo4+6FRKyXKF9/1ezGbTA5nJd0/fKJGlqbI2
EoA4LNYHOsvCZk1BTpxRNWKeKphU9zQgQdSigy6Rx8p269UkGmIZjD1PtUc+vqfR
Ox0dK/Cswyo236fRi5HzaoMntWI4vXgLfxty0e1R7tfbstkCxSKWAON1lo3uHgkA
0HpJXWgWXAPt9gp++Fs/jGNpOqbt6IaKeV5f7CjYfvWhlFjNMhQxF+PbxknaZn/k
K0gQsItOIoFTfbQdLDIdfnj9awMdLW8FB2A1WXHpNr9pVC4ickPb1bMTF/XRd0tm
wBNu4BL0gklx6017KZg5uINMIduzMLGkBLRFiBW0en/sZMLTJTMg58BJn0CL1Pmh
1ll/Q3ToSMHalvxU2OnJagTwh4fzzCEpK/hW9WiDO4jSCsMXyX0clinrCjNo1JfA
tqgTWEy3uGtg+dg0Du9VD5JASbNQSJ0ZRnas5+qz10IRWWfTolrsk61dliXLQ4Sv
tSryDtsE2znwJF1Krh4aHNSSVhD5/l/8QaXkf9aZc/kkaHxwsx83FuWnqw6nMz8c
l4B2MbH0jUgsEqEyx+0iwk+FXE9kZKWumTVLjFZ6bRnq3q+uq0U=
=mWCw
-----END PGP SIGNATURE-----
Merge tag 'random-6.12-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"Originally I'd planned on sending each of the vDSO getrandom()
architecture ports to their respective arch trees. But as we started
to work on this, we found lots of interesting issues in the shared
code and infrastructure, and the fixes for those were needed as a base
for the various archs' work.
So in the end, this turned into a nice collaborative effort fixing up
issues and porting to 5 new architectures -- arm64, powerpc64,
powerpc32, s390x, and loongarch64 -- with everybody pitching in and
commenting on each other's code. It was a fun development cycle.
This contains:
- Numerous fixups to the vDSO selftest infrastructure, getting it
running successfully on more platforms, and fixing bugs in it.
- Additions to the vDSO getrandom & chacha selftests. Basically every
time manual review unearthed a bug in a revision of an arch patch,
or an ambiguity, the tests were augmented.
By the time the last arch was submitted for review, s390x, v1 of
the series was essentially fine right out of the gate.
- Fixes to the generic C implementation of vDSO getrandom, to
build and run successfully on all archs, decoupling it from
assumptions we had (unintentionally) made on x86_64 that didn't
carry through to the other architectures.
- Port of vDSO getrandom to LoongArch64, from Xi Ruoyao and acked by
Huacai Chen.
- Port of vDSO getrandom to ARM64, from Adhemerval Zanella and acked
by Will Deacon.
- Port of vDSO getrandom to PowerPC, in both 32-bit and 64-bit
varieties, from Christophe Leroy and acked by Michael Ellerman.
- Port of vDSO getrandom to S390X from Heiko Carstens, the arch
maintainer.
While it'd be natural for there to be things to fix up over the course
of the development cycle, these patches got a decent amount of review
from a fairly diverse crew of folks on the mailing lists, and, for the
most part, they've been cooking in linux-next, which has been helpful
for ironing out build issues.
In terms of architectures, I think that mostly takes care of the
important 64-bit archs with hardware still being produced and running
production loads in settings where vDSO getrandom is likely to help.
Arguably there's still RISC-V left, and we'll see for 6.13 whether
they find it useful and submit a port"
* tag 'random-6.12-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (47 commits)
selftests: vDSO: check cpu caps before running chacha test
s390/vdso: Wire up getrandom() vdso implementation
s390/vdso: Move vdso symbol handling to separate header file
s390/vdso: Allow alternatives in vdso code
s390/module: Provide find_section() helper
s390/facility: Let test_facility() generate static branch if possible
s390/alternatives: Remove ALT_FACILITY_EARLY
s390/facility: Disable compile time optimization for decompressor code
selftests: vDSO: fix vdso_config for s390
selftests: vDSO: fix ELF hash table entry size for s390x
powerpc/vdso: Wire up getrandom() vDSO implementation on VDSO64
powerpc/vdso: Wire up getrandom() vDSO implementation on VDSO32
powerpc/vdso: Refactor CFLAGS for CVDSO build
powerpc/vdso32: Add crtsavres
mm: Define VM_DROPPABLE for powerpc/32
powerpc/vdso: Fix VDSO data access when running in a non-root time namespace
selftests: vDSO: don't include generated headers for chacha test
arm64: vDSO: Wire up getrandom() vDSO implementation
arm64: alternative: make alternative_has_cap_likely() VDSO compatible
selftests: vDSO: also test counter in vdso_test_chacha
...
- binfmt_elf: Dump smaller VMAs first in ELF cores (Brian Mak)
- binfmt_elf: mseal address zero (Jeff Xu)
- binfmt_elf, coredump: Log the reason of the failed core dumps
(Roman Kisel)
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCZufubAAKCRA2KwveOeQk
uyBPAQC2sM6j4xEEjyVkUZMmtumhyHMWtYmkqTmP8SbFn0Q3eQD/Yh9iyRng6Kmc
jYm7RumqQkT+wrjXXKlazuJN9w7knAU=
=UuX5
-----END PGP SIGNATURE-----
Merge tag 'execve-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- binfmt_elf: Dump smaller VMAs first in ELF cores (Brian Mak)
- binfmt_elf: mseal address zero (Jeff Xu)
- binfmt_elf, coredump: Log the reason of the failed core dumps (Roman
Kisel)
* tag 'execve-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_elf: mseal address zero
binfmt_elf: Dump smaller VMAs first in ELF cores
binfmt_elf, coredump: Log the reason of the failed core dumps
coredump: Standartize and fix logging
Allow idmapped mounts for virtiofs.
It's absolutely safe, as virtiofs has the same feature negotiation
mechanism as classical fuse filesystems. This does not affect any
existing setups.
virtiofsd support:
https://gitlab.com/virtio-fs/virtiofsd/-/merge_requests/245
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
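Since idmapped mounts are created through mount_setattr(2) rather than a mount option, here is a hedged userspace sketch of how a container manager might idmap a virtiofs mount; the userns_fd setup, the path handling, and the function name are illustrative assumptions.
```
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hedged sketch: clone the virtiofs mount and apply an ID mapping. */
static int idmap_virtiofs(const char *src, int userns_fd)
{
        int tree_fd = syscall(SYS_open_tree, AT_FDCWD, src,
                              OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
        if (tree_fd < 0)
                return -1;

        struct mount_attr attr = {
                .attr_set = MOUNT_ATTR_IDMAP,
                .userns_fd = userns_fd, /* fd of the mapping user namespace */
        };
        if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
                    &attr, sizeof(attr)) < 0) {
                close(tree_fd);
                return -1;
        }

        return tree_fd; /* attach somewhere with move_mount(2) when ready */
}
```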
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmbn5g0ACgkQu+CwddJF
iJq+Uwf/aqnLNEpjUBzwUUhSojCpPnTtiyjv+AILTxoSTHmbu8OvN0W79+Rpbdmk
O4QapAK+BCs+VL2VATwCCufcJ75Z78txO+buQE0DgwluFTIYZ+IwpUMPsK04ln6A
FD1/uvP1QFx60heqcp2c4zWFBUpg4DE6ufx2A5kieO268lFcWLxyVlcdgRU79ZCt
uAcV2yDLk3GvPGfxZwPKEmZUo/FmuSoBv0XgT+eWxmTu/R7hcpFse49OyjBH8Tvb
8d/RCIFgXOr8dTIjtds7eenwB/is4TkRlctezEQ0jO9/JwL/BVOgXZjD1qCtNWqz
is4TWK7VV+vdq1RD+0xC2hV/+uGEwQ==
=+WAm
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
"This time it's mostly refactoring and improving APIs for slab users in
the kernel, along with some debugging improvements.
- kmem_cache_create() refactoring (Christian Brauner)
Over the years, kmem_cache_create() has been growing new
parameters, most of which are needed only for a small number of
caches - most recently the rcu_freeptr_offset parameter.
To avoid adding new parameters to kmem_cache_create() and adjusting
all its callers, or creating new wrappers such as
kmem_cache_create_rcu(), we can now pass extra parameters using the
new struct kmem_cache_args. Not explicitly initialized fields
default to values interpreted as unused.
kmem_cache_create() is for now a wrapper that works both with the
new form: kmem_cache_create(name, object_size, args, flags) and the
legacy form: kmem_cache_create(name, object_size, align, flags,
ctor)
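A minimal sketch of the two forms, assuming a caller-defined struct my_obj and constructor (both hypothetical):
```
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/string.h>

struct my_obj {
        int id;
        char name[32];
};

static void my_ctor(void *obj)
{
        memset(obj, 0, sizeof(struct my_obj));
}

static struct kmem_cache *my_cache;

static int __init my_cache_init(void)
{
        /* New form: extra parameters live in struct kmem_cache_args;
         * fields left unset default to values interpreted as unused. */
        struct kmem_cache_args args = {
                .align = 64,
                .ctor  = my_ctor,
        };

        my_cache = kmem_cache_create("my_cache", sizeof(struct my_obj),
                                     &args, SLAB_HWCACHE_ALIGN);

        /* Legacy form, still accepted by the compatibility wrapper:
         * kmem_cache_create("my_cache", sizeof(struct my_obj), 64,
         *                   SLAB_HWCACHE_ALIGN, my_ctor);
         */
        return my_cache ? 0 : -ENOMEM;
}
```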
- kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil
Babka, Uladislau Rezki)
Since SLOB removal, kfree() is allowed for freeing objects
allocated by kmem_cache_create(). By extension kfree_rcu() is
allowed as well, which allows converting simple call_rcu()
callbacks that only do kmem_cache_free(), as there was never a
kmem_cache_free_rcu() variant. However, for caches that can be
destroyed e.g. on module removal, the cache owners knew to issue
rcu_barrier() first to wait for the pending call_rcu()'s, and this
is not sufficient for pending kfree_rcu()'s due to its internal
batching optimizations. Ulad has provided a new
kvfree_rcu_barrier() and to make the usage less error-prone,
kmem_cache_destroy() calls it. Additionally, destroying
SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier()
synchronously instead of using an async work, because the past
motivation for async work no longer applies. Users of custom
call_rcu() callbacks should however keep calling rcu_barrier()
before cache destruction.
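In other words, the teardown rule for cache owners now looks roughly like this (a hedged sketch, not taken from any particular driver):
```
#include <linux/rcupdate.h>
#include <linux/slab.h>

static void my_cache_teardown(struct kmem_cache *cache, bool uses_custom_call_rcu)
{
        /*
         * Pending kfree_rcu() objects are waited for inside
         * kmem_cache_destroy() via kvfree_rcu_barrier(); only users of
         * custom call_rcu() callbacks still need an explicit barrier.
         */
        if (uses_custom_call_rcu)
                rcu_barrier();

        kmem_cache_destroy(cache);
}
```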
- Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn)
Currently, KASAN cannot catch UAFs in such caches as it is legal to
access them within a grace period, and we only track the grace
period when trying to free the underlying slab page. The new
CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual
objects to be RCU-delayed, after which KASAN can poison them.
- Delayed memcg charging (Shakeel Butt)
In some cases, the memcg is unknown at allocation time, such as
receiving network packets in softirq context. With
kmem_cache_charge() these may now be charged later when the user
and its memcg is known.
- Misc fixes and improvements (Pedro Falcato, Axel Rasmussen,
Christoph Lameter, Yan Zhen, Peng Fan, Xavier)"
* tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
mm, slab: restore kerneldoc for kmem_cache_create()
io_uring: port to struct kmem_cache_args
slab: make __kmem_cache_create() static inline
slab: make kmem_cache_create_usercopy() static inline
slab: remove kmem_cache_create_rcu()
file: port to struct kmem_cache_args
slab: create kmem_cache_create() compatibility layer
slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args
slab: port KMEM_CACHE() to struct kmem_cache_args
slab: remove rcu_freeptr_offset from struct kmem_cache
slab: pass struct kmem_cache_args to do_kmem_cache_create()
slab: pull kmem_cache_open() into do_kmem_cache_create()
slab: pass struct kmem_cache_args to create_cache()
slab: port kmem_cache_create_usercopy() to struct kmem_cache_args
slab: port kmem_cache_create_rcu() to struct kmem_cache_args
slab: port kmem_cache_create() to struct kmem_cache_args
slab: add struct kmem_cache_args
slab: s/__kmem_cache_create/do_kmem_cache_create/g
memcg: add charging of already allocated slab objects
mm/slab: Optimize the code logic in find_mergeable()
...
If the first directory entry in the root directory is not a bitmap
directory entry, 'bh' will not be released and reassigned, which
will cause a memory leak.
Fixes: 1e49a94cf7 ("exfat: add bitmap operations")
Cc: stable@vger.kernel.org
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
We found that when writing a large file through buffered writes, if the
disk becomes inaccessible, exFAT does not return an error as it should,
which leads to the writing process not stopping properly.
To easily reproduce this issue, you can follow the steps below:
1. format a device to exFAT and then mount (with a full disk erase)
2. dd if=/dev/zero of=/exfat_mount/test.img bs=1M count=8192
3. eject the device
You may find that the dd process does not stop immediately and may
continue for a long time.
The root cause of this issue is that during the buffered write process,
exFAT does not need to access the disk to look up directory entries
or the FAT table (whereas FAT would) every time data is written.
Instead, exFAT simply marks the buffer as dirty and returns,
delegating the writeback operation to the writeback process.
If the disk cannot be accessed at this time, the error will only be
returned to the writeback process, and the original process will not
receive the error, so it cannot be returned to the user side.
When the disk cannot be accessed normally, an error should be returned
to stop the writing process.
Implement sops->shutdown and an ioctl to shut down the file system
when the underlying block device is marked dead.
Signed-off-by: Dongliang Cui <dongliang.cui@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
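As a rough idea of what the shutdown hook above looks like, here is a hedged sketch of a super_operations ->shutdown implementation; the flag name and the surrounding filesystem details are made up for illustration, this is not the actual exfat code.
```
#include <linux/fs.h>

/* Hedged sketch of an sops->shutdown hook. */
static void example_shutdown(struct super_block *sb)
{
        /*
         * Mark the filesystem as shut down so later buffered writes fail
         * fast instead of silently dirtying pages that writeback can
         * never persist. The write paths would check this flag.
         */
        /* set_bit(EXAMPLE_FLAG_SHUTDOWN, &sbi->flags); -- illustrative */
}

static const struct super_operations example_sops = {
        .shutdown = example_shutdown,
        /* ... other operations ... */
};
```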
When cleaning up defrag inodes at btrfs_cleanup_defrag_inodes(), called
during remount and unmount, we are freeing every node from the rbtree
that tracks inodes for auto defrag using
rbtree_postorder_for_each_entry_safe(), which doesn't modify the tree
itself. So once we unlock the lock that protects the rbtree, we have a
tree pointing to a root that was freed (and a root pointing to freed
nodes, and their children pointing to other freed nodes, and so on).
This makes further access to the tree result in a use-after-free with
unpredictable results.
Fix this by initializing the rbtree to an empty root after the call to
rbtree_postorder_for_each_entry_safe() and before unlocking.
Fixes: 276940915f ("btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()")
Reported-by: syzbot+ad7966ca1f5dd8b001b3@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000f9aad406223eabff@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
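The fix pattern for the defrag rbtree above, as a hedged sketch (the struct, lock, and function names are illustrative, not the actual btrfs code):
```
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct defrag_inode {
        struct rb_node rb_node;
        u64 ino;
};

static void cleanup_defrag_inodes(struct rb_root *root, spinlock_t *lock)
{
        struct defrag_inode *entry, *next;

        spin_lock(lock);
        /* The postorder walk frees nodes but never rebalances the tree... */
        rbtree_postorder_for_each_entry_safe(entry, next, root, rb_node)
                kfree(entry);
        /* ...so reset the root to an empty tree before unlocking, so that
         * later lookups cannot follow pointers into freed memory. */
        *root = RB_ROOT;
        spin_unlock(lock);
}
```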
[BUG]
There are some reports about invalid data backref objectids, the report
looks like this:
BTRFS critical (device sda): corrupt leaf: block=333654787489792 slot=110 extent bytenr=333413935558656 len=65536 invalid data ref objectid value 2543
The data ref objectid is the inode number inside the subvolume.
But in the above case, the value is completely sane, not really showing the
problem.
[CAUSE]
The root cause of the problem is the deprecated feature, inode cache.
This feature results in a special inode number, -12ULL, which is no longer
recognized by the tree-checker, triggering the error.
The direct problem here is the output of data ref objectid. The value
shown is in fact the dref_root (subvolume id), not the dref_objectid
(inode number).
[FIX]
Fix the output to use dref_objectid instead.
Reported-by: Neil Parton <njparton@gmail.com>
Reported-by: Archange <archange@archlinux.org>
Link: https://lore.kernel.org/linux-btrfs/CAAYHqBbrrgmh6UmW3ANbysJX9qG9Pbg3ZwnKsV=5mOpv_qix_Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/9541deea-9056-406e-be16-a996b549614d@archlinux.org/
Fixes: f333a3c7e8 ("btrfs: tree-checker: validate dref root and objectid")
CC: stable@vger.kernel.org # 6.11
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing concurrent lseek(2) system calls against the same file
descriptor, using multiple threads belonging to the same process, we have
a short time window where a race happens and can result in a memory leak.
The race happens like this:
1) A program opens a file descriptor for a file and then spawns two
threads (with the pthreads library for example), lets call them
task A and task B;
2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
file.c:find_desired_extent() while holding a read lock on the inode;
3) At the start of find_desired_extent(), it extracts the file's
private_data pointer into a local variable named 'private', which has
a value of NULL;
4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
in shared mode and enters file.c:find_desired_extent(), where it also
extracts file->private_data into its local variable 'private', which
has a NULL value;
5) Because it saw a NULL file private, task A allocates a private
structure and assigns it to the file structure;
6) Task B also saw a NULL file private so it also allocates its own file
private and then assigns it to the same file structure, since both
tasks are using the same file descriptor.
At this point we leak the private structure allocated by task A.
Besides the memory leak, there's also the detail that both tasks end up
using the same cached state record in the private structure (struct
btrfs_file_private::llseek_cached_state), which can result in a
use-after-free problem since one task can free it while the other is
still using it (only one task took a reference count on it). Also, sharing
the cached state is not a good idea since it could result in incorrect
results in the future - right now it should not be a problem because it
ends up being used only in extent-io-tree.c:count_range_bits() where we do
range validation before using the cached state.
Fix this by protecting the private assignment and check of a file while
holding the inode's spinlock, and keep track of the task that allocated
the private, so that it's used only by that task in order to prevent
use-after-free issues with the cached state record as well as potentially
using it incorrectly in the future.
Fixes: 3c32c7212f ("btrfs: use cached state when looking for delalloc ranges with lseek")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
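The shape of the lseek fix above, as a hedged sketch; the names and exact locking are illustrative stand-ins (the real code uses the inode's spinlock and struct btrfs_file_private), not the actual btrfs code.
```
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct my_file_private {
        struct task_struct *owner_task; /* only this task uses the state */
        /* cached lseek state would live here */
};

static struct my_file_private *get_or_create_private(struct file *file,
                                                     spinlock_t *lock)
{
        struct my_file_private *private = file->private_data;

        if (private)
                return private;

        private = kzalloc(sizeof(*private), GFP_KERNEL);
        if (!private)
                return NULL;
        private->owner_task = current;

        spin_lock(lock);
        if (!file->private_data) {
                file->private_data = private;   /* we won the race */
        } else {
                kfree(private);                 /* another task won; drop ours */
                private = file->private_data;
        }
        spin_unlock(lock);

        return private;
}
```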
Debuggers have to guess the FPU buffer layout in core dumps, which is error
prone. This is because AMD and Intel layouts differ.
To avoid buggy heuristics, add an ELF section which describes the buffer
layout and can be retrieved by tools.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbpOuwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTRAEACGHPdAYFp5A396c9qUbHUE2gEKIad2
iuq15TZKLPY/LFqfTwnkp9/nqKtZ0gj4D6XCIucWZjwWJuPgvgGf/tC9Fk+H+C6X
9+rycP3GdqxU28qLxA428SN2Pg3lvqG4rryVWeHUXQ4x8A0DSMV+3pkNY5YgJ+2+
fTzNzVi2tkPRAXhKmj3EdcFcgDPiFQBMm1QNBpc+FqrXk4rjJb9Axln0oT8xemDv
TtJ5BMhFpR73naaiS4IrK8Tk3oFCa8CmafCQfl1zAOor/+EemPQKwMuGeiXE7dLG
eE+OTw5zuxYwlc9WoaPmM/ZiEc5JptpHQUtyHDBN7BaK87VKjsupAXXVOh6XMRCt
R2coqq7fqDqMANwWpUKddky3vSwbst1GZpXGAENOy64yU4VoFutr616WSj3sJfUi
knBauPqLAFeZLhMn/kKr5a0rBgm7VuQSlGPYEhqVdaM3Eb/zJEupFL/bTpqQbbz/
8lo2hYcfDslhShcEZYBwm4eUg+ytZ96K3ciZ5YgNih9LFBxEOo0SY1CqbQJiRtpB
3DmgldYtzRdQq5/JtFGNv717uMESn5khG3qHUpXtrDhWfD8spMWiY1yO/cwWvLFJ
ZS5ATp1dAt1Pbv2MC6r9jQBbW3V7xNNAOJdzUvIZPP04PKeV0ObFOplxhabOzUDj
OLquyIrjpxeisg==
=Vqqo
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
"Provide FPU buffer layout in core dumps:
Debuggers have to guess the FPU buffer layout in core dumps, which is
error prone. This is because AMD and Intel layouts differ.
To avoid buggy heuristics, add an ELF section which describes the buffer
layout and can be retrieved by tools"
* tag 'x86-fpu-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/elf: Add a new FPU buffer layout info to x86 core files
Switch away from quite chatty declarations using typeof_member().
In theory this is faster to compile too because there is no macro
expansion and there is less type checking.
Link: https://lkml.kernel.org/r/81bf02fd-8724-4f4d-a2bb-c59620b7d716@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
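A small before/after illustration of the kind of declaration being replaced; the struct and field here are hypothetical stand-ins, not the actual proc code.
```
#include <linux/kernel.h>
#include <linux/types.h>

struct example_ops {
        ssize_t (*read)(char *buf, size_t len);
};

/* Before: chatty, goes through macro expansion and extra type checking */
static typeof_member(struct example_ops, read) read_hook_old;

/* After: the plain function-pointer type spelled out directly */
static ssize_t (*read_hook_new)(char *buf, size_t len);
```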
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmbi598ACgkQUqAMR0iA
lPL3lw//WaRDKJ1Cb/bKAn3nRpjdqiNBI//K1gRJp0LgLE7qEudE25t4j3F9tvvP
pc9AB81g1Au8Br6iOd+NiGXXW5KWJHaZ3rUAdeo6co4NQCbrY6qTA78ItZSQImBH
A9fhWWr1TGRX8L/N/gR2eYBnpbDGIbRahUOQraUpBn4kEPyR47KEx7Njjo48GcmR
Ye8dIYwUOWEgQeIuIxIAwNf6KyNjo5tQpgve+M8HGwy8mZqP9XV6UjXUACVwQNx6
+CK+IGM+94tCq5KalOaJ5BtsXGKlabHIs7y9QpLS45M2QoHIlDIvpaxzLf0FTsPI
CppqedAGN2jU0NyjfbFk1c+SNQfDZEAZVyF6vKFelP7t2jzAx301RyB2S+Cm7Hh+
PajFty41UT0/y17V4sZawfMqpFyp7Wr6RKQYYKMBRdSQQkToh/dmebBvqPAHW9cJ
LInQQf+XdzbonKa+CTmT/Tg+eM2R124FWeMVnEMdtyXpKUV9qdKWfngtzyRMQiYI
q54ZwKd3VJ9kRIfb7Fp0TBr2NErdnEQE5hh9QhI8SAWENskw5+GmYaQit734U9wA
SU7t9rir7NS4Rc1jHP9SQ9oWWI9HT4hthRGkLh2Knx0O2c6AwOuEI4wkjzSWI3GX
/eeofnbZiUpi7fESf9qmTGtQZ4/9ogQ7fNaroWCSfQzq3+wl+2o=
=28sV
-----END PGP SIGNATURE-----
Merge tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
"This is the "last" part of the support for the new nbcon consoles.
Where "nbcon" stands for "No Big console lock CONsoles" aka not under
the console_lock.
New callbacks are added to struct console:
- write_thread() for flushing nbcon consoles in task context.
- write_atomic() for flushing nbcon consoles in atomic context,
including NMI.
- con->device_lock() and device_unlock() for taking the driver
specific lock, for example, port->lock.
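To make the shape of the new interface concrete, here is a heavily hedged sketch of an nbcon console registration; the exact callback signatures and the CON_NBCON flag are assumptions based on this description, and the driver lock is a made-up example rather than a real port->lock.
```
#include <linux/console.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_hw_lock);     /* stands in for e.g. port->lock */

static void my_write_atomic(struct console *con,
                            struct nbcon_write_context *wctxt)
{
        /* flush pending records; must be safe in any context, incl. NMI */
}

static void my_write_thread(struct console *con,
                            struct nbcon_write_context *wctxt)
{
        /* flush pending records from the per-console kthread */
}

static void my_device_lock(struct console *con, unsigned long *flags)
{
        spin_lock_irqsave(&my_hw_lock, *flags);
}

static void my_device_unlock(struct console *con, unsigned long flags)
{
        spin_unlock_irqrestore(&my_hw_lock, flags);
}

static struct console my_console = {
        .name           = "mycon",
        .flags          = CON_PRINTBUFFER | CON_NBCON,
        .write_atomic   = my_write_atomic,
        .write_thread   = my_write_thread,
        .device_lock    = my_device_lock,
        .device_unlock  = my_device_unlock,
};
```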
New printk-specific kthreads are created:
- per-console kthreads which are responsible for flushing normal
priority messages on nbcon consoles.
- a thread which is responsible for flushing normal priority messages
on all consoles when CONFIG_RT is enabled.
The new callbacks are called under a special per-console lock which
has already been added back in v6.7. It makes it possible to distinguish
three severities: normal, emergency, and panic. A context with a higher
priority could take over the ownership when it is safe even in the
middle of handling a record. The panic context could do it even when
it is not safe. But it is allowed only for the final desperate flush
before entering the infinite loop.
The new lock helps to flush the messages directly in emergency and
panic contexts. But it is not enough in all situations:
- console_lock() is still needed for synchronization against boot
consoles.
- con->device_lock() is needed for synchronization against other
operations on the same HW, e.g. serial port speed setting,
non-printk related read/write.
The dependency on con->device_lock() is mutual. Any code taking the
driver specific lock has to acquire the related nbcon console context
as well. For example, see the new uart_port_lock() API. It provides
the necessary synchronization against emergency and panic contexts
where the messages are flushed only under the new per-console lock.
Maybe surprisingly, a quite tricky part is deciding how to flush
the consoles in various situations. It has to take into account:
- message priority: normal, emergency, panic
- scheduling context: task, atomic, deferred_legacy
- registered consoles: boot, legacy, nbcon
- threads are running: early boot, suspend, shutdown, panic
- caller: printk(), pr_flush(), printk_flush_in_panic(),
console_unlock(), console_start(), ...
The primary decision is made in printk_get_console_flush_type(). It
creates a hint about what the caller should do:
- flush nbcon consoles directly or via the kthread
- call the legacy loop (console_unlock()) directly or via irq_work
The existing behavior is preserved for the legacy consoles. The only
exception is that they are no longer flushed directly from printk()
in panic() before CPUs are stopped. But this blocking happens only
when at least one nbcon console is registered. The motivation is to
increase the chance of producing the crash dump. The legacy consoles
might create a deadlock compared with nbcon consoles. The nbcon
consoles should allow seeing the messages even when the crash dump
fails.
There are three possible ways how nbcon consoles are flushed:
- The per-nbcon-console kthread is responsible for flushing messages
added with the normal priority. This is the default mode.
- The legacy loop, aka console_unlock(), is used when there is still
a boot console registered. There is no easy way to match an
early console driver with an nbcon console driver. And the
console_lock() provides the only reliable serialization at the
moment.
The legacy loop uses either con->write_atomic() or
con->write_thread() callbacks depending on whether it is allowed to
schedule. The atomic variant has to be used from printk().
- In other situations, the messages are flushed directly using
write_atomic() which can be called in any context, including NMI.
It is primarily needed during early boot or shutdown, in emergency
situations, and panic.
The emergency priority is used by code called within
nbcon_cpu_emergency_enter()/exit(). At the moment, it is used in four
situations: WARN(), Oops, lockdep, and RCU stall reports.
Finally, there is no nbcon console at the moment. It means that the
changes should _not_ modify the existing behavior. The only exception
is CONFIG_RT which would force offloading the legacy loop, for normal
priority context, into the dedicated kthread"
* tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (54 commits)
printk: Avoid false positive lockdep report for legacy printing
printk: nbcon: Assign nice -20 for printing threads
printk: Implement legacy printer kthread for PREEMPT_RT
tty: sysfs: Add nbcon support for 'active'
proc: Add nbcon support for /proc/consoles
proc: consoles: Add notation to c_start/c_stop
printk: nbcon: Show replay message on takeover
printk: Provide helper for message prepending
printk: nbcon: Rely on kthreads for normal operation
printk: nbcon: Use thread callback if in task context for legacy
printk: nbcon: Relocate nbcon_atomic_emit_one()
printk: nbcon: Introduce printer kthreads
printk: nbcon: Init @nbcon_seq to highest possible
printk: nbcon: Add context to usable() and emit()
printk: Flush console on unregister_console()
printk: Fail pr_flush() if before SYSTEM_SCHEDULING
printk: nbcon: Add function for printers to reacquire ownership
printk: nbcon: Use raw_cpu_ptr() instead of open coding
printk: Use the BITS_PER_LONG macro
lockdep: Mark emergency sections in lockdep splats
...
After commit(11a347fb6c exfat: change to get file size from
DataLength), the remaining area or hole had been filled with
zeros before calling exfat_direct_IO(), so there is no need to
fall back to buffered write, and ->i_size_aligned is no longer
needed, drop it.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
->i_size_ondisk is no longer used by exfat_write_begin() after
commit(11a347fb6c exfat: change to get file size from DataLength),
drop it.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
- Core:
- Overhaul of posix-timers in preparation of removing the
workaround for periodic timers which have signal delivery
ignored.
- Remove the historical extra jiffie in msleep()
msleep() adds an extra jiffie to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep
time since the large rewrite to a non-cascading wheel, but the
extra jiffie in msleep() remained unnoticed. Remove it.
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and neither reflects
reality nor conforms to the man page. Show the correct 0 slack
for real time tasks and enforce it at the core level instead of
having inconsistent individual checks in various timer setup
functions.
- The usual set of updates and enhancements all over the place.
- Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v
14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK
ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt
xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW
V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS
FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb
zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x
Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1
T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why
0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K
Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G
F6p9cgoDNP9KFg==
=jE0N
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Core:
- Overhaul of posix-timers in preparation of removing the workaround
for periodic timers which have signal delivery ignored.
- Remove the historical extra jiffie in msleep()
msleep() adds an extra jiffie to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep time
since the large rewrite to a non-cascading wheel, but the extra
jiffie in msleep() remained unnoticed. Remove it.
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and neither reflects
reality nor conforms to the man page. Show the correct 0 slack for
real time tasks and enforce it at the core level instead of having
inconsistent individual checks in various timer setup functions.
- The usual set of updates and enhancements all over the place.
Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers"
* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
ntp: Make sure RTC is synchronized when time goes backwards
treewide: Fix wrong singular form of jiffies in comments
cpu: Use already existing usleep_range()
timers: Rename next_expiry_recalc() to be unique
platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
clocksource/drivers/jcore: Use request_percpu_irq()
clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
clocksource: acpi_pm: Add external callback for suspend/resume
clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
dt-bindings: timer: rockchip: Add rk3576 compatible
timers: Annotate possible non critical data race of next_expiry
timers: Remove historical extra jiffie for timeout in msleep()
hrtimer: Use and report correct timerslack values for realtime tasks
hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
timers: Add sparse annotation for timer_sync_wait_running().
signal: Replace BUG_ON()s
...
Change is_compressible() return type to bool, use WARN_ON_ONCE(1) for
internal errors and return false for those.
Renames:
check_repeated_data -> has_repeated_data
check_ascii_bytes -> is_mostly_ascii (also refactor into a single loop)
calc_shannon_entropy -> has_low_entropy
Also wraps "wreq->Length" in le32_to_cpu() in should_compress() (caught
by sparse).
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In SFU mode, activated by the -o sfu mount option, there is now also
support for creating new fifos and sockets.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The Linux cifs client can already detect SFU symlinks and read their
content (target location), but it is currently not able to create a new
symlink. So implement this missing support.
When the 'sfu' mount option is specified and 'mfsymlinks' is not, then
create new symlinks in SFU style. This will provide full SFU compatibility
of symlinks when mounting a cifs share with the 'sfu' option. The
'mfsymlinks' option overrides SFU for better Apple compatibility, as
explained in smb3_update_mnt_flags() in the fs_context.c file.
Extend the __cifs_sfu_make_node() function, which can now also handle the
S_IFLNK type, and refactor the structures passed to sync_write() in this
function by splitting the SFU type and SFU data out of the original
combined struct win_dev, as a combined fixed-length struct cannot be used
for variable-length symlinks.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Move the LSM framework to static calls
This transitions the vast majority of the LSM callbacks into static
calls. Those callbacks which haven't been converted were left as-is
due to the general ugliness of the changes required to support the
static call conversion; we can revisit those callbacks at a future
date. (A generic sketch of the static-call mechanism follows this list.)
- Add the Integrity Policy Enforcement (IPE) LSM
This adds a new LSM, Integrity Policy Enforcement (IPE). There is
plenty of documentation about IPE in these patches, so I'll refrain
from going into too much detail here, but the basic motivation behind
IPE is to provide a mechanism such that administrators can restrict
execution to only those binaries which come from integrity protected
storage, e.g. a dm-verity protected filesystem. You will notice that
IPE requires additional LSM hooks in the initramfs, dm-verity, and
fs-verity code, with the associated patches carrying ACK/review tags
from the associated maintainers. We couldn't find an obvious
maintainer for the initramfs code, but the IPE patchset has been
widely posted over several years.
Both Deven Bowers and Fan Wu have contributed to IPE's development
over the past several years, with Fan Wu agreeing to serve as the IPE
maintainer moving forward. Once IPE is accepted into your tree, I'll
start working with Fan to ensure he has the necessary accounts, keys,
etc. so that he can start submitting IPE pull requests to you
directly during the next merge window.
- Move the lifecycle management of the LSM blobs to the LSM framework
Management of the LSM blobs (the LSM state buffers attached to
various kernel structs, typically via a void pointer named "security"
or similar) has been mixed, some blobs were allocated/managed by
individual LSMs, others were managed by the LSM framework itself.
Starting with this pull we move management of all the LSM blobs,
minus the XFRM blob, into the framework itself, improving consistency
across LSMs, and reducing the amount of duplicated code across LSMs.
Due to some additional work required to migrate the XFRM blob, it has
been left as a todo item for a later date; from a practical
standpoint this omission should have little impact as only SELinux
provides a XFRM LSM implementation.
- Fix problems with the LSM's handling of F_SETOWN
The LSM hook for the fcntl(F_SETOWN) operation had a couple of
problems: it was racy with itself, and it was disconnected from the
associated DAC related logic in such a way that the LSM state could
be updated in cases where the DAC state would not. We fix both of
these problems by moving the security_file_set_fowner() hook into the
same section of code where the DAC attributes are updated. Not only
does this resolve the DAC/LSM synchronization issue, but as that code
block is protected by a lock, it also resolves the race condition.
- Fix potential problems with the security_inode_free() LSM hook
Due to use of RCU to protect inodes and the placement of the LSM hook
associated with freeing the inode, there is a bit of a challenge when
it comes to managing any LSM state associated with an inode. The VFS
folks are not open to relocating the LSM hook so we have to get
creative when it comes to releasing an inode's LSM state.
Traditionally we have used a single LSM callback within the hook that
is triggered when the inode is "marked for death", but not actually
released due to RCU.
Unfortunately, this causes problems for LSMs which want to take an
action when the inode's associated LSM state is actually released; so
we add an additional LSM callback, inode_free_security_rcu(), that is
called when the inode's LSM state is released in the RCU free
callback.
- Refactor two LSM hooks to better fit the LSM return value patterns
The vast majority of the LSM hooks follow the "return 0 on success,
negative values on failure" pattern, however, there are a small
handful that have unique return value behaviors which has caused
confusion in the past and makes it difficult for the BPF verifier to
properly vet BPF LSM programs. This includes patches to
convert two of these "special" LSM hooks to the common 0/-ERRNO pattern.
- Various cleanups and improvements
A handful of patches to remove redundant code, better leverage the
IS_ERR_OR_NULL() helper, add missing "static" markings, and do some
minor style fixups.
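For readers unfamiliar with the mechanism behind the first item above, the kernel's
static-call API replaces an indirect function-pointer call with a patchable direct
call. A generic illustration (not the actual LSM hook macros) might look like this:
```c
#include <linux/static_call.h>

static int default_hook(int arg)
{
	return 0;
}

/* Declare the call site key, initially pointing at the default implementation. */
DEFINE_STATIC_CALL(my_hook, default_hook);

int call_my_hook(int arg)
{
	/* Compiled as a direct, runtime-patchable call instead of an indirect one. */
	return static_call(my_hook)(arg);
}

void set_my_hook(int (*fn)(int))
{
	/* Re-patch all call sites to the new target. */
	static_call_update(my_hook, fn);
}
```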
* tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (40 commits)
security: Update file_set_fowner documentation
fs: Fix file_set_fowner LSM hook inconsistencies
lsm: Use IS_ERR_OR_NULL() helper function
lsm: remove LSM_COUNT and LSM_CONFIG_COUNT
ipe: Remove duplicated include in ipe.c
lsm: replace indirect LSM hook calls with static calls
lsm: count the LSMs enabled at compile time
kernel: Add helper macros for loop unrolling
init/main.c: Initialize early LSMs after arch code, static keys and calls.
MAINTAINERS: add IPE entry with Fan Wu as maintainer
documentation: add IPE documentation
ipe: kunit test for parser
scripts: add boot policy generation program
ipe: enable support for fs-verity as a trust provider
fsverity: expose verified fsverity built-in signatures to LSMs
lsm: add security_inode_setintegrity() hook
ipe: add support for dm-verity as a trust provider
dm-verity: expose root hash digest and signature data to LSMs
block,lsm: add LSM blob and new LSM hooks for block devices
ipe: add permissive toggle
...
Fix an upstream merge resolution issue[1]. The NETFS_SREQ_HIT_EOF flag,
and code to set it, got added via two different paths. The original path
saw it added in the netfslib read improvements[2], but it was also added,
and slightly differently, in a fix that was committed before v6.11:
1da29f2c39
netfs, cifs: Fix handling of short DIO read
However, the code added to smb2_readv_callback() to set the flag didn't
get removed when the netfs read improvements series was rebased to take
account of the cifs fixes. The proposed merge resolution[2] deleted it
rather than rebase the patches.
Fix this by removing the redundant lines. Code to set the bit that derives
from the fix patch is still there, a few lines above in the source.
Fixes: 35219bc5c7 ("Merge tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <stfrench@microsoft.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/CAHk-=wjr8fxk20-wx=63mZruW1LTvBvAKya1GQ1EhyzXb-okMA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/linux-fsdevel/20240913-vfs-netfs-39ef6f974061@brauner/ [2]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix an upstream merge resolution issue[1]. Prior to the netfs read
helpers, the SMB1 asynchronous read callback, cifs_readv_worker(),
performed the cleanup for the operation in the network message processing
loop, potentially slowing down the processing of incoming SMB messages.
With commit a68c74865f ("cifs: Fix SMB1 readv/writev callback in the same
way as SMB2/3"), this was moved to a worker thread (as is done in the
SMB2/3 transport variant). However, the "was_async" argument to
netfs_subreq_terminated (which was originally incorrectly "false") got
flipped to "true" - which was then incorrect because, being in a kernel
thread, it's not in an async context.
This got corrected in the sample merge[2], but Linus, not unreasonably,
switched it back to its previous value.
Note that this value tells netfslib whether or not it can run sleepable
stuff or stuff that takes a long time, such as retries and cleanups, in the
calling thread, or whether it should offload to a worker thread.
Fix this so that it is "false". The callback to netfslib in both SMB1 and
SMB2/3 now gets offloaded from the network message thread to a separate
worker thread and thus it's fine to do the slow work in this thread.
Fixes: 35219bc5c7 ("Merge tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <stfrench@microsoft.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/CAHk-=wjr8fxk20-wx=63mZruW1LTvBvAKya1GQ1EhyzXb-okMA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/linux-fsdevel/20240913-vfs-netfs-39ef6f974061@brauner/ [2]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- MD changes via Song:
- md-bitmap refactoring (Yu Kuai)
- raid5 performance optimization (Artur Paszkiewicz)
- Other small fixes (Yu Kuai, Chen Ni)
- Add a sysfs entry 'new_level' (Xiao Ni)
- Improve information reported in /proc/mdstat (Mateusz Kusiak)
- NVMe changes via Keith:
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)
- A syntax cleanup (Shen)
- Fix a Kconfig linking error (Arnd)
- New queue-depth quirk (Keith)
- Add missing unplug trace event (Keith)
- blk-iocost fixes (Colin, Konstantin)
- t10-pi modular removal and fixes (Alexey)
- Fix for potential BLKSECDISCARD overflow (Alexey)
- bio splitting cleanups and fixes (Christoph)
- Deal with folios rather than pages, speeding up how the
block layer handles bigger IOs (Kundan)
- Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)
- Reduce zoned device overhead in ublk (Ming)
- Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)
- Fix regression in partition error pointer checking (Riyan)
- Add support for write zeroes and rotational status in nbd (Wouter)
- Add Yu Kuai as new BFQ maintainer. The scheduler has been
unmaintained for quite a while.
- Various sets of fixes for BFQ (Yu Kuai)
- Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
Yang)
* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
nvme-pci: qdepth 1 quirk
block: fix potential invalid pointer dereference in blk_add_partition
blk_iocost: make read-only static array vrate_adj_pct const
block: unpin user pages belonging to a folio at once
mm: release number of pages of a folio
block: introduce folio awareness and add a bigger size from folio
block: Added folio-ized version of bio_add_hw_page()
block, bfq: factor out a helper to split bfqq in bfq_init_rq()
block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
block, bfq: remove local variable 'split' in bfq_init_rq()
block, bfq: remove bfq_log_bfqg()
block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
block, bfq: fix procress reference leakage for bfqq in merge chain
block, bfq: fix uaf for accessing waker_bfqq after splitting
blk-throttle: support prioritized processing of metadata
blk-throttle: remove last_low_overflow_time
drbd: Add NULL check for net_conf to prevent dereference in state validation
nvme-tcp: fix link failure for TCP auth
blk-mq: add missing unplug trace event
mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
...
- Support file-backed mounts for containers and sandboxes;
- Mark the experimental fscache backend as deprecated;
- Handle overlapped pclusters caused by crafted images properly;
- Fix a failure path which could cause infinite loops in
z_erofs_init_decompressor();
- Get rid of unnecessary NOFAILs;
- Harmless on-disk hardening & minor cleanups.
Merge tag 'erofs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, we add file-backed mount support, which has has been a
strong requirement for years. It is especially useful when there are
thousands of images running on the same host for containers and other
sandbox use cases, unlike OS image use cases.
Without file-backed mounts, it's hard for container runtimes to manage
and isolate so many unnecessary virtual block devices safely and
efficiently, therefore file-backed mounts are highly preferred. For
EROFS users, ComposeFS [1], containerd, and Android APEXes [2] will
directly benefit from it, and I've seen no risk in implementing it as
a completely immutable filesystem.
The previous experimental feature "EROFS over fscache" is now marked
as deprecated because:
- Fscache is no longer an independent subsystem and has been merged
into netfs, which was somewhat unexpected when it was proposed.
- New HSM "fanotify pre-content hooks" [3] will be landed upstream.
These hooks will replace "EROFS over fscache" in a simpler way, as
EROFS won't be bothered with kernel caching anymore. Userspace
programs can also manage their own caching hierarchy more flexibly.
Once the HSM "fanotify pre-content hooks" is landed, I will remove the
fscache backend entirely as an internal dependency cleanup. More
background is listed in the original patchset [4].
In addition to that, there are bugfixes and cleanups as usual.
Summary:
- Support file-backed mounts for containers and sandboxes
- Mark the experimental fscache backend as deprecated
- Handle overlapped pclusters caused by crafted images properly
- Fix a failure path which could cause infinite loops in
z_erofs_init_decompressor()
- Get rid of unnecessary NOFAILs
- Harmless on-disk hardening & minor cleanups"
* tag 'erofs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: reject inodes with negative i_size
erofs: restrict pcluster size limitations
erofs: allocate more short-lived pages from reserved pool first
erofs: sunset unneeded NOFAILs
erofs: simplify erofs_map_blocks_flatmode()
erofs: refactor read_inode calling convention
erofs: use kmemdup_nul in erofs_fill_symlink
erofs: mark experimental fscache backend deprecated
erofs: support compressed inodes for fileio
erofs: support unencoded inodes for fileio
erofs: add file-backed mount support
erofs: handle overlapped pclusters out of crafted images properly
erofs: fix error handling in z_erofs_init_decompressor
erofs: clean up erofs_register_sysfs()
erofs: fix incorrect symlink detection in fast symlink
Merge tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings mostly refactoring, cleanups, minor performance
optimizations and usual fixes. The folio API conversions are most
noticeable.
There's one less visible change that could have a high impact. The
extent lock scope for read is reduced, not held for the entire
operation. In the buffered read case it's left to the page or inode lock;
some direct io read synchronization is still needed.
This used to prevent deadlocks induced by page faults during direct
io, so there was a 4K limitation on the requests, e.g. for io_uring.
In the future this will allow smoother integration with iomap where
the extent read lock was a major obstacle.
User visible changes:
- the FSTRIM ioctl updates the processed range even after an error or
interruption
- cleaner thread is woken up in SYNC ioctl instead of waking the
transaction thread that can take some delay before waking up the
cleaner, this can speed up cleaning of deleted subvolumes
- print an error message when opening a device fails, e.g. when it's
unexpectedly read-only
Core changes:
- improved extent map handling in various ways (locking, iteration, ...)
- new assertions and locking annotations
- raid-stripe-tree locking fixes
- use xarray for tracking dirty qgroup extents, switched from rb-tree
- turn the subpage test to compile-time condition if possible (e.g.
on x86_64 with 4K pages); this allows skipping a lot of ifs and
remove dead code
- more preparatory work for compression in subpage mode
Cleanups and refactoring:
- folio API conversions, many simple cases where page is passed so
switch it to folios
- more subpage code refactoring, update page state bitmap processing
- introduce auto free for btrfs_path structure, use for the simple
cases"
* tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
btrfs: only unlock the to-be-submitted ranges inside a folio
btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock()
btrfs: BTRFS_PATH_AUTO_FREE in orphan.c
btrfs: use btrfs_path auto free in zoned.c
btrfs: DEFINE_FREE for struct btrfs_path
btrfs: remove btrfs_folio_end_all_writers()
btrfs: constify more pointer parameters
btrfs: rework BTRFS_I as macro to preserve parameter const
btrfs: add and use helper to verify the calling task has locked the inode
btrfs: always update fstrim_range on failure in FITRIM ioctl
btrfs: convert copy_inline_to_page() to use folio
btrfs: convert btrfs_decompress() to take a folio
btrfs: convert zstd_decompress() to take a folio
btrfs: convert lzo_decompress() to take a folio
btrfs: convert zlib_decompress() to take a folio
btrfs: convert try_release_extent_mapping() to take a folio
btrfs: convert try_release_extent_state() to take a folio
btrfs: convert submit_eb_page() to take a folio
btrfs: convert submit_eb_subpage() to take a folio
btrfs: convert read_key_bytes() to take a folio
...
Merge tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This contains the work to improve read/write performance for the new
netfs library.
The main performance enhancing changes are:
- Define a structure, struct folio_queue, and a new iterator type,
ITER_FOLIOQ, to hold a buffer as a replacement for ITER_XARRAY. See
that patch for questions about naming and form.
ITER_FOLIOQ is provided as a replacement for ITER_XARRAY. The
problem with an xarray is that accessing it requires the use of a
lock (typically the RCU read lock) - and this means that we can't
supply iterate_and_advance() with a step function that might sleep
(crypto for example) without having to drop the lock between pages.
ITER_FOLIOQ is the iterator for a chain of folio_queue structs,
where each folio_queue holds a small list of folios. A folio_queue
struct is a simpler structure than xarray and is not subject to
concurrent manipulation by the VM. folio_queue is used rather than
a bvec[] as it can form lists of indefinite size, adding to one end
and removing from the other on the fly (see the conceptual sketch after
this list).
- Provide a copy_folio_from_iter() wrapper.
- Make cifs RDMA support ITER_FOLIOQ.
- Use folio queues in the write-side helpers instead of xarrays.
- Add a function to reset the iterator in a subrequest.
- Simplify the write-side helpers to use sheaves to skip gaps rather
than trying to work out where gaps are.
- In afs, make the read subrequests asynchronous, putting them into
work items to allow the next patch to do progressive
unlocking/reading.
- Overhaul the read-side helpers to improve performance.
- Fix the caching of a partial block at the end of a file.
- Allow a store to be cancelled.
Then some changes for cifs to make it use folio queues instead of
xarrays for crypto bufferage:
- Use raw iteration functions rather than manually coding iteration
when hashing data.
- Switch to using folio_queue for crypto buffers.
- Remove the xarray bits.
Make some adjustments to the /proc/fs/netfs/stats file such that:
- All the netfs stats lines begin 'Netfs:' but change this to
something a bit more useful.
- Add a couple of stats counters to track the numbers of skips and
waits on the per-inode writeback serialisation lock to make it
easier to check for this as a source of performance loss.
Miscellaneous work:
- Ensure that the sb_writers lock is taken around
vfs_{set,remove}xattr() in the cachefiles code.
- Reduce the number of conditional branches in netfs_perform_write().
- Move the CIFS_INO_MODIFIED_ATTR flag to the netfs_inode struct and
remove cifs_post_modify().
- Move the max_len/max_nr_segs members from netfs_io_subrequest to
netfs_io_request as they're only needed for one subreq at a time.
- Add an 'unknown' source value for tracing purposes.
- Remove NETFS_COPY_TO_CACHE as it's no longer used.
- Set the request work function up front at allocation time.
- Use bh-disabling spinlocks for rreq->lock as cachefiles completion
may be run from block-filesystem DIO completion in softirq context.
- Remove fs/netfs/io.c"
* tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
docs: filesystems: corrected grammar of netfs page
cifs: Don't support ITER_XARRAY
cifs: Switch crypto buffer to use a folio_queue rather than an xarray
cifs: Use iterate_and_advance*() routines directly for hashing
netfs: Cancel dirty folios that have no storage destination
cachefiles, netfs: Fix write to partial block at EOF
netfs: Remove fs/netfs/io.c
netfs: Speed up buffered reading
afs: Make read subreqs async
netfs: Simplify the writeback code
netfs: Provide an iterator-reset function
netfs: Use new folio_queue data type and iterator instead of xarray iter
cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs
iov_iter: Provide copy_folio_from_iter()
mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios
netfs: Use bh-disabling spinlocks for rreq->lock
netfs: Set the request work function upon allocation
netfs: Remove NETFS_COPY_TO_CACHE
netfs: Reserve netfs_sreq_source 0 as unset/unknown
netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_stream
...
Merge tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"Recently, we added the ability to list mounts in other mount
namespaces and the ability to retrieve namespace file descriptors
without having to go through procfs by deriving them from pidfds.
This extends nsfs in two ways:
(1) Add the ability to retrieve information about a mount namespace
via NS_MNT_GET_INFO.
This will return the mount namespace id and the number of mounts
currently in the mount namespace. The number of mounts can be
used to size the buffer that needs to be used for listmount() and
is in general useful without having to actually iterate through
all the mounts.
The structure is extensible.
(2) Add the ability to iterate through all mount namespaces over
which the caller holds privilege returning the file descriptor
for the next or previous mount namespace.
To retrieve a mount namespace the caller must be privileged with
respect to its owning user namespace. This means that PID 1 on the host
can list all mounts in all mount namespaces or that a container
can list all mounts of its nested containers.
Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
namespace in one go.
(1) and (2) can be implemented for other namespace types easily.
Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs.
The commit message in 49224a345c ('Merge patch series "nsfs: iterate
through mount namespaces"') contains example code how to do this"
* tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nsfs: iterate through mount namespaces
file: add fput() cleanup helper
fs: add put_mnt_ns() cleanup helper
fs: allow mount namespace fd
Merge tag 'vfs-6.12.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull procfs updates from Christian Brauner:
"This contains the following changes for procfs:
- Add config options and parameters to block forcing memory writes.
This adds a Kconfig option and boot param to allow removing the
FOLL_FORCE flag from /proc/<pid>/mem write calls as this can be
used in various attacks.
The traditional forcing behavior is kept as default because it can
break GDB and some other use cases.
This is the simpler version that you had requested.
- Restrict overmounting of ephemeral entities.
It is currently possible to mount on top of various ephemeral
entities in procfs. This specifically includes magic links. To
recap, magic links are links of the form /proc/<pid>/fd/<nr>. They
serve as references to a target file and during path lookup they
cause a jump to the target path. Such magic links disappear if the
corresponding file descriptor is closed.
Currently it is possible to overmount such magic links. This is
mostly interesting for an attacker that wants to somehow trick a
process into e.g., reopening something that it didn't intend to
reopen or to hide a malicious file descriptor.
But also it risks leaking mounts for long-running processes. When
overmounting a magic link like above, the mount will not be
detached when the file descriptor is closed. Only the target
mountpoint will disappear. Which has the consequence of making it
impossible to unmount that mount afterwards. So the mount will
stick around until the process exits and the /proc/<pid>/ directory
is cleaned up during proc_flush_pid() when the dentries are pruned
and invalidated.
That in turn means it's possible for a program to accidentally leak
mounts and it's also possible to make a task leak mounts without
its knowledge if the attacker just keeps overmounting things under
/proc/<pid>/fd/<nr>.
Disallow overmounting of such ephemeral entities.
- Cleanup the readdir method naming in some procfs file operations.
- Replace kmalloc() and strcpy() with a simple kmemdup() call"
* tag 'vfs-6.12.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
proc: fold kmalloc() + strcpy() into kmemdup()
proc: block mounting on top of /proc/<pid>/fdinfo/*
proc: block mounting on top of /proc/<pid>/fd/*
proc: block mounting on top of /proc/<pid>/map_files/*
proc: add proc_splice_unmountable()
proc: proc_readfdinfo() -> proc_fdinfo_iterate()
proc: proc_readfd() -> proc_fd_iterate()
proc: add config & param to block forcing mem writes
Merge tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fallocate updates from Christian Brauner:
"This contains work to try and cleanup some the fallocate mode
handling. Currently, it confusingly mixes operation modes and an
optional flag.
The work here tries to better define operation modes and optional
flags allowing the core and filesystem code to use switch statements
to switch on the operation mode"
* tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: refactor xfs_file_fallocate
xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_space
xfs: call xfs_flush_unmap_range from xfs_free_file_space
fs: sort out the fallocate mode vs flag mess
ext4: remove tracing for FALLOC_FL_NO_HIDE_STALE
block: remove checks for FALLOC_FL_NO_HIDE_STALE
Merge tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file updates from Christian Brauner:
"This is the work to cleanup and shrink struct file significantly.
Right now, (focusing on x86) struct file is 232 bytes. After this
series struct file will be 184 bytes aka 3 cachelines and a spare 8
bytes for future extensions at the end of the struct.
With struct file being as ubiquitous as it is this should make a
difference for file heavy workloads and allow further optimizations in
the future.
- struct fown_struct was embedded into struct file letting it take up
32 bytes in total when really it shouldn't even be embedded in
struct file in the first place. Instead, actual users of struct
fown_struct now allocate the struct on demand. This frees up 24
bytes.
- Move struct file_ra_state into the union containing the cleanup hooks
and move f_iocb_flags out of the union. This closes a 4 byte hole
we created earlier and brings struct file to 192 bytes, which means
struct file is 3 cachelines and we managed to shrink it by 40
bytes.
- Reorder struct file so that nothing crosses a cacheline.
I suspect that in the future we will end up reordering some members
to mitigate false sharing issues or just because someone does
actually provide really good perf data.
- Shrinking struct file to 192 bytes is only part of the work.
Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache
is created with SLAB_TYPESAFE_BY_RCU the free pointer must be
located outside of the object because the cache doesn't know what
part of the memory can safely be overwritten as it may be needed to
prevent object recycling.
That has the consequence that SLAB_TYPESAFE_BY_RCU may end up
adding a new cacheline.
So this also contains work to add a new kmem_cache_create_rcu()
function that allows the caller to specify an offset where the
freelist pointer is supposed to be placed. Thus avoiding the
implicit addition of a fourth cacheline (see the sketch after this list).
- And finally this removes the f_version member in struct file.
The f_version member isn't particularly well-defined. It is mainly
used as a cookie to detect concurrent seeks when iterating
directories. But it is also abused by some subsystems for
completely unrelated things.
It is mostly a directory and filesystem specific thing that doesn't
really need to live in struct file and with its wonky semantics it
really lacks a specific function.
For pipes, f_version is (ab)used to defer poll notifications until
a write has happened. And struct pipe_inode_info is used by
multiple struct files in their ->private_data so there's no chance
of pushing that down into file->private_data without introducing
another pointer indirection.
But pipes don't rely on f_pos_lock so this adds a union into struct
file encompassing f_pos_lock and a pipe specific f_pipe member that
pipes can use. This union of course can be extended to other file
types and is similar to what we do in struct inode already"
* tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
fs: remove f_version
pipe: use f_pipe
fs: add f_pipe
ubifs: store cookie in private data
ufs: store cookie in private data
udf: store cookie in private data
proc: store cookie in private data
ocfs2: store cookie in private data
input: remove f_version abuse
ext4: store cookie in private data
ext2: store cookie in private data
affs: store cookie in private data
fs: add generic_llseek_cookie()
fs: use must_set_pos()
fs: add must_set_pos()
fs: add vfs_setpos_cookie()
s390: remove unused f_version
ceph: remove unused f_version
adi: remove unused f_version
mm: Removed @freeptr_offset to prevent doc warning
...
Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs folio updates from Christian Brauner:
"This contains work to port write_begin and write_end to rely on folios
for various filesystems.
This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs,
ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and
squashfs.
After this series lands a bunch of the filesystems in this list do not
mention struct page anymore"
* tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits)
Squashfs: Ensure all readahead pages have been used
Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index
Squashfs: Update squashfs_readpage_block() to not use page->index
Squashfs: Update squashfs_readahead() to not use page->index
Squashfs: Update page_actor to not use page->index
jffs2: Use a folio in jffs2_garbage_collect_dnode()
jffs2: Convert jffs2_do_readpage_nolock to take a folio
buffer: Convert __block_write_begin() to take a folio
ocfs2: Convert ocfs2_write_zero_page to use a folio
fs: Convert aops->write_begin to take a folio
fs: Convert aops->write_end to take a folio
vboxsf: Use a folio in vboxsf_write_end()
orangefs: Convert orangefs_write_begin() to use a folio
orangefs: Convert orangefs_write_end() to use a folio
jffs2: Convert jffs2_write_begin() to use a folio
jffs2: Convert jffs2_write_end() to use a folio
hostfs: Convert hostfs_write_end() to use a folio
fuse: Convert fuse_write_begin() to use a folio
fuse: Convert fuse_write_end() to use a folio
f2fs: Convert f2fs_write_begin() to use a folio
...
Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual pile of misc updates:
Features:
- Add F_CREATED_QUERY fcntl() that allows userspace to query whether
a file was actually created. Often userspace wants to know whether
an O_CREATE request did actually create a file without using
O_EXCL. The current logic is that to first attempts to open the
file without O_CREAT | O_EXCL and if ENOENT is returned userspace
tries again with both flags. If that succeeds all is well. If it
now reports EEXIST it retries.
That works fairly well but some corner cases make this more
involved. If this operates on a dangling symlink the first openat()
without O_CREAT | O_EXCL will return ENOENT but the second openat()
with O_CREAT | O_EXCL will fail with EEXIST.
The reason is that openat() without O_CREAT | O_EXCL follows the
symlink while O_CREAT | O_EXCL doesn't for security reasons. So
it's not something we can really change unless we add an explicit
opt-in via O_FOLLOW which seems really ugly.
All available workarounds are really nasty (fanotify, bpf lsm etc)
so add a simple fcntl() (a usage sketch follows this list).
- Try an opportunistic lookup for O_CREAT. Today, when opening a file
we'll typically do a fast lookup, but if O_CREAT is set, the kernel
always takes the exclusive inode lock. This was likely done with
the expectation that O_CREAT means that we always expect to do the
create, but that's often not the case. Many programs set O_CREAT
even in scenarios where the file already exists (see related
F_CREATED_QUERY patch motivation above).
The series contained in the PR rearranges the pathwalk-for-open
code to also attempt a fast_lookup in certain O_CREAT cases. If a
positive dentry is found, the inode_lock can be avoided altogether
and it can stay in rcuwalk mode for the last step_into.
- Expose the 64 bit mount id via name_to_handle_at()
Now that we provide a unique 64-bit mount ID interface in statx(2),
we can now provide a race-free way for name_to_handle_at(2) to
provide a file handle and corresponding mount without needing to
worry about racing with /proc/mountinfo parsing or having to open a
file just to do statx(2).
While this is not necessary if you are using AT_EMPTY_PATH and
don't care about an extra statx(2) call, users that pass full paths
into name_to_handle_at(2) need to know which mount the file handle
comes from (to make sure they don't try to open_by_handle_at a file
handle from a different filesystem) and switching to AT_EMPTY_PATH
would require allocating a file for every name_to_handle_at(2) call
- Add a per dentry expire timeout to autofs
There are two fairly well known automounter map formats, the autofs
format and the amd format (more or less System V and Berkley).
Some time ago Linux autofs added an amd map format parser that
implemented a fair amount of the amd functionality. This was done
within the autofs infrastructure and some functionality wasn't
implemented because it either didn't make sense or required extra
kernel changes. The idea was to restrict changes to be within the
existing autofs functionality as much as possible and leave changes
with a wider scope to be considered later.
One of these changes is implementing the amd options:
1) "unmount", expire this mount according to a timeout (same as
the current autofs default).
2) "nounmount", don't expire this mount (same as setting the
autofs timeout to 0 except only for this specific mount).
3) "utimeout=<seconds>", expire this mount using the specified
timeout (again same as setting the autofs timeout but only for
this mount)
To implement these options per-dentry expire timeouts need to be
implemented for autofs indirect mounts. This is because all map
keys (mounts) for autofs indirect mounts use an expire timeout
stored in the autofs mount super block info structure, and all
indirect mounts use the same expire timeout.
Fixes:
- Fix missing fput for FSCONFIG_SET_FD in autofs
- Use param->file for FSCONFIG_SET_FD in coda
- Delete the 'fs/netfs' proc subtreee when netfs module exits
- Make sure that struct uid_gid_map fits into a single cacheline
- Don't flush in-flight wb switches for superblocks without cgroup
writeback
- Correcting the idmapping mount example in the idmapping
documentation
- Fix a race between evice_inodes() and find_inode() and iput()
- Refine the show_inode_state() macro definition in writeback code
- Prevent dump_mapping() from accessing invalid dentry.d_name.name
- Show actual source for debugfs in /proc/mounts
- Annotate data-race of busy_poll_usecs in eventpoll
- Don't WARN for racy path_noexec check in exec code
- Handle OOM on mnt_warn_timestamp_expiry()
- Fix some spelling in the iomap design documentation
- Fix typo in procfs comment
- Fix typo in fs/namespace.c comment
Cleanups:
- Add the VFS git tree to the MAINTAINERS file
- Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
bit in struct file bringing us to 5 free f_mode bits
- Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
the wait mechanism
- Remove the unused path_put_init() helper
- Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
specific
- Replace the unsigned long i_state member with a u32 i_state member
in struct inode freeing up 4 bytes in struct inode. Instead of
using the bit based wait apis we're now using the var event apis
and using the individual bytes of the i_state member to wait on
state changes
- Explain how per-syscall AT_* flags should be allocated
- Use in_group_or_capable() helper to simplify the posix acl mode
update code
- Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code
- Removed comment about d_rcu_to_refcount() as that function doesn't
exist anymore
- Add kernel documentation for lookup_fast()
- Don't re-zero eventpoll fields
- Remove outdated comment after close_fd()
- Fix imprecise wording in comment about the pipe filesystem
- Drop GFP_NOFAIL mode from alloc_page_buffers
- Missing blank line warnings and struct declaration improved in
file_table
- Annotate struct poll_list with __counted_by()
- Remove the unused read parameter in percpu-rwsem
- Remove linux/prefetch.h include from direct-io code
- Use kmemdup_array instead of kmemdup for multiple allocation in
mnt_idmapping code
- Remove unused mnt_cursor_del() declaration
Performance tweaks:
- Dodge smp_mb in break_lease and break_deleg in the common case
- Only read fops once in fops_{get,put}()
- Use RCU in ilookup()
- Elide smp_mb in iversion handling in the common case
- Drop one lock trip in evict()"
* tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
uidgid: make sure we fit into one cacheline
proc: Fix typo in the comment
fs/pipe: Correct imprecise wording in comment
fhandle: expose u64 mount id to name_to_handle_at(2)
uapi: explain how per-syscall AT_* flags should be allocated
fs: drop GFP_NOFAIL mode from alloc_page_buffers
writeback: Refine the show_inode_state() macro definition
fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
netfs: Delete subtree of 'fs/netfs' when netfs module exits
fs: use LIST_HEAD() to simplify code
inode: make i_state a u32
inode: port __I_LRU_ISOLATING to var event
vfs: fix race between evice_inodes() and find_inode()&iput()
inode: port __I_NEW to var event
inode: port __I_SYNC to var event
fs: reorder i_state bits
fs: add i_state helpers
MAINTAINERS: add the VFS git tree
fs: s/__u32/u32/ for s_fsnotify_mask
...
ACPI:
* Enable PMCG erratum workaround for HiSilicon HIP10 and 11 platforms.
* Ensure arm64-specific IORT header is covered by MAINTAINERS.
CPU Errata:
* Enable workaround for hardware access/dirty issue on Ampere-1A cores.
Memory management:
* Define PHYSMEM_END to fix a crash in the amdgpu driver.
* Avoid tripping over invalid kernel mappings on the kexec() path.
* Userspace support for the Permission Overlay Extension (POE) using
protection keys.
Perf and PMUs:
* Add support for the "fixed instruction counter" extension in the CPU
PMU architecture.
* Extend and fix the event encodings for Apple's M1 CPU PMU.
* Allow LSM hooks to decide on SPE permissions for physical profiling.
* Add support for the CMN S3 and NI-700 PMUs.
Confidential Computing:
* Add support for booting an arm64 kernel as a protected guest under
Android's "Protected KVM" (pKVM) hypervisor.
Selftests:
* Fix vector length issues in the SVE/SME sigreturn tests
* Fix build warning in the ptrace tests.
Timers:
* Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
non-determinism arising from the architected counter.
Miscellaneous:
* Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
don't succeed.
* Minor fixes and cleanups.
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"The highlights are support for Arm's "Permission Overlay Extension"
using memory protection keys, support for running as a protected guest
on Android as well as perf support for a bunch of new interconnect
PMUs.
Summary:
ACPI:
- Enable PMCG erratum workaround for HiSilicon HIP10 and 11
platforms.
- Ensure arm64-specific IORT header is covered by MAINTAINERS.
CPU Errata:
- Enable workaround for hardware access/dirty issue on Ampere-1A
cores.
Memory management:
- Define PHYSMEM_END to fix a crash in the amdgpu driver.
- Avoid tripping over invalid kernel mappings on the kexec() path.
- Userspace support for the Permission Overlay Extension (POE) using
protection keys (see the pkey sketch after this list).
Perf and PMUs:
- Add support for the "fixed instruction counter" extension in the
CPU PMU architecture.
- Extend and fix the event encodings for Apple's M1 CPU PMU.
- Allow LSM hooks to decide on SPE permissions for physical
profiling.
- Add support for the CMN S3 and NI-700 PMUs.
Confidential Computing:
- Add support for booting an arm64 kernel as a protected guest under
Android's "Protected KVM" (pKVM) hypervisor.
Selftests:
- Fix vector length issues in the SVE/SME sigreturn tests
- Fix build warning in the ptrace tests.
Timers:
- Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
non-determinism arising from the architected counter.
Miscellaneous:
- Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
don't succeed.
- Minor fixes and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
perf: arm-ni: Fix an NULL vs IS_ERR() bug
arm64: hibernate: Fix warning for cast from restricted gfp_t
arm64: esr: Define ESR_ELx_EC_* constants as UL
arm64: pkeys: remove redundant WARN
perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
MAINTAINERS: List Arm interconnect PMUs as supported
perf: Add driver for Arm NI-700 interconnect PMU
dt-bindings/perf: Add Arm NI-700 PMU
perf/arm-cmn: Improve format attr printing
perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
arm64/mm: use lm_alias() with addresses passed to memblock_free()
mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
arm64: Expose the end of the linear map in PHYSMEM_END
arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
arm64/mm: Delete __init region from memblock.reserved
perf/arm-cmn: Support CMN S3
dt-bindings: perf: arm-cmn: Add CMN S3
perf/arm-cmn: Refactor DTC PMU register access
perf/arm-cmn: Make cycle counts less surprising
perf/arm-cmn: Improve build-time assertion
...
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD(). No functional impact.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
SFU, since its (first) version 3.0, supports AF_LOCAL sockets and stores them
on the filesystem as a system file with one zero byte. Add support for detecting
this SFU socket type to the cifs_sfu_type() function.
With this change cifs_sfu_type() will correctly detect all special file
types created by SFU: fifo, socket, symlink, block and char.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
For debugging purposes it is a good idea to show the detected SFU type also
for fifos. A debug message is already printed for all other special types.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
SFU types IntxCHR and IntxBLK are 8 bytes with zero as the last byte. Make
this explicit in the memcpy and memset calls, so the zero byte is visible in
the code (and not hidden as a string's trailing nul byte).
It is important for the reader to see the last byte for the block and char
types because it differs from the last byte of the symlink type (which is
0x01). It is also important to show that the type is not a nul-terminated
string, but rather 8 bytes (with some printable bytes).
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently, when the sfu mount option is specified, CIFS can recognize SFU
symlinks but is not able to read the symlink target location; the readlink()
syscall just returns that the operation is not supported.
Implement this missing functionality in the cifs_sfu_type() function. Read
the target location of the SFU-style symlink, parse it, and fill it into
fattr's cf_symlink_target member.
An SFU-style symlink is a file which has the system attribute set and whose
content is the buffer "IntxLNK\1" (the 8th byte is 0x01) followed by the
target location encoded in little-endian UCS-2/UTF-16. This format was
introduced in the Interix 3.0 subsystem, as part of Microsoft SFU 3.0, and is
also used by all later versions. Previous versions had no symlink support.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
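To make the layout described above concrete, an illustrative helper that builds
such a symlink body might look like this; the real client uses the cifs UTF-16
conversion helpers, and this simplified sketch only handles ASCII targets.
```c
#include <stdint.h>
#include <string.h>

/*
 * Illustrative only: build the content of an SFU-style symlink file, i.e.
 * the 8-byte "IntxLNK\1" magic followed by the target path encoded as
 * little-endian UTF-16 (only ASCII targets are handled in this sketch).
 */
static size_t sfu_symlink_body(uint8_t *buf, size_t buflen, const char *target)
{
	static const uint8_t magic[8] = { 'I', 'n', 't', 'x', 'L', 'N', 'K', 0x01 };
	size_t n = strlen(target);

	if (buflen < sizeof(magic) + 2 * n)
		return 0;

	memcpy(buf, magic, sizeof(magic));
	for (size_t i = 0; i < n; i++) {
		buf[sizeof(magic) + 2 * i]     = (uint8_t)target[i];	/* low byte  */
		buf[sizeof(magic) + 2 * i + 1] = 0x00;			/* high byte */
	}
	return sizeof(magic) + 2 * n;
}
```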
SFU symlinks have an 8-byte prefix: "IntxLNK\1".
So also check the last (8th) byte, 0x01.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Using uninitialized value "bkt" when calling "kfree"
Fixes: 13b68d4499 ("smb: client: compress: LZ77 code improvements cleanup")
Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The dst pointer may not be initialized when calling kvfree(dst)
Fixes: 13b68d4499 ("smb: client: compress: LZ77 code improvements cleanup")
Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Check data compressibility with some heuristics (copied from
btrfs):
- should_compress() final decision is is_compressible(data)
- Cleanup compress/lz77.h leaving only lz77_compress() exposed:
- Move parts to compress/lz77.c, while removing the rest of it
because they were either unused, used only once, were
implemented wrong (thanks to David Howells for the help)
- Updated the compression parameters (still compatible with
Windows implementation) trading off ~20% compression ratio
for ~40% performance:
- min match len: 3 -> 4
- max distance: 8KiB -> 1KiB
- hash table type: u32 * -> u64 *
Known bugs:
This implementation currently works fine in general, but breaks with
some payloads used during testing. Investigation ongoing, to be
fixed in a later commit.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
On smb2_async_writev(), set CIFS_COMPRESS_REQ on request flags if
should_compress() returns true.
On smb_send_rqst() check the flags, and compress and send the request to
the server.
(*) If the compression fails with -EMSGSIZE (i.e. compressed size is >=
uncompressed size), the original uncompressed request is sent instead.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Move SMB3.1.1 compression code into experimental config option,
and fix the compress mount option. Implement unchained LZ77
"plain" compression algorithm as per MS-XCA specification
section "2.3 Plain LZ77 Compression Algorithm Details".
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
The cifs_dir_open() has been removed since
commit 737b758c96 ("[PATCH] cifs: character mapping of special
characters (part 3 of 3)"), and now it is useless, so remove it.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use the min() macro to simplify the function and improve
its readability.
Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use ERR_CAST() as it is designed for casting an error pointer to
another type.
This macro uses the __force and __must_check modifiers, which are used
to tell the compiler to check for errors where this macro is used.
Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Explained why the typo was not corrected.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
There are only 4 different definitions between the client and server:
- STATUS_SERVER_UNAVAILABLE: from client/smb2status.h
- STATUS_FILE_NOT_AVAILABLE: from client/smb2status.h
- STATUS_NO_PREAUTH_INTEGRITY_HASH_OVERLAP: from server/smbstatus.h
- STATUS_INVALID_LOCK_RANGE: from server/smbstatus.h
Rename client/smb2status.h to common/smb2status.h, and merge the
two server-specific definitions into the common header file.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In order to maintain the code more easily, move duplicate definitions
to a new common header file.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Preparation for moving acl definitions to new common header file.
Use the following shell command to rename:
find fs/smb/client -type f -exec sed -i \
's/struct cifs_ace/struct smb_ace/g' {} +
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Preparation for moving acl definitions to new common header file.
Use the following shell command to rename:
find fs/smb/client -type f -exec sed -i \
's/struct cifs_acl/struct smb_acl/g' {} +
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Preparation for moving acl definitions to new common header file.
Use the following shell command to rename:
find fs/smb/client -type f -exec sed -i \
's/struct cifs_sid/struct smb_sid/g' {} +
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Preparation for moving acl definitions to new common header file.
Use the following shell command to rename:
find fs/smb/client -type f -exec sed -i \
's/struct cifs_ntsd/struct smb_ntsd/g' {} +
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Ray Zhang reported that ksmbd cannot create a file if the parent filename
is caseless.
Y:\>mkdir A
Y:\>echo 123 >a\b.txt
The system cannot find the path specified.
Y:\>echo 123 >A\b.txt
This patch converts the name obtained by caseless lookup to the parent name.
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Ray Zhang <zhanglei002@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Some file systems may not provide dot (.) and dot-dot (..) as they are
optional in POSIX. ksmbd can misjudge the emptiness of a directory in those
file systems, since it assumes there are always at least two entries:
dot and dot-dot.
Just don't count dot and dot-dot.
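A minimal sketch of the idea, using a generic iterate_dir() actor with hypothetical names (not the actual ksmbd helper):
```
#include <linux/fs.h>
#include <linux/container_of.h>

struct empty_dir_ctx {			/* hypothetical context */
	struct dir_context ctx;
	bool has_entry;
};

static bool empty_dir_actor(struct dir_context *ctx, const char *name,
			    int namlen, loff_t offset, u64 ino,
			    unsigned int d_type)
{
	struct empty_dir_ctx *edc =
		container_of(ctx, struct empty_dir_ctx, ctx);

	/* Don't assume "." and ".." exist; just ignore them if reported. */
	if ((namlen == 1 && name[0] == '.') ||
	    (namlen == 2 && name[0] == '.' && name[1] == '.'))
		return true;		/* keep iterating */

	edc->has_entry = true;		/* found a real entry */
	return false;			/* stop early: directory isn't empty */
}
```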
Cc: stable@vger.kernel.org # v6.1+
Signed-off-by: Hobin Woo <hobin.woo@samsung.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When sending an oplock break request, opinfo->conn is used, but a freed
->conn can be accessed on multichannel.
This patch adds a reference count to the ksmbd_conn struct
so that it is only freed once it is no longer in use.
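A hedged sketch of the general refcounting pattern (simplified stand-in struct and helpers, not the actual ksmbd_conn code):
```
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/container_of.h>

struct conn {				/* simplified stand-in */
	struct kref refcount;
	/* ... transport state ... */
};

static void conn_release(struct kref *kref)
{
	struct conn *c = container_of(kref, struct conn, refcount);

	kfree(c);
}

/* Take a reference before using the connection for an oplock break. */
static struct conn *conn_get(struct conn *c)
{
	kref_get(&c->refcount);
	return c;
}

/* Drop the reference when done; the connection is freed on the last put. */
static void conn_put(struct conn *c)
{
	kref_put(&c->refcount, conn_release);
}
```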
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Windows clients write with FILE_APPEND_DATA when using git.
ksmbd should allow writes with this flag.
Z:\test>git commit -m "test"
fatal: cannot update the ref 'HEAD': unable to append to
'.git/logs/HEAD': Bad file descriptor
Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
We need to migrate data blocks even though it is full, in order to secure
space for zoned device file pinning.
Fixes: 9703d69d9d ("f2fs: support file pinning for zoned devices")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 9651fcedf7 ("mm: add MAP_DROPPABLE for designating always
lazily freeable mappings") only adds VM_DROPPABLE for 64 bits
architectures.
In order to also use the getrandom vDSO implementation on powerpc/32,
use VM_ARCH_1 for VM_DROPPABLE on powerpc/32. This is possible because
VM_ARCH_1 is used for VM_SAO on powerpc and VM_SAO is only for
powerpc/64. It is used in combination with PROT_SAO only in parts of the
code that are restricted to CONFIG_PPC64 through #ifdefs; it is
therefore possible to define VM_SAO for CONFIG_PPC64 only.
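Conceptually (a rough sketch of the flag wiring, not the exact hunk in include/linux/mm.h):
```
/* Sketch only: VM_ARCH_1 carries VM_SAO on ppc64 and VM_DROPPABLE on ppc32. */
#if defined(CONFIG_PPC64)
# define VM_SAO		VM_ARCH_1	/* Strong Access Ordering, ppc64 only */
#elif defined(CONFIG_PPC32)
# define VM_DROPPABLE	VM_ARCH_1	/* reuse the arch-specific bit on ppc32 */
#endif
```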
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Fix the calculation of packet signatures by adding the offset into a page
in the read or write data payload when hashing the pages from it.
Fixes: 39bc58203f ("cifs: Add a function to Hash the contents of an iterator")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Negative i_size is never supported, although crafted images with inodes
having negative i_size will NOT lead to security issues in our current
codebase:
The following image can verify this (gzip+base64 encoded):
H4sICCmk4mYAA3Rlc3QuaW1nAGNgGAWjYBSMVPDo4dcH3jP2aTED2TwMKgxMUHHNJY/SQDQX
LxcDIw3tZwXit44MDNpQ/n8gQJZ/vxjijosPuSyZ0DUDgQqcZoKzVYFsDShbHeh6PT29ktTi
Eqz2g/y2pBFiLxDMh4lhs5+W4TAKRsEoGAWjYBSMglEwCkYBPQAAS2DbowAQAAA=
Explicitly mark such corrupted inodes as bad.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240912083538.3011860-1-hsiangkao@linux.alibaba.com
Error out if the {en,de}coded size of a pcluster is unsupported:
Maximum supported encoded size (of a pcluster): 1 MiB
Maximum supported decoded size (of a pcluster): 12 MiB
Users can still choose to use supported large configurations (e.g.,
for archival purposes), but there may be performance penalties in
low-memory scenarios compared to smaller pclusters.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240912074156.2925394-1-hsiangkao@linux.alibaba.com
This patch aims to allocate bvpages and short-lived compressed pages
from the reserved pool first.
After applying this patch, there are three benefits.
1. It reduces the page allocation time.
The bvpages and short-lived compressed pages account for about 4% of
the pages allocated from the system in the multi-app launch benchmarks
[1]. It reduces the page allocation time accordingly and lowers the
likelihood of blockage by page allocation in low memory scenarios.
2. The pages in the reserved pool will be allocated on demand.
Currently, bvpages and short-lived compressed pages are short-lived
pages allocated from the system, and the pages in the reserved pool all
originate from short-lived pages. Consequently, the number of reserved
pool pages will increase to z_erofs_rsv_nrpages over time.
With this patch, all short-lived pages are allocated from the reserved
pool first, so the number of reserved pool pages will only increase when
there are not enough pages. Thus, even if z_erofs_rsv_nrpages is set to
a large number for specific reasons, the actual number of reserved pool
pages may remain low as per demand. In the multi-app launch benchmarks
[1], z_erofs_rsv_nrpages is set at 256, while the number of reserved
pool pages remains below 64.
3. When erofs cache decompression is disabled
(EROFS_ZIP_CACHE_DISABLED), all pages will *only* be allocated from
the reserved pool for erofs. This will significantly reduce the memory
pressure from erofs.
[1] For additional details on the multi-app launch benchmarks, please
refer to commit 0f6273ab46 ("erofs: add a reserved buffer pool for lz4
decompression").
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240906121110.3701889-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
With iterative development, our codebase can now deal with compressed
buffer misses properly if both in-place I/O and compressed buffer
allocation fail.
Note that if readahead fails (with non-uptodate folios), the original
request will then fall back to synchronous read, and `.read_folio()`
should return appropriate errnos; otherwise -EIO will be passed to
user space, which is unexpected.
To simplify rarely encountered failure paths, a mimic decompression
will just be used. Before that, failure reasons are recorded in
compressed_bvecs[], and they also act as placeholders to avoid in-place
pages. They will be parsed just before decompression and then passed
back to `.read_folio()`.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240905084732.2684515-1-hsiangkao@linux.alibaba.com
There's now no need to support ITER_XARRAY in cifs as netfslib hands down
ITER_FOLIOQ instead - and that's simpler to use with iterate_and_advance()
as it doesn't hold the RCU read lock over the step function.
This is part of the process of phasing out ITER_XARRAY.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-26-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Switch cifs from using an xarray to hold the transport crypto buffer to
using a folio_queue and use ITER_FOLIOQ rather than ITER_XARRAY.
This is part of the process of phasing out ITER_XARRAY.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-25-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace the bespoke cifs iterators of ITER_BVEC and ITER_KVEC to do hashing
with iterate_and_advance_kernel() - a variant on iterate_and_advance() that
only supports kernel-internal ITER_* types and not UBUF/IOVEC types.
The bespoke ITER_XARRAY is left because we don't really want to be calling
crypto_shash_update() under the RCU read lock for large amounts of data;
besides, ITER_XARRAY is going to be phased out.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-24-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Kafs wants to be able to cache the contents of directories (and symlinks),
but whilst these are downloaded from the server with the FS.FetchData RPC
op and similar, the same as for regular files, they can't be updated by
FS.StoreData, but rather have special operations (FS.MakeDir, etc.).
Now, rather than redownloading a directory's content after each change made
to that directory, kafs modifies the local blob. This blob can be saved
out to the cache, and since it's using netfslib, kafs just marks the folios
dirty and lets ->writepages() on the directory take care of it, as for a
regular file.
This is fine as long as there's a cache as although the upload stream is
disabled, there's a cache stream to drive the procedure. But if the cache
goes away in the meantime, suddenly there's no way to do any writes and the
code gets confused, complains "R=%x: No submit" to dmesg and leaves the
dirty folio hanging.
Fix this by just cancelling the store of the folio if neither stream is
active. (If there's no cache at the time of dirtying, we should just not
mark the folio dirty).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-23-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Because it uses DIO writes, cachefiles is unable to make a write to the
backing file if that write is not aligned to and sized according to the
backing file's DIO block alignment. This makes it tricky to handle a write
to the cache where the EOF on the network file is not correctly aligned.
To get around this, netfslib attempts to tell the driver it is calling how
much more data there is available beyond the EOF that it can use to pad the
write (netfslib preclears the part of the folio above the EOF). However,
it tries to tell the cache what the maximum length is, but doesn't
calculate this correctly; and, in any case, cachefiles actually ignores the
value and just skips the block.
Fix this by:
(1) Change the value passed to indicate the amount of extra data that can
be added to the operation (now ->submit_extendable_to). This is much
simpler to calculate as it's just the end of the folio minus the top
of the data within the folio - rather than having to account for data
spread over multiple folios.
(2) Make cachefiles add some of this data if the subrequest it is given
ends at the network file's i_size if the extra data is sufficient to
pad out to a whole block.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-22-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Improve the efficiency of buffered reads in a number of ways:
(1) Overhaul the algorithm in general so that it's a lot more compact and
split the read submission code between buffered and unbuffered
versions. The unbuffered version can be vastly simplified.
(2) Read-result collection is handed off to a work queue rather than being
done in the I/O thread. Multiple subrequests can be processed
simultaneously.
(3) When a subrequest is collected, any folios it fully spans are
collected and "spare" data on either side is donated to either the
previous or the next subrequest in the sequence.
Notes:
(*) Readahead expansion massively slows down fio, presumably because it
causes a load of extra allocations, both folio and xarray, up front
before RPC requests can be transmitted.
(*) RDMA with cifs does appear to work, both with SIW and RXE.
(*) PG_private_2-based reading and copy-to-cache is split out into its own
file and altered to use folio_queue. Note that the copy to the cache
now creates a new write transaction against the cache and adds the
folios to be copied into it. This allows it to use part of the
writeback I/O code.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Perform AFS read subrequests in a work item rather than in the calling
thread. For normal buffered reads, this will allow the calling thread to
copy data from the pagecache to the application at the same time as the
demarshalling thread is shovelling data from skbuffs into the pagecache.
This will also allow the RA mark to trigger a new read before we've
finished shovelling the data from the current one.
Note: This would be a bit safer if the FS.FetchData RPC ops returned the
metadata (including the data version number) before returning the data.
This would allow me to flush the pagecache before installing the new data.
In future, it may be possible to asynchronously flush the pagecache either
side of the region being read.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-19-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use the new folio_queue structures to simplify the writeback code. The
problem with referring to the i_pages xarray directly is that we may have
gaps in the sequence of folios we're writing from that we need to skip when
we're removing the writeback mark from the folios we're writing back from.
At the moment the code tries to deal with this by carefully tracking the
gaps in each writeback stream (eg. write to server and write to cache) and
divining when there's a gap that spans folios (something that's not helped
by folios not being a consistent size).
Instead, the folio_queue buffer contains pointers only to the folios we're
dealing with, has them in ascending order and indicates a gap by placing
non-consecutive folios next to each other. This makes it possible to
track where we need to clean up to by just keeping track of where we've
processed to on each stream and taking the minimum.
Note that the I/O iterator is always rounded up to the end of the folio,
even if that is beyond the EOF position, so that the cache can do DIO from
the page. The excess space is cleared, though mmapped writes clobber it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-18-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Make the netfs write-side routines use the new folio_queue struct to hold a
rolling buffer of folios, with the issuer adding folios at the tail and the
collector removing them from the head as they're processed instead of using
an xarray.
This will allow a subsequent patch to simplify the write collector.
The primary mark (as tested by folioq_is_marked()) is used to note if the
corresponding folio needs putting.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-16-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Make smb_extract_iter_to_rdma() extract page fragments from an ITER_FOLIOQ
iterator into RDMA SGEs.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-15-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that detecting concurrent seeks is done by the filesystems that
require it we can remove f_version and free up 8 bytes for future
extensions.
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-20-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pipes use f_version to defer poll notifications until a write has been
observed. Since multiple files refer to the same struct pipe_inode_info
in their ->private_data, moving it there isn't feasible since we
would need to introduce an additional pointer indirection.
However, since pipes don't require f_pos_lock we placed a new f_pipe
member into a union with f_pos_lock that pipes can use. This is similar
to what we already do for struct inode where we have additional fields
per file type. This will allow us to fully remove f_version in the next
step.
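A hedged sketch of the layout (member names as described above; the real struct file in include/linux/fs.h has many more fields and may differ in detail):
```
struct file {
	/* ... */
	union {
		/* regular files and directories with FMODE_ATOMIC_POS */
		struct mutex	f_pos_lock;
		/* pipes never need f_pos_lock, so they reuse the slot */
		u64		f_pipe;
	};
	loff_t			f_pos;
	/* ... */
};
```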
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-19-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Only regular files with FMODE_ATOMIC_POS and directories need
f_pos_lock. Place a new f_pipe member in a union with f_pos_lock
that they can use and make them stop abusing f_version in follow-up
patches.
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-18-6d3e4816aa7b@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
syzbot reports a f2fs bug as below:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 58 at kernel/rcu/sync.c:177 rcu_sync_dtor+0xcd/0x180 kernel/rcu/sync.c:177
CPU: 1 UID: 0 PID: 58 Comm: kworker/1:2 Not tainted 6.10.0-syzkaller-12562-g1722389b0d86 #0
Workqueue: events destroy_super_work
RIP: 0010:rcu_sync_dtor+0xcd/0x180 kernel/rcu/sync.c:177
Call Trace:
percpu_free_rwsem+0x41/0x80 kernel/locking/percpu-rwsem.c:42
destroy_super_work+0xec/0x130 fs/super.c:282
process_one_work kernel/workqueue.c:3231 [inline]
process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312
worker_thread+0x86d/0xd40 kernel/workqueue.c:3390
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
As Christian Brauner pointed out [1], the root cause is that f2fs sets
the SB_RDONLY flag in an internal function, rather than setting the flag
under the sb->s_umount semaphore via the remount procedure; then the
race condition below causes this bug:
- freeze_super()
- sb_wait_write(sb, SB_FREEZE_WRITE)
- sb_wait_write(sb, SB_FREEZE_PAGEFAULT)
- sb_wait_write(sb, SB_FREEZE_FS)
- f2fs_handle_critical_error
- sb->s_flags |= SB_RDONLY
- thaw_super
- thaw_super_locked
- sb_rdonly() is true, so it skips
sb_freeze_unlock(sb, SB_FREEZE_FS)
- deactivate_locked_super
Since f2fs has almost the same logic as ext4 [2] when handling a critical
error in the filesystem if it is mounted w/ the errors=remount-ro option:
- set the CP_ERROR_FLAG flag which indicates the filesystem is stopped
- record errors to the superblock
- set the SB_RDONLY flag
Once we set the CP_ERROR_FLAG flag, all writable interfaces can detect the
flag and stop any further updates on the filesystem. So, it is safe to not
set the SB_RDONLY flag; let's remove the logic and keep in line w/ ext4 [3].
[1] https://lore.kernel.org/all/20240729-himbeeren-funknetz-96e62f9c7aee@brauner
[2] https://lore.kernel.org/all/20240729132721.hxih6ehigadqf7wx@quack3
[3] https://lore.kernel.org/linux-ext4/20240805201241.27286-1-jack@suse.cz
Fixes: b62e71be21 ("f2fs: support errors=remount-ro|continue|panic mountoption")
Reported-by: syzbot+20d7e439f76bbbd863a7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000b90a8e061e21d12f@google.com/
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We need to introduce a valid block ratio threshold so as not to trigger
excessive GC for zoned devices. Its initial value is 95%, so F2FS
will stop the thread from initiating GC for sections whose valid blocks
exceed the ratio.
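A hedged sketch of the check (hypothetical helper and parameter names; gc_no_zoned_gc_percent is the knob added in the next patch):
```
/* Illustration only, not the exact f2fs GC code. */
static bool should_skip_zoned_gc(unsigned int valid_blocks,
				 unsigned int blocks_per_section,
				 unsigned int gc_no_zoned_gc_percent)
{
	/* Skip sections whose valid block ratio exceeds the threshold (default 95%). */
	return valid_blocks * 100 > blocks_per_section * gc_no_zoned_gc_percent;
}
```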
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added control knobs for gc_no_zoned_gc_percent and
gc_boost_zoned_gc_percent.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Under low free section count, we need to use FG_GC instead of BG_GC to
recover free sections.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For the fine tuning of GC behavior, add reserved_segments sysfs node.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can control the scanning window granularity for GC migration. For
more frequent scanning and GC on zoned devices, we need a fine grained
control knob for it.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since we don't have any GC on the device side for zoned devices, we need
more aggressive BG GC. So, tune the parameters for that.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch addresses the problem that when using LFS mode, unused blocks
may occur in f2fs_map_blocks() during block allocation for dio writes.
If a new section is allocated during block allocation, it will not be
included in the map struct by map_is_mergeable() if the LBA of the
allocated block is not contiguous. However, the block already allocated
in this process will remain unused due to the LFS mode.
This patch avoids the possibility of unused blocks by escaping
f2fs_map_blocks() when allocating the last block in a section.
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Some f2fs ioctl interfaces like f2fs_ioc_set_pin_file(),
f2fs_move_file_range(), and f2fs_defragment_range() missed checking
the atomic_write status, which may cause a potential race issue;
fix it.
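As a hedged sketch of the kind of check being added in each ioctl path (error code and placement are assumptions; f2fs_is_atomic_file() is an existing f2fs helper from fs/f2fs/f2fs.h):
```
/* Sketch only: bail out early if the file is under an atomic write. */
static int reject_if_atomic(struct inode *inode)
{
	if (f2fs_is_atomic_file(inode))
		return -EINVAL;		/* assumed error code for illustration */
	return 0;
}
```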
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
syzbot reports a f2fs bug as below:
kernel BUG at fs/f2fs/inode.c:896!
RIP: 0010:f2fs_evict_inode+0x1598/0x15c0 fs/f2fs/inode.c:896
Call Trace:
evict+0x532/0x950 fs/inode.c:704
dispose_list fs/inode.c:747 [inline]
evict_inodes+0x5f9/0x690 fs/inode.c:797
generic_shutdown_super+0x9d/0x2d0 fs/super.c:627
kill_block_super+0x44/0x90 fs/super.c:1696
kill_f2fs_super+0x344/0x690 fs/f2fs/super.c:4898
deactivate_locked_super+0xc4/0x130 fs/super.c:473
cleanup_mnt+0x41f/0x4b0 fs/namespace.c:1373
task_work_run+0x24f/0x310 kernel/task_work.c:228
ptrace_notify+0x2d2/0x380 kernel/signal.c:2402
ptrace_report_syscall include/linux/ptrace.h:415 [inline]
ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
syscall_exit_work+0xc6/0x190 kernel/entry/common.c:173
syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
syscall_exit_to_user_mode+0x279/0x370 kernel/entry/common.c:218
do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0010:f2fs_evict_inode+0x1598/0x15c0 fs/f2fs/inode.c:896
Online repair of a corrupted directory in f2fs_lookup() can generate
dirty data/meta while racing w/ a readonly remount; it may leave a dirty
inode after the filesystem becomes readonly. However, checkpoint()
skips flushing dirty inodes in readonly mode, resulting in the
above panic.
Let's get rid of online repair in f2fs_lookup() and leave the work
to fsck.f2fs.
Fixes: 510022a858 ("f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries")
Reported-by: syzbot+ebea2790904673d7c618@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000a7b20f061ff2d56a@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Keep the atomic file clean while updating and mark it dirty during commit,
in order to avoid the unnecessary and excessive inode updates caused by the
previous fix.
Fixes: 4bf7832234 ("f2fs: mark inode dirty for FI_ATOMIC_COMMITTED flag")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, where the page
dirty/writeback/unlock are all handled at a different time, not at
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0       16K     32K     48K     64K
|/|     |///////|       |/|
  |                       |
  4K                      52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 immediately, without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have its sectors unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors need to be submitted and thus can
later be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_folio_unlock_writer() already calls
btrfs_folio_end_writer_lock() to do the heavy lifting; the only
missing piece is the 0 writer check.
Thus there is no need to keep two different functions, move the 0 writer
check into btrfs_folio_end_writer_lock(), and remove
btrfs_folio_unlock_writer().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All cleanup paths lead to btrfs_path_free so path can be defined with
the automatic freeing callback in the following functions:
- btrfs_insert_orphan_item()
- btrfs_del_orphan_item()
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All cleanup paths lead to btrfs_path_free so path can be defined with
the automatic freeing callback in the following functions:
- calculate_emulated_zone_size()
- calculate_alloc_pointer()
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a DEFINE_FREE for struct btrfs_path. This defines a function that
can be called using the __free attribute. Define a macro
BTRFS_PATH_AUTO_FREE to make the declaration of an auto freeing path
very clear.
The intended use is to define the auto free of path in cases where the
path is allocated somewhere at the beginning and freed either on all
error paths or at the end of the function.
int func() {
	BTRFS_PATH_AUTO_FREE(path);
	if (...)
		return -ERROR;
	path = alloc_path();
	...
	if (...)
		return -ERROR;
	...
	return 0;
}
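A rough sketch of what the underlying definitions might look like, building on the generic cleanup helpers in <linux/cleanup.h> (details here are assumptions and may differ from the actual patch):
```
#include <linux/cleanup.h>

/* Teach the cleanup machinery how to free a btrfs_path (NULL is a no-op). */
DEFINE_FREE(btrfs_free_path, struct btrfs_path *, btrfs_free_path(_T))

/* Declare a path that is freed automatically when it goes out of scope. */
#define BTRFS_PATH_AUTO_FREE(path_name)					\
	struct btrfs_path *path_name __free(btrfs_free_path) = NULL
```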
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_folio_end_all_writers() is only utilized in
extent_writepage() as a way to unlock all subpage range (for both
successful submission and error handling).
Meanwhile we have a similar function, btrfs_folio_end_writer_lock().
The difference is, btrfs_folio_end_writer_lock() expects a range that is
a subset of the already locked range.
This limit on btrfs_folio_end_writer_lock() is a little overkill,
preventing it from being utilized for error paths.
So here we enhance btrfs_folio_end_writer_lock() to accept a superset of
the locked range, and only end the locked subset.
This means we can replace btrfs_folio_end_all_writers() with
btrfs_folio_end_writer_lock() instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Continue adding const to parameters. This is for clarity and a minor
addition to safety. There are some minor effects in the assembly code
and the .ko, measured on a release config.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently BTRFS_I is a static inline function that takes a const inode
and returns the btrfs inode, dropping the 'const' qualifier. This can break
assumptions of the compiler, though there seems to be no real case.
To make the parameter and return type consistent regarding const, we can
use container_of_const(), which preserves it. However, this would not
check the parameter type. To fix that, use the same _Generic construct
but implement only the two expected types.
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few places that check if we have the inode locked by doing:
ASSERT(inode_is_locked(vfs_inode));
This actually proved to be useful several times as if assertions are
enabled (and by default they are in many distros) it immediately triggers
a crash which is impossible for users to miss.
However that doesn't check if the lock is held by the calling task, so
the check passes if some other task locked the inode.
Using one of the lockdep functions to check the lock is held, like
lockdep_assert_held() for example, does check that the calling task
holds the lock, and if that's not the case it produces a warning and
stack trace in dmesg. However, despite the misleading "assert" in the
name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just
a warning and splat in dmesg, which can easily go unnoticed by users
who may have lockdep enabled.
So add a helper that does the ASSERT() and calls lockdep_assert_held()
immediately after, and use it everywhere we check that the inode is locked.
Like this, if the lock is held by some other task we get the warning
in dmesg, which is caught by fstests, very helpful during development,
and may also be occasionally noticed by users with lockdep enabled.
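A minimal sketch of such a helper (the exact name and location are assumptions; inode_is_locked() checks i_rwsem, which is also what lockdep verifies here):
```
/* ASSERT() crashes if the inode isn't locked at all; lockdep additionally
 * verifies that the *calling* task is the one holding i_rwsem.
 */
static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
{
	ASSERT(inode_is_locked(&inode->vfs_inode));
	lockdep_assert_held(&inode->vfs_inode.i_rwsem);
}
```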
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even in case of failure we could've discarded some data and userspace
should be made aware of it, so copy fstrim_range to userspace
regardless.
Also make sure to update the trimmed bytes amount even if
btrfs_trim_free_extents fails.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, find_or_create_page() is a compatible API, and it can
be replaced with __filemap_get_folio(). Some interfaces have been converted
to use folio before, so the conversion from page can be eliminated here.
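As a hedged before/after sketch with generic names (not the specific btrfs call sites):
```
#include <linux/pagemap.h>

/* before: find_or_create_page() returns a locked page or NULL */
static struct page *get_page_old(struct address_space *mapping, pgoff_t index)
{
	return find_or_create_page(mapping, index, GFP_NOFS);
}

/* after: __filemap_get_folio() returns a locked folio or an ERR_PTR() */
static struct folio *get_folio_new(struct address_space *mapping, pgoff_t index)
{
	return __filemap_get_folio(mapping, index,
				   FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
				   GFP_NOFS);
}
```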
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can be
directly used in folio without converting to page.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
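A hedged sketch of the equivalent conversion (hypothetical helper and offsets, not the actual call sites):
```
#include <linux/highmem.h>

/* Copy data into a folio and zero the rest, using folio-native helpers. */
static void fill_folio(struct folio *folio, const char *src, size_t len)
{
	/* was: memcpy_to_page(&folio->page, 0, src, len); */
	memcpy_to_folio(folio, 0, src, len);

	/* was: memzero_page(&folio->page, len, PAGE_SIZE - len);
	 * there is no memzero_folio(), but folio_zero_range() is equivalent.
	 */
	folio_zero_range(folio, len, folio_size(folio) - len);
}
```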
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And page_to_inode() can be replaced with folio_to_inode() now.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use folio_pos() instead of page_offset(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use folio_pos() instead of page_offset(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use kmap_local_folio() instead of kmap_local_page(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>