* Chandan Babu will be taking over as the XFS release manager. He has
reviewed all the patches that are in this branch, though I'm signing
the branch one last time since I'm still technically maintainer. :P
* Create a maintainer entry profile for XFS in which we lay out the
various roles that I have played for many years. Aside from release
manager, the remaining roles are as yet unfilled.
* Start merging online repair -- we now have in-memory pageable memory
for staging btrees, a bunch of pending fixes, and we've started the
process of refactoring the scrub support code to support more of
repair. In particular, reaping of old blocks from damaged structures.
* Scrub the realtime summary file.
* Fix a bug where scrub's quota iteration only ever returned the root
dquot. Oooops.
* Fix some typos.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZOQE2AAKCRBKO3ySh0YR
pvmZAQDe+KceaVx6Dv2f9ihckeS2dILSpDTo1bh9BeXnt005VwD/ceHTaJxEl8lp
u/dixFDkRgp9RYtoTAK2WNiUxYetsAc=
=oZN6
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Chandan Babu:
- Chandan Babu will be taking over as the XFS release manager. He has
reviewed all the patches that are in this branch, though I'm signing
the branch one last time since I'm still technically maintainer. :P
- Create a maintainer entry profile for XFS in which we lay out the
various roles that I have played for many years. Aside from release
manager, the remaining roles are as yet unfilled.
- Start merging online repair -- we now have in-memory pageable memory
for staging btrees, a bunch of pending fixes, and we've started the
process of refactoring the scrub support code to support more of
repair. In particular, reaping of old blocks from damaged structures.
- Scrub the realtime summary file.
- Fix a bug where scrub's quota iteration only ever returned the root
dquot. Oooops.
- Fix some typos.
[ Pull request from Chandan Babu, but signed tag and description from
Darrick Wong, thus the first person singular above is Darrick, not
Chandan ]
* tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (37 commits)
fs/xfs: Fix typos in comments
xfs: fix dqiterate thinko
xfs: don't check reflink iflag state when checking cow fork
xfs: simplify returns in xchk_bmap
xfs: rewrite xchk_inode_is_allocated to work properly
xfs: hide xfs_inode_is_allocated in scrub common code
xfs: fix agf_fllast when repairing an empty AGFL
xfs: allow userspace to rebuild metadata structures
xfs: clear pagf_agflreset when repairing the AGFL
xfs: allow the user to cancel repairs before we start writing
xfs: don't complain about unfixed metadata when repairs were injected
xfs: implement online scrubbing of rtsummary info
xfs: always rescan allegedly healthy per-ag metadata after repair
xfs: move the realtime summary file scrubber to a separate source file
xfs: wrap ilock/iunlock operations on sc->ip
xfs: get our own reference to inodes that we want to scrub
xfs: track usage statistics of online fsck
xfs: improve xfarray quicksort pivot
xfs: create scaffolding for creating debugfs entries
xfs: cache pages used for xfarray quicksort convergence
...
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support tracking
KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the GENERIC_IOREMAP
ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency improvements
("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation, from
Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code ("Two
minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table range
API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM subsystem
documentation ("Improve mm documentation").
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
=Afws
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").
* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...
* Make large writes to the page cache fill sparse parts of the cache
with large folios, then use large memcpy calls for the large folio.
* Track the per-block dirty state of each large folio so that a
buffered write to a single byte on a large folio does not result in a
(potentially) multi-megabyte writeback IO.
* Allow some directio completions to be performed in the initiating
task's context instead of punting through a workqueue. This will
reduce latency for some io_uring requests.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZM0Z1AAKCRBKO3ySh0YR
pp7BAQCzkKejCM0185tNIH/faHjzidSisNQkJ5HoB4Opq9U66AEA6IPuAdlPlM/J
FPW1oPq33Yn7AV4wXjUNFfDLzVb/Fgg=
=dFBU
-----END PGP SIGNATURE-----
Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"We've got some big changes for this release -- I'm very happy to be
landing willy's work to enable large folios for the page cache for
general read and write IOs when the fs can make contiguous space
allocations, and Ritesh's work to track sub-folio dirty state to
eliminate the write amplification problems inherent in using large
folios.
As a bonus, io_uring can now process write completions in the caller's
context instead of bouncing through a workqueue, which should reduce
io latency dramatically. IOWs, XFS should see a nice performance bump
for both IO paths.
Summary:
- Make large writes to the page cache fill sparse parts of the cache
with large folios, then use large memcpy calls for the large folio.
- Track the per-block dirty state of each large folio so that a
buffered write to a single byte on a large folio does not result in
a (potentially) multi-megabyte writeback IO.
- Allow some directio completions to be performed in the initiating
task's context instead of punting through a workqueue. This will
reduce latency for some io_uring requests"
* tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
iomap: support IOCB_DIO_CALLER_COMP
io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
fs: add IOCB flags related to passing back dio completions
iomap: add IOMAP_DIO_INLINE_COMP
iomap: only set iocb->private for polled bio
iomap: treat a write through cache the same as FUA
iomap: use an unsigned type for IOMAP_DIO_* defines
iomap: cleanup up iomap_dio_bio_end_io()
iomap: Add per-block dirty state tracking to improve performance
iomap: Allocate ifs in ->write_begin() early
iomap: Refactor iomap_write_delalloc_punch() function out
iomap: Use iomap_punch_t typedef
iomap: Fix possible overflow condition in iomap_write_delalloc_scan
iomap: Add some uptodate state handling helpers for ifs state bitmap
iomap: Drop ifs argument from iomap_set_range_uptodate()
iomap: Rename iomap_page to iomap_folio_state and others
iomap: Copy larger chunks from userspace
iomap: Create large folios in the buffered write path
filemap: Allow __filemap_get_folio to allocate large folios
filemap: Add fgf_t typedef
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXpbgAKCRCRxhvAZXjc
oi8PAQCtXelGZHmTcmevsO8p4Qz7hFpkonZ/TnxKf+RdnlNgPgD+NWi+LoRBpaAj
xk4z8SqJaTTP4WXrG5JZ6o7EQkUL8gE=
=2e9I
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull superblock updates from Christian Brauner:
"This contains the super rework that was ready for this cycle. The
first part changes the order of how we open block devices and allocate
superblocks, contains various cleanups, simplifications, and a new
mechanism to wait on superblock state changes.
This unblocks work to ultimately limit the number of writers to a
block device. Jan has already scheduled follow-up work that will be
ready for v6.7 and allows us to restrict the number of writers to a
given block device. That series builds on this work right here.
The second part contains filesystem freezing updates.
Overview:
The generic superblock changes are rougly organized as follows
(ignoring additional minor cleanups):
(1) Removal of the bd_super member from struct block_device.
This was a very odd back pointer to struct super_block with
unclear rules. For all relevant places we have other means to get
the same information so just get rid of this.
(2) Simplify rules for superblock cleanup.
Roughly, everything that is allocated during fs_context
initialization and that's stored in fs_context->s_fs_info needs
to be cleaned up by the fs_context->free() implementation before
the superblock allocation function has been called successfully.
After sget_fc() returned fs_context->s_fs_info has been
transferred to sb->s_fs_info at which point sb->kill_sb() if
fully responsible for cleanup. Adhering to these rules means that
cleanup of sb->s_fs_info in fill_super() is to be avoided as it's
brittle and inconsistent.
Cleanup shouldn't be duplicated between sb->put_super() as
sb->put_super() is only called if sb->s_root has been set aka
when the filesystem has been successfully born (SB_BORN). That
complexity should be avoided.
This also means that block devices are to be closed in
sb->kill_sb() instead of sb->put_super(). More details in the
lower section.
(3) Make it possible to lookup or create a superblock before opening
block devices
There's a subtle dependency on (2) as some filesystems did rely
on fill_super() to be called in order to correctly clean up
sb->s_fs_info. All these filesystems have been fixed.
(4) Switch most filesystem to follow the same logic as the generic
mount code now does as outlined in (3).
(5) Use the superblock as the holder of the block device. We can now
easily go back from block device to owning superblock.
(6) Export and extend the generic fs_holder_ops and use them as
holder ops everywhere and remove the filesystem specific holder
ops.
(7) Call from the block layer up into the filesystem layer when the
block device is removed, allowing to shut down the filesystem
without risk of deadlocks.
(8) Get rid of get_super().
We can now easily go back from the block device to owning
superblock and can call up from the block layer into the
filesystem layer when the device is removed. So no need to wade
through all registered superblock to find the owning superblock
anymore"
Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/
* tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits)
super: use higher-level helper for {freeze,thaw}
super: wait until we passed kill super
super: wait for nascent superblocks
super: make locking naming consistent
super: use locking helpers
fs: simplify invalidate_inodes
fs: remove get_super
block: call into the file system for ioctl BLKFLSBUF
block: call into the file system for bdev_mark_dead
block: consolidate __invalidate_device and fsync_bdev
block: drop the "busy inodes on changed media" log message
dasd: also call __invalidate_device when setting the device offline
amiflop: don't call fsync_bdev in FDFMTBEG
floppy: call disk_force_media_change when changing the format
block: simplify the disk_force_media_change interface
nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl
xfs use fs_holder_ops for the log and RT devices
xfs: drop s_umount over opening the log and RT devices
ext4: use fs_holder_ops for the log device
ext4: drop s_umount over opening the log device
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
Remove the unnecessary encoding of page order into an enum and pass the
page order directly. That lets us get rid of pe_order().
The switch constructs have to be changed to if/else constructs to prevent
GCC from warning on builds with 3-level page tables where PMD_ORDER and
PUD_ORDER have the same value.
If you are looking at this commit because your driver stopped compiling,
look at the previous commit as well and audit your driver to be sure it
doesn't depend on mmap_lock being held in its ->huge_fault method.
[willy@infradead.org: use "order %u" to match the (non dev_t) style]
Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org
Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Allow the kernel to initiate a freeze of a filesystem. The kernel
and userspace can both hold a freeze on a filesystem at the same
time; the freeze is not lifted until /both/ holders lift it. This
will enable us to fix a longstanding bug in XFS online fsck.
* Use kernel-initated fsfreeze to fix some longstanding false negatives
in onlin fsck of the free space and inode counters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZM0XzQAKCRBKO3ySh0YR
phSCAQD9hQmd9tngbNGos44XthgHDIfVHLQLWLt6lwcD0WNfIgEAwMWKLzI9hi7G
SmX3NWDQBj7kvC96HYizIvdSsdkvHw0=
=ulEr
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXpMgAKCRCRxhvAZXjc
ovFBAP97HEUSf78XXTQehluJgkbSVu208DFC4mCyFA6rRihskQD/Yz0uosr/51zJ
FdUPNg8MNkQCRtqx5LQ7yClNSr9Sxg4=
=uIAe
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.6-merge-3' of ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs online fsck update from Darrick Wong:
New code for 6.6:
* Allow the kernel to initiate a freeze of a filesystem. The kernel
and userspace can both hold a freeze on a filesystem at the same
time; the freeze is not lifted until /both/ holders lift it. This
will enable us to fix a longstanding bug in XFS online fsck.
* Use kernel-initated fsfreeze to fix some longstanding false negatives
in online fsck of the free space and inode counters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Message-Id: <20230822182604.GB11286@frogsfrogsfrogs>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Delete duplicate word "the"
[chandan: Fix mangled patch]
Signed-off-by: Zizhen Pang <pangzizhen001@208suo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
For some unknown reason, when I converted the incore dquot objects to
store the dquot id in host endian order, I removed the increment here.
This causes the scan to stop after retrieving the root dquot, which
severely limits the usefulness of the quota scrubber. Fix the lost
increment, though it won't fix the problem that the quota iterator code
filters out zeroed dquot records.
Fixes: c51df73341 ("xfs: stop using q_core.d_id in the quota code")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Use the generic fs_holder_ops to shut down the file system when the
log or RT device goes away instead of duplicating the logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Message-Id: <20230802154131.2221419-13-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Just like get_tree_bdev needs to drop s_umount when opening the main
device, we need to do the same for the xfs log and RT devices to avoid a
potential lock order reversal with s_unmount for the mark_dead path.
It might be preferable to just drop s_umount over ->fill_super entirely,
but that will require a fairly massive audit first, so we'll do the easy
version here first.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Message-Id: <20230802154131.2221419-12-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The file system type is not a very useful holder as it doesn't allow us
to go back to the actual file system instance. Pass the super_block instead
which is useful when passed back to the file system driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230802154131.2221419-7-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.
Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-11-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that all of the update_time operations are prepared for it, we can
drop the timespec64 argument from the update_time operation. Do that and
remove it from some associated functions like inode_update_time and
inode_needs_update_time.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In later patches we're going to drop the "now" parameter from the
update_time operation. Prepare XFS for this by reworking how it fetches
timestamps and sets them in the inode. Ensure that we update the ctime
even if only S_MTIME is set.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-7-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Any inode on a reflink filesystem can have a cow fork, even if the inode
does not have the reflink iflag set. This happens either because the
inode once had the iflag set but does not now, because we don't free the
incore cow fork until the icache deletes the inode; or because we're
running in alwayscow mode.
Either way, we can collapse both of the xfs_is_reflink_inode calls into
one, and change it to xfs_has_reflink, now that the bmap checker will
return ENOENT if there is no pointer to the incore fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the pointless goto and return code in xchk_bmap, since it only
serves to obscure what's going on in the function. Instead, return
whichever error code is appropriate there. For nonexistent forks,
this should have been ENOENT.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Back in the mists of time[1], I proposed this function to assist the
inode btree scrubbers in checking the inode btree contents against the
allocation state of the inode records. The original version performed a
direct lookup in the inode cache and returned the allocation status if
the cached inode hadn't been reused and wasn't in an intermediate state.
Brian thought it would be better to use the usual iget/irele mechanisms,
so that was changed for the final version.
Unfortunately, this hasn't aged well -- the IGET_INCORE flag only has
one user and clutters up the regular iget path, which makes it hard to
reason about how it actually works. Worse yet, the inode inactivation
series silently broke it because iget won't return inodes that are
anywhere in the inactivation machinery, even though the caller is
already required to prevent inode allocation and freeing. Inodes in the
inactivation machinery are still allocated, but the current code's
interactions with the iget code prevent us from being able to say that.
Now that I understand the inode lifecycle better than I did in early
2017, I now realize that as long as the cached inode hasn't been reused
and isn't actively being reclaimed, it's safe to access the i_mode field
(with the AGI, rcu, and i_flags locks held), and we don't need to worry
about the inode being freed out from under us.
Therefore, port the original version to modern code structure, which
fixes the brokennes w.r.t. inactivation. In the next patch we'll remove
IGET_INCORE since it's no longer necessary.
[1] https://lore.kernel.org/linux-xfs/149643868294.23065.8094890990886436794.stgit@birch.djwong.org/
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This function is only used by online fsck, so let's move it there.
In the next patch, we'll fix it to work properly and to require that the
caller hold the AGI buffer locked. No major changes aside from
adjusting the signature a bit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs/139 with parent pointers enabled occasionally pops up a corruption
message when online fsck force-rebuild repairs an AGFL:
XFS (sde): Metadata corruption detected at xfs_agf_verify+0x11e/0x220 [xfs], xfs_agf block 0x9e0001
XFS (sde): Unmount and run xfs_repair
XFS (sde): First 128 bytes of corrupted metadata buffer:
00000000: 58 41 47 46 00 00 00 01 00 00 00 4f 00 00 40 00 XAGF.......O..@.
00000010: 00 00 00 01 00 00 00 02 00 00 00 05 00 00 00 01 ................
00000020: 00 00 00 01 00 00 00 01 00 00 00 00 ff ff ff ff ................
00000030: 00 00 00 00 00 00 00 05 00 00 00 05 00 00 00 00 ................
00000040: 91 2e 6f b1 ed 61 4b 4d 8c 9b 6e 87 08 bb f6 36 ..o..aKM..n....6
00000050: 00 00 00 01 00 00 00 01 00 00 00 06 00 00 00 01 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
The root cause of this failure is that prior to the repair, there were
zero blocks in the AGFL. This scenario is set up by the test case, since
it formats with 64MB AGs and tries to ENOSPC the whole filesystem. In
this case of flcount==0, we reset fllast to -1U, which then trips the
write verifier's check that fllast is less than xfs_agfl_size().
Correct this code to set fllast to the last possible slot in the AGFL
when flcount is zero, which mirrors the behavior of xfs_repair phase5
when it has to create a totally empty AGFL.
Fixes: 0e93d3f43e ("xfs: repair the AGFL")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Clear the pagf_agflreset flag when we're repairing the AGFL because we
fix all the same padding problems that xfs_agfl_reset does.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a new (superuser-only) flag to the online metadata repair ioctl to
force it to rebuild structures, even if they're not broken. We will use
this to move metadata structures out of the way during a free space
defragmentation operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
While debugging other parts of online repair, I noticed that if someone
injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of
metadata, and the metadata repair fails, we'll log a message about
uncorrected errors in the filesystem.
This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT
and we're only doing the repair because the error injection knob is set.
Repair functions are allowed to abort the entire operation at any point
before committing new metadata, in which case the piece of metadata is
in the same state as it was before. Therefore, the log message should
be gated on the results of the scrub. Refactor the predicate and
rearrange the code flow to make this happen.
Note: If the repair function errors out after it commits the new
metadata, the transaction cancellation will shut down the filesystem,
which is an obvious sign of corrupt metadata.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
All online repair functions have the same structure: walk filesystem
metadata structures gathering enough data to rebuild the structure,
stage a new copy, and then commit the new copy.
The gathering steps do not write anything to disk, so they are peppered
with xchk_should_terminate calls to avoid softlockup warnings and to
provide an opportunity to abort the repair (by killing xfs_scrub).
However, it's not clear in the code base when is the last chance to
abort cleanly without having to undo a bunch of structure.
Therefore, add one more call to xchk_should_terminate (along with a
comment) providing the sysadmin with the ability to abort before it's
too late and to make it clear in the source code when it's no longer
convenient or safe to abort a repair. As there are only four repair
functions right now, this patch exists more to establish a precedent for
subsequent additions than to deliver practical functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
After an online repair function runs for a per-AG metadata structure,
sc->sick_mask is supposed to reflect the per-AG metadata that the repair
function fixed. Our next move is to re-check the metadata to assess
the completeness of our repair, so we don't want the rebuilt structure
to be excluded from the rescan just because the health system previously
logged a problem with the data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Finish the realtime summary scrubber by adding the functions we need to
compute a fresh copy of the rtsummary info and comparing it to the copy
on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Move the realtime summary file checking code to a separate file in
preparation to actually implement it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Scrub tracks the resources that it's holding onto in the xfs_scrub
structure. This includes the inode being checked (if applicable) and
the inode lock state of that inode. Replace the open-coded structure
manipulation with a trivial helper to eliminate sources of error.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When we want to scrub a file, get our own reference to the inode
unconditionally. This will make disposal rules simpler in the long run.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Track the usage, outcomes, and run times of the online fsck code, and
report these values via debugfs. The columns in the file are:
* scrubber name
* number of scrub invocations
* clean objects found
* corruptions found
* optimizations found
* cross referencing failures
* inconsistencies found during cross referencing
* incomplete scrubs
* warnings
* number of time scrub had to retry
* cumulative amount of time spent scrubbing (microseconds)
* number of repair inovcations
* successfully repaired objects
* cumuluative amount of time spent repairing (microseconds)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Set up debugfs directories for xfs as a whole, and a subdirectory for
each mounted filesystem. This will enable the creation of debugfs files
in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we have the means to do insertion sorts of small in-memory
subsets of an xfarray, use it to improve the quicksort pivot algorithm
by reading 7 records into memory and finding the median of that. This
should prevent bad partitioning when a[lo] and a[hi] end up next to each
other in the final sort, which can happen when sorting for cntbt repair
when the free space is extremely fragmented (e.g. generic/176).
This doesn't speed up the average quicksort run by much, but it will
(hopefully) avoid the quadratic time collapse for which quicksort is
famous.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
After quicksort picks a pivot item for a particular subsort, it walks
the records in that subset from the outside in, rearranging them so that
every record less than the pivot comes before it, and every record
greater than the pivot comes after it. This scan has a lot of locality,
so we can speed it up quite a bit by grabbing the xfile backing page and
holding onto it as long as we possibly can. Doing so reduces the
runtime by another 5% on the author's computer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If all the records in an xfarray subset live within the same memory
page, we can short-circuit even more quicksort recursion by mapping that
page into the local CPU and using the kernel's heapsort function to sort
the subset. On the author's computer, this reduces the runtime by
another 15% on a 500,000 element array.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Certain xfile array operations (such as sorting) can be sped up quite a
bit by allowing xfile users to grab a page to bulk-read the records
contained within it. Create helper methods to facilitate this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In the previous patch, we created a very basic quicksort implementation
for xfile arrays. While the use of an alternate sorting algorithm to
avoid quicksort recursion on very small subsets reduces the runtime
modestly, we could do better than a load and store-heavy insertion sort,
particularly since each load and store requires a page mapping lookup in
the xfile.
For a small increase in kernel memory requirements, we could instead
bulk load the xfarray records into memory, use the kernel's existing
heapsort implementation to sort the records, and bulk store the memory
buffer back into the xfile. On the author's computer, this reduces the
runtime by about 5% on a 500,000 element array.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The btree bulk loading code requires that records be provided in the
correct record sort order for the given btree type. In general, repair
code cannot be required to collect records in order, and it is not
feasible to insert new records in the middle of an array to maintain
sort order.
Implement a sorting algorithm so that we can sort the records just prior
to bulk loading. In principle, an xfarray could consume many gigabytes
of memory and its backing pages can be sent out to disk at any time.
This means that we cannot map the entire array into memory at once, so
we must find a way to divide the work into smaller portions (e.g. a
page) that /can/ be mapped into memory.
Quicksort seems like a reasonable fit for this purpose, since it uses a
divide and conquer strategy to keep its average runtime logarithmic.
The solution presented here is a port of the glibc implementation, which
itself is derived from the median-of-three and tail call recursion
strategies outlined by Sedgwick.
Subsequent patches will optimize the implementation further by utilizing
the kernel's heapsort on directly-mapped memory whenever possible, and
improving the quicksort pivot selection algorithm to try to avoid O(n^2)
collapses.
Note: The sorting functionality gets its own patch because the basic big
array mechanisms were plenty for a single code patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort.
Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently lead to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.
Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages. Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The AGFL repair code uses a series of bitmaps to figure out where there
are OWN_AG blocks that are not claimed by the free space and rmap
btrees. These blocks become the new AGFL, and any overflow is reaped.
The bitmaps current track xfs_fsblock_t even though we already know the
AG number.
In the last patch, we introduced a new bitmap "type" for tracking
xfs_agblock_t extents. Port the reaping code and the AGFL repair to use
this new type, which makes it very obvious what we're tracking. This
also eliminates a bunch of unnecessary agblock <-> fsblock conversions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When we're freeing extents that have been set in a bitmap, break the
bitmap extent into multiple sub-extents organized by fate, and reap the
extents. This enables us to dispose of old resources more efficiently
than doing them block by block.
While we're at it, rename the reaping functions to make it clear that
they're reaping per-AG extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
After an online repair, we need to invalidate buffers representing the
blocks from the old metadata that we're replacing. It's possible that
parts of a tree that were previously cached in memory are no longer
accessible due to media failure or other corruption on interior nodes,
so repair figures out the old blocks from the reverse mapping data and
scans the buffer cache directly.
In other words, online fsck needs to find all the live (i.e. non-stale)
buffers for a range of fsblocks so that it can invalidate them.
Unfortunately, the current buffer cache code triggers asserts if the
rhashtable lookup finds a non-stale buffer of a different length than
the key we searched for. For regular operation this is desirable, but
for this repair procedure, we don't care since we're going to forcibly
stale the buffer anyway. Add an internal lookup flag to avoid the
assert. Skip buffers that are already XBF_STALE.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Rearrange the logic inside xrep_reap_block to make it more obvious that
crosslinked metadata blocks are handled differently. Add a couple of
tracepoints so that we can tell what's going on at the end of a btree
rebuild operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Use deferred frees (EFIs) to reap the blocks of a btree that we just
replaced. This helps us to shrink the window in which those old blocks
could be lost due to a system crash, though we try to flush the EFIs
every few hundred blocks so that we don't also overflow the transaction
reservations during and after we commit the new btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we've refactored btree cursors to require the caller to pass in
a perag structure, there are numerous problems in xrep_reap_extents if
it's being called to reap extents for an inode metadata repair. We
don't have any repair functions that can do that, so drop the support
for now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When we're discarding old btree blocks after a repair, only invalidate
the buffers for the ones that we're freeing -- if the metadata was
crosslinked with another data structure, we don't want to touch it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reaping blocks after a repair is a complicated affair involving a lot of
rmap btree lookups and figuring out if we're going to unmap or free old
metadata blocks that might be crosslinked. Eventually, we will need to
be able to reap per-AG metadata blocks, bmbt blocks from inode forks,
garbage CoW staging extents, and (even later) blocks from btrees rooted
in inodes. This results in a lot of reaping code, so we might as well
split that off while it's easy.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
These two functions date from the era when I thought that we could
rebuild btrees by creating an alternate root and adding records one by
one. In other words, they predate the btree bulk loader. They're not
necessary now, so remove them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Copy and paste the commit message from Darrick into a comment to explain
the seemingly odd invalidate_bdev in xfs_shutdown_devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Message-Id: <20230809220545.1308228-8-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
blkdev_put must not be called under sb->s_umount to avoid a lock order
reversal with disk->open_mutex. Move closing the buftargs into ->kill_sb
to archive that. Note that the flushing of the disk caches and
block device mapping invalidated needs to stay in ->put_super as the main
block device is closed in kill_block_super already.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Message-Id: <20230809220545.1308228-7-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Closing the block devices logically belongs into xfs_free_buftarg, So
instead of open coding it in the caller move it there and add a check
for the s_bdev so that the main device isn't close as that's done by the
VFS helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Message-Id: <20230809220545.1308228-6-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>