Commit Graph

8478 Commits

Author SHA1 Message Date
Darrick J. Wong
20a3c1ecc3 xfs: refactor live buffer invalidation for repairs
In an upcoming patch, we will need to be able to look for xfs_buf
objects caching file-based metadata blocks without needing to walk the
(possibly corrupt) structures to find all the buffers.  Repair already
has most of the code needed to scan the buffer cache, so hoist these
utility functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:48 -07:00
Darrick J. Wong
84c14ee39d xfs: create temporary files and directories for online repair
Teach the online repair code how to create temporary files or
directories.  These temporary files can be used to stage reconstructed
information until we're ready to perform an atomic extent swap to commit
the new metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:48 -07:00
Darrick J. Wong
cab23a4233 xfs: hide private inodes from bulkstat and handle functions
We're about to start adding functionality that uses internal inodes that
are private to XFS.  What this means is that userspace should never be
able to access any information about these files, and should not be able
to open these files by handle.

To prevent users from ever finding the file or mis-interactions with the
security apparatus, set S_PRIVATE on the inode.  Don't allow bulkstat,
open-by-handle, or linking of S_PRIVATE files into the directory tree.
This should keep private inodes actually private.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:58:48 -07:00
Darrick J. Wong
0730e8d8ba xfs: enable logged file mapping exchange feature
Add the XFS_SB_FEAT_INCOMPAT_EXCHRANGE feature to the set of features
that we will permit when mounting a filesystem.  This turns on support
for the file range exchange feature.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:26 -07:00
Darrick J. Wong
14f1999102 xfs: capture inode generation numbers in the ondisk exchmaps log item
Per some very late review comments, capture the generation numbers of
both inodes involved in a file content exchange operation so that we
don't accidentally target files with have been reallocated.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:24 -07:00
Darrick J. Wong
b3e60f8483 xfs: support non-power-of-two rtextsize with exchange-range
The generic exchange-range alignment checks use (fast) bitmasking
operations to perform block alignment checks on the exchange parameters.
Unfortunately, bitmasks require that the alignment size be a power of
two.  This isn't true for realtime devices with a non-power-of-two
extent size, so we have to copy-pasta the generic checks using long
division for this to work properly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:23 -07:00
Darrick J. Wong
e62941103f xfs: make file range exchange support realtime files
Now that bmap items support the realtime device, we can add the
necessary pieces to the file range exchange code to support exchanging
mappings.  All we really need to do here is adjust the blockcount
upwards to the end of the rt extent and remove the inode checks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:22 -07:00
Darrick J. Wong
33a9be2b70 xfs: condense symbolic links after a mapping exchange operation
The previous commit added a new file mapping exchange flag that enables
us to perform post-exchange processing on file2 once we're done
exchanging the extent mappings.  Now add this ability for symlinks.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online symlink repair feature can
salvage the remote target in a temporary link and exchange the data fork
mappings when ready.  If one file is in extents format and the other is
inline, we will have to promote both to extents format to perform the
exchange.  After the exchange, we can try to condense the fixed symlink
down to inline format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:21 -07:00
Darrick J. Wong
da165fbde2 xfs: condense directories after a mapping exchange operation
The previous commit added a new file mapping exchange flag that enables
us to perform post-swap processing on file2 once we're done exchanging
extent mappings.  Now add this ability for directories.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online directory repair feature can
create salvaged dirents in a temporary directory and exchange the data
fork mappings when ready.  If one file is in extents format and the
other is inline, we will have to promote both to extents format to
perform the exchange.  After the exchange, we can try to condense the
fixed directory down to inline format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:20 -07:00
Darrick J. Wong
497d7a2608 xfs: condense extended attributes after a mapping exchange operation
Add a new file mapping exchange flag that enables us to perform
post-exchange processing on file2 once we're done exchanging the extent
mappings.  If we were swapping mappings between extended attribute
forks, we want to be able to convert file2's attr fork from block to
inline format.

(This implies that all fork contents are exchanged.)

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online xattr repair feature can create
salvaged attrs in a temporary file and exchange the attr fork mappings
when ready.  If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the exchange.
After the exchange, we can try to condense the fixed file's attr fork
back down to inline format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:20 -07:00
Darrick J. Wong
5fd022ec7d xfs: add error injection to test file mapping exchange recovery
Add an errortag so that we can test recovery of exchmaps log items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:19 -07:00
Darrick J. Wong
42672471f9 xfs: bind together the front and back ends of the file range exchange code
So far, we've constructed the front end of the file range exchange code
that does all the checking; and the back end of the file mapping
exchange code that actually does the work.  Glue these two pieces
together so that we can turn on the functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:18 -07:00
Darrick J. Wong
966ceafc7a xfs: create deferred log items for file mapping exchanges
Now that we've created the skeleton of a log intent item to track and
restart file mapping exchange operations, add the upper level logic to
commit intent items and turn them into concrete work recorded in the
log.  This builds on the existing bmap update intent items that have
been around for a while now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:17 -07:00
Darrick J. Wong
6c08f434bd xfs: introduce a file mapping exchange log intent item
Introduce a new intent log item to handle exchanging mappings between
the forks of two files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:16 -07:00
Darrick J. Wong
1518646eef xfs: create a incompat flag for atomic file mapping exchanges
Create a incompat flag so that we only attempt to process file mapping
exchange log items if the filesystem supports it, and a geometry flag to
advertise support if it's present.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:15 -07:00
Darrick J. Wong
9a64d9b310 xfs: introduce new file range exchange ioctl
Introduce a new ioctl to handle exchanging ranges of bytes
between files.  The goal here is to perform the exchange atomically with
respect to applications -- either they see the file contents before the
exchange or they see that A-B is now B-A, even if the kernel crashes.

My original goal with all this code was to make it so that online repair
can build a replacement directory or xattr structure in a temporary file
and commit the repair by atomically exchanging all the data blocks
between the two files.  However, I needed a way to test this mechanism
thoroughly, so I've been evolving an ioctl interface since then.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:14 -07:00
Darrick J. Wong
15f78aa3eb xfs: constify xfs_bmap_is_written_extent
This predicate doesn't modify the structure that's being passed in, so
we can mark it const.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:12 -07:00
Darrick J. Wong
ac5cebeed6 xfs: refactor non-power-of-two alignment checks
Create a helper function that can compute if a 64-bit number is an
integer multiple of a 32-bit number, where the 32-bit number is not
required to be an even power of two.  This is needed for some new code
for the realtime device, where we can set 37k allocation units and then
have to remap them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:12 -07:00
Darrick J. Wong
6b700a5be9 xfs: hoist multi-fsb allocation unit detection to a helper
Replace the open-coded logic to decide if a file has a multi-fsb
allocation unit to a helper to make the code easier to read.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:11 -07:00
Darrick J. Wong
ee20808d84 xfs: create a new helper to return a file's allocation unit
Create a new helper function to calculate the fundamental allocation
unit (i.e. the smallest unit of space we can allocate) of a file.
Things are going to get hairy with range-exchange on the realtime
device, so prepare for this now.

Remove the static attribute from xfs_is_falloc_aligned since the next
patch will need it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:10 -07:00
Darrick J. Wong
00acb28d96 xfs: declare xfs_file.c symbols in xfs_file.h
Move the two public symbols in xfs_file.c to xfs_file.h.  We're about to
add more public symbols in that source file, so let's finally create the
header file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:09 -07:00
Darrick J. Wong
3fc4844585 xfs: move xfs_iops.c declarations out of xfs_inode.h
Similarly, move declarations of public symbols of xfs_iops.c from
xfs_inode.h to xfs_iops.h.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:08 -07:00
Darrick J. Wong
a4db266a70 xfs: move inode lease breaking functions to xfs_inode.c
The lease breaking functions operate at the scope of the entire VFS
inode, not subranges of a file.  Move them to xfs_inode.c since they're
already declared in xfs_inode.h.  This cleanup moves us closer to
having xfs_FOO.h declare only the symbols in xfs_FOO.c.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:07 -07:00
Darrick J. Wong
5302a5c8be xfs: only clear log incompat flags at clean unmount
While reviewing the online fsck patchset, someone spied the
xfs_swapext_can_use_without_log_assistance function and wondered why we
go through this inverted-bitmask dance to avoid setting the
XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT feature.

(The same principles apply to the logged extended attribute update
feature bit in the since-merged LARP series.)

The reason for this dance is that xfs_add_incompat_log_feature is an
expensive operation -- it forces the log, pushes the AIL, and then if
nobody's beaten us to it, sets the feature bit and issues a synchronous
write of the primary superblock.  That could be a one-time cost
amortized over the life of the filesystem, but the log quiesce and cover
operations call xfs_clear_incompat_log_features to remove feature bits
opportunistically.  On a moderately loaded filesystem this leads to us
cycling those bits on and off over and over, which hurts performance.

Why do we clear the log incompat bits?  Back in ~2020 I think Dave and I
had a conversation on IRC[2] about what the log incompat bits represent.
IIRC in that conversation we decided that the log incompat bits protect
unrecovered log items so that old kernels won't try to recover them and
barf.  Since a clean log has no protected log items, we could clear the
bits at cover/quiesce time.

As Dave Chinner pointed out in the thread, clearing log incompat bits at
unmount time has positive effects for golden root disk image generator
setups, since the generator could be running a newer kernel than what
gets written to the golden image -- if there are log incompat fields set
in the golden image that was generated by a newer kernel/OS image
builder then the provisioning host cannot mount the filesystem even
though the log is clean and recovery is unnecessary to mount the
filesystem.

Given that it's expensive to set log incompat bits, we really only want
to do that once per bit per mount.  Therefore, I propose that we only
clear log incompat bits as part of writing a clean unmount record.  Do
this by adding an operational state flag to the xfs mount that guards
whether or not the feature bit clearing can actually take place.

This eliminates the l_incompat_users rwsem that we use to protect a log
cleaning operation from clearing a feature bit that a frontend thread is
trying to set -- this lock adds another way to fail w.r.t. locking.  For
the swapext series, I shard that into multiple locks just to work around
the lockdep complaints, and that's fugly.

Link: https://lore.kernel.org/linux-xfs/20240131230043.GA6180@frogsfrogsfrogs/
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2024-04-15 14:54:06 -07:00
Darrick J. Wong
98a778b425 xfs: fix error bailout in xrep_abt_build_new_trees
Dan Carpenter reports:

"Commit 4bdfd7d157 ("xfs: repair free space btrees") from Dec 15,
2023 (linux-next), leads to the following Smatch static checker
warning:

        fs/xfs/scrub/alloc_repair.c:781 xrep_abt_build_new_trees()
        warn: missing unwind goto?"

That's a bug, so let's fix it.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 4bdfd7d157 ("xfs: repair free space btrees")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:06 -07:00
Darrick J. Wong
21ad2d0364 xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory
xfs/399 found the following deadlock when fuzzing core.mode = ones:

/proc/20506/task/20558/stack :
[<0>] xfs_ilock+0xa0/0x240 [xfs]
[<0>] xfs_ilock_data_map_shared+0x1b/0x20 [xfs]
[<0>] xrep_dinode_findmode_walk_directory+0x69/0xe0 [xfs]
[<0>] xrep_dinode_find_mode+0x103/0x2a0 [xfs]
[<0>] xrep_dinode_mode+0x7c/0x120 [xfs]
[<0>] xrep_dinode_core+0xed/0x2b0 [xfs]
[<0>] xrep_dinode_problems+0x10/0x80 [xfs]
[<0>] xrep_inode+0x6c/0xc0 [xfs]
[<0>] xrep_attempt+0x64/0x1d0 [xfs]
[<0>] xfs_scrub_metadata+0x365/0x840 [xfs]
[<0>] xfs_scrubv_metadata+0x282/0x430 [xfs]
[<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs]
[<0>] xfs_file_ioctl+0xc68/0x1780 [xfs]
/proc/20506/task/20559/stack :
[<0>] xfs_buf_lock+0x3b/0x110 [xfs]
[<0>] xfs_buf_find_lock+0x66/0x1c0 [xfs]
[<0>] xfs_buf_get_map+0x208/0xc00 [xfs]
[<0>] xfs_buf_read_map+0x5d/0x2c0 [xfs]
[<0>] xfs_trans_read_buf_map+0x1b0/0x4c0 [xfs]
[<0>] xfs_read_agi+0xbd/0x190 [xfs]
[<0>] xfs_ialloc_read_agi+0x47/0x160 [xfs]
[<0>] xfs_imap_lookup+0x69/0x1f0 [xfs]
[<0>] xfs_imap+0x1fc/0x3d0 [xfs]
[<0>] xfs_iget+0x357/0xd50 [xfs]
[<0>] xchk_dir_actor+0x16e/0x330 [xfs]
[<0>] xchk_dir_walk_block+0x164/0x1e0 [xfs]
[<0>] xchk_dir_walk+0x13a/0x190 [xfs]
[<0>] xchk_directory+0x1a2/0x2b0 [xfs]
[<0>] xfs_scrub_metadata+0x2f4/0x840 [xfs]
[<0>] xfs_scrubv_metadata+0x282/0x430 [xfs]
[<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs]
[<0>] xfs_file_ioctl+0xc68/0x1780 [xfs]

Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the
root directory.  Thread 20559 holds the root directory ILOCK and is
trying to grab the AGI of an inode that is one of the root directory's
children.  The AGI held by 20558 is the same buffer that 20559 is trying
to acquire.  In other words, this is an ABBA deadlock.

In general, the lock order is ILOCK and then AGI -- rename does this
while preparing for an operation involving whiteouts or renaming files
out of existence; and unlink does this when moving an inode to the
unlinked list.  The only place where we do it in the opposite order is
on the child during an icreate, but at that point the child is marked
INEW and is not visible to other threads.

Work around this deadlock by replacing the blocking ilock attempt with a
nonblocking loop that aborts after 30 seconds.  Relax for a jiffy after
a failed lock attempt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:05 -07:00
Darrick J. Wong
2afd5276d3 xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode
While reviewing the next patch which fixes an ABBA deadlock between the
AGI and a directory ILOCK, someone asked a question about why we're
holding the AGI in the first place.  The reason for that is to quiesce
the inode structures for that AG while we do a repair.

I then realized that the xrep_dinode_findmode invokes xchk_iscan_iter,
which walks the inobts (and hence the AGIs) to find all the inodes.
This itself is also an ABBA vector, since the damaged inode could be in
AG 5, which we hold while we scan AG 0 for directories.  5 -> 0 is not
allowed.

To address this, modify the iscan to allow trylock of the AGI buffer
using the flags argument to xfs_ialloc_read_agi that the previous patch
added.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:04 -07:00
Darrick J. Wong
549d3c9a29 xfs: pass xfs_buf lookup flags to xfs_*read_agi
Allow callers to pass buffer lookup flags to xfs_read_agi and
xfs_ialloc_read_agi.  This will be used in the next patch to fix a
deadlock in the online fsck inode scanner.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15 14:54:03 -07:00
Linus Torvalds
9520c192e8 Bug fixes for 6.9-rc3:
* Allow creating new links to special files which were not associated with a
    project quota.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZgwRrQAKCRAH7y4RirJu
 9OtyAP4m8cXLi+fjRslGLNhQQXzZHIcpaPiWZ9Ec41Y3uzZNBQD/doS6P4aGcH0m
 taYQ+nyzuavEZiOEg+d65OoUIrDZzg4=
 =bgjU
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Chandan Babu:

 - Allow creating new links to special files which were not associated
   with a project quota

* tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: allow cross-linking special files without project quota
2024-04-06 09:14:18 -07:00
Linus Torvalds
fae0268777 vfs-6.9-rc3.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZg/C8wAKCRCRxhvAZXjc
 oljxAQCneq62ginESgeQLw88fzSBTV4C50xXUA+Qz18AEgA/fgD+J3DlWquEHhMM
 tJmfs3aUn9w7+wDpukcsLjJfJEiSYA8=
 =f2Z6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "This contains a few small fixes. This comes with some delay because I
  wanted to wait on people running their reproducers and the Easter
  Holidays meant that those replies came in a little later than usual:

   - Fix handling of preventing writes to mounted block devices.

     Since last kernel we allow to prevent writing to mounted block
     devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the
     block device is opened with restricted writes. When we switched to
     opening block devices as files we altered the mechanism by which we
     recognize when a block device has been opened with write
     restrictions.

     The detection logic assumed that only read-write mounted
     filesystems would apply write restrictions to their block devices
     from other openers. That of course is not true since it also makes
     sense to apply write restrictions for filesystems that are
     read-only.

     Fix the detection logic using an FMODE_* bit. We still have a few
     left since we freed up a couple a while ago. I also picked up a
     patch to free up four additional FMODE_* bits scheduled for the
     next merge window.

   - Fix counting the number of writers to a block device. This just
     changes the logic to be consistent.

   - Fix a bug in aio causing a NULL pointer derefernce after we
     implemented batched processing in aio.

   - Finally, add the changes we discussed that allows to yield block
     devices early even though file closing itself is deferred.

     This also allows us to remove two holder operations to get and
     release the holder to align lifetime of file and holder of the
     block device"

* tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  aio: Fix null ptr deref in aio_complete() wakeup
  fs,block: yield devices early
  block: count BLK_OPEN_RESTRICT_WRITES openers
  block: handle BLK_OPEN_RESTRICT_WRITES correctly
2024-04-05 09:47:26 -07:00
Andrey Albershteyn
e23d7e82b7 xfs: allow cross-linking special files without project quota
There's an issue that if special files is created before quota
project is enabled, then it's not possible to link this file. This
works fine for normal files. This happens because xfs_quota skips
special files (no ioctls to set necessary flags). The check for
having the same project ID for source and destination then fails as
source file doesn't have any ID.

mkfs.xfs -f /dev/sda
mount -o prjquota /dev/sda /mnt/test

mkdir /mnt/test/foo
mkfifo /mnt/test/foo/fifo1

xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test
> Setting up project 9 (path /mnt/test/foo)...
> xfs_quota: skipping special file /mnt/test/foo/fifo1
> Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1).

ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link
> ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link

mkfifo /mnt/test/foo/fifo2
ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link

Fix this by allowing linking of special files to the project quota
if special files doesn't have any ID set (ID = 0).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-01 11:55:49 +05:30
Christian Brauner
22650a9982
fs,block: yield devices early
Currently a device is only really released once the umount returns to
userspace due to how file closing works. That ultimately could cause
an old umount assumption to be violated that concurrent umount and mount
don't fail. So an exclusively held device with a temporary holder should
be yielded before the filesystem is gone. Add a helper that allows
callers to do that. This also allows us to remove the two holder ops
that Linus wasn't excited about.

Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org
Fixes: f3a608827d ("bdev: open block device as files") # mainline only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-27 13:17:15 +01:00
Dave Chinner
f2e812c152 xfs: don't use current->journal_info
syzbot reported an ext4 panic during a page fault where found a
journal handle when it didn't expect to find one. The structure
it tripped over had a value of 'TRAN' in the first entry in the
structure, and that indicates it tripped over a struct xfs_trans
instead of a jbd2 handle.

The reason for this is that the page fault was taken during a
copy-out to a user buffer from an xfs bulkstat operation. XFS uses
an "empty" transaction context for bulkstat to do automated metadata
buffer cleanup, and so the transaction context is valid across the
copyout of the bulkstat info into the user buffer.

We are using empty transaction contexts like this in XFS to reduce
the risk of failing to release objects we reference during the
operation, especially during error handling. Hence we really need to
ensure that we can take page faults from these contexts without
leaving landmines for the code processing the page fault to trip
over.

However, this same behaviour could happen from any other filesystem
that triggers a page fault or any other exception that is handled
on-stack from within a task context that has current->journal_info
set.  Having a page fault from some other filesystem bounce into XFS
where we have to run a transaction isn't a bug at all, but the usage
of current->journal_info means that this could result corruption of
the outer task's journal_info structure.

The problem is purely that we now have two different contexts that
now think they own current->journal_info. IOWs, no filesystem can
allow page faults or on-stack exceptions while current->journal_info
is set by the filesystem because the exception processing might use
current->journal_info itself.

If we end up with nested XFS transactions whilst holding an empty
transaction, then it isn't an issue as the outer transaction does
not hold a log reservation. If we ignore the current->journal_info
usage, then the only problem that might occur is a deadlock if the
exception tries to take the same locks the upper context holds.
That, however, is not a problem that setting current->journal_info
would solve, so it's largely an irrelevant concern here.

IOWs, we really only use current->journal_info for a warning check
in xfs_vm_writepages() to ensure we aren't doing writeback from a
transaction context. Writeback might need to do allocation, so it
can need to run transactions itself. Hence it's a debug check to
warn us that we've done something silly, and largely it is not all
that useful.

So let's just remove all the use of current->journal_info in XFS and
get rid of all the potential issues from nested contexts where
current->journal_info might get misused by another filesystem
context.

Reported-by: syzbot+cdee56dbcdf0096ef605@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-25 10:21:01 +05:30
Dave Chinner
15922f5dbf xfs: allow sunit mount option to repair bad primary sb stripe values
If a filesystem has a busted stripe alignment configuration on disk
(e.g. because broken RAID firmware told mkfs that swidth was smaller
than sunit), then the filesystem will refuse to mount due to the
stripe validation failing. This failure is triggering during distro
upgrades from old kernels lacking this check to newer kernels with
this check, and currently the only way to fix it is with offline
xfs_db surgery.

This runtime validity checking occurs when we read the superblock
for the first time and causes the mount to fail immediately. This
prevents the rewrite of stripe unit/width via
mount options that occurs later in the mount process. Hence there is
no way to recover this situation without resorting to offline xfs_db
rewrite of the values.

However, we parse the mount options long before we read the
superblock, and we know if the mount has been asked to re-write the
stripe alignment configuration when we are reading the superblock
and verifying it for the first time. Hence we can conditionally
ignore stripe verification failures if the mount options specified
will correct the issue.

We validate that the new stripe unit/width are valid before we
overwrite the superblock values, so we can ignore the invalid config
at verification and fail the mount later if the new values are not
valid. This, at least, gives users the chance of correcting the
issue after a kernel upgrade without having to resort to xfs-db
hacks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-25 10:17:18 +05:30
Linus Torvalds
6f6efce52d Bug fixes for 6.9:
* Fix invalid pointer dereference by initializing xmbuf before tracepoint
   function is invoked.
 * Use memalloc_nofs_save() when inserting into quota radix tree.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZffInwAKCRAH7y4RirJu
 9IyEAP9h8KMNtDRXyBxFe8vCtjoSj7fwwIijWLa/y2NH1oWQowEAk83m14akH1KH
 J0HInearcoRv8L16oe/tcNxxPPBuSQQ=
 =emhS
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Fix invalid pointer dereference by initializing xmbuf before
   tracepoint function is invoked

 - Use memalloc_nofs_save() when inserting into quota radix tree

* tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: quota radix tree allocations need to be NOFS on insert
  xfs: fix dev_t usage in xmbuf tracepoints
2024-03-22 11:12:21 -07:00
Dave Chinner
0c6ca06aad xfs: quota radix tree allocations need to be NOFS on insert
In converting the XFS code from GFP_NOFS to scoped contexts, we
converted the quota radix tree to GFP_KERNEL. Unfortunately, it was
not clearly documented that this set was because there is a
dependency on the quotainfo->qi_tree_lock being taken in memory
reclaim to remove dquots from the radix tree.

In hindsight this is obvious, but the radix tree allocations on
insert are not immediately obvious, and we avoid this for the inode
cache radix trees by using preloading and hence completely avoiding
the radix tree node allocation under tree lock constraints.

Hence there are a few solutions here. The first is to reinstate
GFP_NOFS for the radix tree and add a comment explaining why
GFP_NOFS is used. The second is to use memalloc_nofs_save() on the
radix tree insert context, which makes it obvious that the radix
tree insert runs under GFP_NOFS constraints. The third option is to
simply replace the radix tree and it's lock with an xarray which can
do memory allocation safely in an insert context.

The first is OK, but not really the direction we want to head. The
second is my preferred short term solution. The third - converting
XFS radix trees to xarray - is the longer term solution.

Hence to fix the regression here, we take option 2 as it moves us in
the direction we want to head with memory allocation and GFP_NOFS
removal.

Reported-by: syzbot+8fdff861a781522bda4d@syzkaller.appspotmail.com
Reported-by: syzbot+d247769793ec169e4bf9@syzkaller.appspotmail.com
Fixes: 94a69db236 ("xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-15 10:30:23 +05:30
Darrick J. Wong
215b2bf72a xfs: fix dev_t usage in xmbuf tracepoints
Fix some inconsistencies in the xmbuf tracepoints -- they should be
reporting the major/minor of the filesystem that they're associated
with, so that we have some clue on whose behalf the xmbuf was created.
Fix the xmbuf_free tracepoint to report the same.

Don't call the trace function until the xmbuf is fully initialized.

Fixes: 5076a6040c ("xfs: support in-memory buffer cache target")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-15 10:30:23 +05:30
Linus Torvalds
babbcc0232 New code for 6.9:
* Online Repair;
   ** New ondisk structures being repaired.
      - Inode's mode field by trying to obtain file type value from the a
        directory entry.
      - Quota counters.
      - Link counts of inodes.
      - FS summary counters.
      - rmap btrees.
        Support for in-memory btrees has been added to support repair of rmap
        btrees.
   ** Misc changes
      - Report corruption of metadata to the health tracking subsystem.
      - Enable indirect health reporting when resources are scarce.
      - Reduce memory usage while reparing refcount btree.
      - Extend "Bmap update" intent item to support atomic extent swapping on
        the realtime device.
      - Extend "Bmap update" intent item to support extended attribute fork and
        unwritten extents.
   ** Code cleanups
      - Bmap log intent.
      - Btree block pointer checking.
      - Btree readahead.
      - Buffer target.
      - Symbolic link code.
   * Remove mrlock wrapper around the rwsem.
   * Convert all the GFP_NOFS flag usages to use the scoped
     memalloc_nofs_save() API instead of direct calls with the GFP_NOFS.
   * Refactor and simplify xfile abstraction. Lower level APIs in
     shmem.c are required to be exported in order to achieve this.
   * Skip checking alignment constraints for inode chunk allocations when block
     size is larger than inode chunk size.
   * Do not submit delwri buffers collected during log recovery when an error
     has been encountered.
   * Fix SEEK_HOLE/DATA for file regions which have active COW extents.
   * Fix lock order inversion when executing error handling path during
     shrinking a filesystem.
   * Remove duplicate ifdefs.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZemMkgAKCRAH7y4RirJu
 9ON5AP0Vda6sMn/ZUYoLo9ZUrUvlUb8L0dhEN5JL0XfyWW5ogAD/bH4G6pKSNyTw
 cSEjryuDakirdHLt5g0c+QHd2a/fzw0=
 =ymKk
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Chandan Babu:

 - Online repair updates:
    - More ondisk structures being repaired:
        - Inode's mode field by trying to obtain file type value from
          the a directory entry
        - Quota counters
        - Link counts of inodes
        - FS summary counters
        - Support for in-memory btrees has been added to support repair
          of rmap btrees
    - Misc changes:
        - Report corruption of metadata to the health tracking subsystem
        - Enable indirect health reporting when resources are scarce
        - Reduce memory usage while repairing refcount btree
        - Extend "Bmap update" intent item to support atomic extent
          swapping on the realtime device
        - Extend "Bmap update" intent item to support extended attribute
          fork and unwritten extents
    - Code cleanups:
        - Bmap log intent
        - Btree block pointer checking
        - Btree readahead
        - Buffer target
        - Symbolic link code

 - Remove mrlock wrapper around the rwsem

 - Convert all the GFP_NOFS flag usages to use the scoped
   memalloc_nofs_save() API instead of direct calls with the GFP_NOFS

 - Refactor and simplify xfile abstraction. Lower level APIs in shmem.c
   are required to be exported in order to achieve this

 - Skip checking alignment constraints for inode chunk allocations when
   block size is larger than inode chunk size

 - Do not submit delwri buffers collected during log recovery when an
   error has been encountered

 - Fix SEEK_HOLE/DATA for file regions which have active COW extents

 - Fix lock order inversion when executing error handling path during
   shrinking a filesystem

 - Remove duplicate ifdefs

* tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (183 commits)
  xfs: shrink failure needs to hold AGI buffer
  mm/shmem.c: Use new form of *@param in kernel-doc
  kernel-doc: Add unary operator * to $type_param_ref
  xfs: use kvfree() in xlog_cil_free_logvec()
  xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAIL
  xfs: fix scrub stats file permissions
  xfs: fix log recovery erroring out on refcount recovery failure
  xfs: move symlink target write function to libxfs
  xfs: move remote symlink target read function to libxfs
  xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h
  xfs: xfs_bmap_finish_one should map unwritten extents properly
  xfs: support deferred bmap updates on the attr fork
  xfs: support recovering bmap intent items targetting realtime extents
  xfs: add a realtime flag to the bmap update log redo items
  xfs: add a xattr_entry helper
  xfs: fix xfs_bunmapi to allow unmapping of partial rt extents
  xfs: move xfs_bmap_defer_add to xfs_bmap_item.c
  xfs: reuse xfs_bmap_update_cancel_item
  xfs: add a bi_entry helper
  xfs: remove xfs_trans_set_bmap_flags
  ...
2024-03-13 13:52:24 -07:00
Linus Torvalds
f88c3fb81c mm, slab: remove last vestiges of SLAB_MEM_SPREAD
Yes, yes, I know the slab people were planning on going slow and letting
every subsystem fight this thing on their own.  But let's just rip off
the band-aid and get it over and done with.  I don't want to see a
number of unnecessary pull requests just to get rid of a flag that no
longer has any meaning.

This was mainly done with a couple of 'sed' scripts and then some manual
cleanup of the end result.

Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-12 20:32:19 -07:00
Linus Torvalds
0f1a876682 vfs-6.9.uuid
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem5LwAKCRCRxhvAZXjc
 onZsAQCjMNabNWAty2VBAQrNIpGkZ+AMA2DxEajPldaPiJH5zQEA9ea7feB3T47i
 NUrXXfMQ5DSop+k5Y65pPkEpbX4rhQo=
 =NZgd
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs uuid updates from Christian Brauner:
 "This adds two new ioctl()s for getting the filesystem uuid and
  retrieving the sysfs path based on the path of a mounted filesystem.
  Getting the filesystem uuid has been implemented in filesystem
  specific code for a while it's now lifted as a generic ioctl"

* tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: add support for FS_IOC_GETFSSYSFSPATH
  fs: add FS_IOC_GETFSSYSFSPATH
  fat: Hook up sb->s_uuid
  fs: FS_IOC_GETUUID
  ovl: convert to super_set_uuid()
  fs: super_set_uuid()
2024-03-11 11:02:06 -07:00
Linus Torvalds
910202f00a vfs-6.9.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc
 ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l
 9NdULtniW1reuCYkc8R7dYM8S+yAwAc=
 =Y1qX
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull block handle updates from Christian Brauner:
 "Last cycle we changed opening of block devices, and opening a block
  device would return a bdev_handle. This allowed us to implement
  support for restricting and forbidding writes to mounted block
  devices. It was accompanied by converting and adding helpers to
  operate on bdev_handles instead of plain block devices.

  That was already a good step forward but ultimately it isn't necessary
  to have special purpose helpers for opening block devices internally
  that return a bdev_handle.

  Fundamentally, opening a block device internally should just be
  equivalent to opening files. So now all internal opens of block
  devices return files just as a userspace open would. Instead of
  introducing a separate indirection into bdev_open_by_*() via struct
  bdev_handle bdev_file_open_by_*() is made to just return a struct
  file. Opening and closing a block device just becomes equivalent to
  opening and closing a file.

  This all works well because internally we already have a pseudo fs for
  block devices and so opening block devices is simple. There's a few
  places where we needed to be careful such as during boot when the
  kernel is supposed to mount the rootfs directly without init doing it.
  Here we need to take care to ensure that we flush out any asynchronous
  file close. That's what we already do for opening, unpacking, and
  closing the initramfs. So nothing new here.

  The equivalence of opening and closing block devices to regular files
  is a win in and of itself. But it also has various other advantages.
  We can remove struct bdev_handle completely. Various low-level helpers
  are now private to the block layer. Other helpers were simply
  removable completely.

  A follow-up series that is already reviewed build on this and makes it
  possible to remove bdev->bd_inode and allows various clean ups of the
  buffer head code as well. All places where we stashed a bdev_handle
  now just stash a file and use simple accessors to get to the actual
  block device which was already the case for bdev_handle"

* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  block: remove bdev_handle completely
  block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
  bdev: remove bdev pointer from struct bdev_handle
  bdev: make struct bdev_handle private to the block layer
  bdev: make bdev_{release, open_by_dev}() private to block layer
  bdev: remove bdev_open_by_path()
  reiserfs: port block device access to file
  ocfs2: port block device access to file
  nfs: port block device access to files
  jfs: port block device access to file
  f2fs: port block device access to files
  ext4: port block device access to file
  erofs: port device access to file
  btrfs: port device access to file
  bcachefs: port block device access to file
  target: port block device access to file
  s390: port block device access to file
  nvme: port block device access to file
  block2mtd: port device access to files
  bcache: port block device access to files
  ...
2024-03-11 10:52:34 -07:00
Linus Torvalds
54126fafea vfs-6.9.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4UQAKCRCRxhvAZXjc
 ouERAQDg63R9s3bKmUgGqngf9cfr//VCTE+WVARwOUTdn2iDbwEA1IME7X1kL/Vz
 EdhEjyqO6xom+ao/Vqxe0XIDNz70vgs=
 =8RdE
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:

 - Restore read-write hints in struct bio through the bi_write_hint
   member for the sake of UFS devices in mobile applications. This can
   result in up to 40% lower write amplification in UFS devices. The
   patch series that builds on this will be coming in via the SCSI
   maintainers (Bart)

 - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able
   to map multiple blocks at once as long as they're in the same folio.
   This reduces CPU usage for buffered write workloads on e.g., xfs on
   systems with lots of cores (Christoph)

 - Record processed bytes in iomap_iter() trace event (Kassey)

 - Extend iomap_writepage_map() trace event after Christoph's
   ->map_block() changes to map mutliple blocks at once (Zhang)

* tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  iomap: Add processed for iomap_iter
  iomap: add pos and dirty_len into trace_iomap_writepage_map
  block, fs: Restore the per-bio/request data lifetime fields
  fs: Propagate write hints to the struct block_device inode
  fs: Move enum rw_hint into a new header file
  fs: Split fcntl_rw_hint()
  fs: Verify write lifetime constants at compile time
  fs: Fix rw_hint validation
  iomap: pass the length of the dirty region to ->map_blocks
  iomap: map multiple blocks at a time
  iomap: submit ioends immediately
  iomap: factor out a iomap_writepage_map_block helper
  iomap: only call mapping_set_error once for each failed bio
  iomap: don't chain bios
  iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend
  iomap: clean up the iomap_alloc_ioend calling convention
  iomap: move all remaining per-folio logic into iomap_writepage_map
  iomap: factor out a iomap_writepage_handle_eof helper
  iomap: move the PF_MEMALLOC check to iomap_writepages
  iomap: move the io_folios field out of struct iomap_ioend
  ...
2024-03-11 10:07:03 -07:00
Dave Chinner
75bcffbb9e xfs: shrink failure needs to hold AGI buffer
Chandan reported a AGI/AGF lock order hang on xfs/168 during recent
testing. The cause of the problem was the task running xfs_growfs
to shrink the filesystem. A failure occurred trying to remove the
free space from the btrees that the shrink would make disappear,
and that meant it ran the error handling for a partial failure.

This error path involves restoring the per-ag block reservations,
and that requires calculating the amount of space needed to be
reserved for the free inode btree. The growfs operation hung here:

[18679.536829]  down+0x71/0xa0
[18679.537657]  xfs_buf_lock+0xa4/0x290 [xfs]
[18679.538731]  xfs_buf_find_lock+0xf7/0x4d0 [xfs]
[18679.539920]  xfs_buf_lookup.constprop.0+0x289/0x500 [xfs]
[18679.542628]  xfs_buf_get_map+0x2b3/0xe40 [xfs]
[18679.547076]  xfs_buf_read_map+0xbb/0x900 [xfs]
[18679.562616]  xfs_trans_read_buf_map+0x449/0xb10 [xfs]
[18679.569778]  xfs_read_agi+0x1cd/0x500 [xfs]
[18679.573126]  xfs_ialloc_read_agi+0xc2/0x5b0 [xfs]
[18679.578708]  xfs_finobt_calc_reserves+0xe7/0x4d0 [xfs]
[18679.582480]  xfs_ag_resv_init+0x2c5/0x490 [xfs]
[18679.586023]  xfs_ag_shrink_space+0x736/0xd30 [xfs]
[18679.590730]  xfs_growfs_data_private.isra.0+0x55e/0x990 [xfs]
[18679.599764]  xfs_growfs_data+0x2f1/0x410 [xfs]
[18679.602212]  xfs_file_ioctl+0xd1e/0x1370 [xfs]

trying to get the AGI lock. The AGI lock was held by a fstress task
trying to do an inode allocation, and it was waiting on the AGF
lock to allocate a new inode chunk on disk. Hence deadlock.

The fix for this is for the growfs code to hold the AGI over the
transaction roll it does in the error path. It already holds the AGF
locked across this, and that is what causes the lock order inversion
in the xfs_ag_resv_init() call.

Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Fixes: 46141dc891 ("xfs: introduce xfs_ag_shrink_space()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-07 14:59:05 +05:30
Dave Chinner
b8c0d6fa41 xfs: use kvfree() in xlog_cil_free_logvec()
The xfs_log_vec items are allocated by xlog_kvmalloc(), and so need
to be freed with kvfree. This was missed when coverting from the
kmem_free() API.

Fixes: 4929257613 ("xfs: convert kmem_free() for kvmalloc users to kvfree()")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-28 14:04:45 +05:30
Dave Chinner
3aca0676a1 xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAIL
This was missed in the conversion from KM* flags.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 10634530f7 ("xfs: convert kmem_zalloc() to kzalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-28 14:04:30 +05:30
Shiyang Ruan
27c86d43bc xfs: drop experimental warning for FSDAX
FSDAX and reflink can work together now, let's drop this warning.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-27 09:53:30 +05:30
Darrick J. Wong
e610e856b9 xfs: fix scrub stats file permissions
When the kernel is in lockdown mode, debugfs will only show files that
are world-readable and cannot be written, mmaped, or used with ioctl.
That more or less describes the scrub stats file, except that the
permissions are wrong -- they should be 0444, not 0644.  You can't write
the stats file, so the 0200 makes no sense.

Meanwhile, the clear_stats file is only writable, but it got mode 0400
instead of 0200, which would make more sense.

Fix both files so that they make sense.

Fixes: d7a74cad8f ("xfs: track usage statistics of online fsck")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-26 17:58:37 +05:30
Christian Brauner
1b9e2d9014
xfs: port block device access to files
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-7-adbd023e19cc@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25 12:05:23 +01:00
Christian Brauner
f3a608827d
bdev: open block device as files
Add two new helpers to allow opening block devices as files.
This is not the final infrastructure. This still opens the block device
before opening a struct a file. Until we have removed all references to
struct bdev_handle we can't switch the order:

* Introduce blk_to_file_flags() to translate from block specific to
  flags usable to pen a new file.
* Introduce bdev_file_open_by_{dev,path}().
* Introduce temporary sb_bdev_handle() helper to retrieve a struct
  bdev_handle from a block device file and update places that directly
  reference struct bdev_handle to rely on it.
* Don't count block device openes against the number of open files. A
  bdev_file_open_by_{dev,path}() file is never installed into any
  file descriptor table.

One idea that came to mind was to use kernel_tmpfile_open() which
would require us to pass a path and it would then call do_dentry_open()
going through the regular fops->open::blkdev_open() path. But then we're
back to the problem of routing block specific flags such as
BLK_OPEN_RESTRICT_WRITES through the open path and would have to waste
FMODE_* flags every time we add a new one. With this we can avoid using
a flag bit and we have more leeway in how we open block devices from
bdev_open_by_{dev,path}().

Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25 12:05:21 +01:00
Darrick J. Wong
1e5efd72a2 xfs: fix log recovery erroring out on refcount recovery failure
Per the comment in the error case of xfs_reflink_recover_cow, zero out
any error (after shutting down the log) so that we actually kill any new
intent items that might have gotten logged by later recovery steps.
Discovered by xfs/434, which few people actually seem to run.

Fixes: 2c1e31ed5c ("xfs: place intent recovery under NOFS allocation context")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-24 10:43:26 +05:30