Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- MD changes via Song:
- md-bitmap refactoring (Yu Kuai)
- raid5 performance optimization (Artur Paszkiewicz)
- Other small fixes (Yu Kuai, Chen Ni)
- Add a sysfs entry 'new_level' (Xiao Ni)
- Improve information reported in /proc/mdstat (Mateusz Kusiak)
- NVMe changes via Keith:
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)
- A syntax cleanup (Shen)
- Fix a Kconfig linking error (Arnd)
- New queue-depth quirk (Keith)
- Add missing unplug trace event (Keith)
- blk-iocost fixes (Colin, Konstantin)
- t10-pi modular removal and fixes (Alexey)
- Fix for potential BLKSECDISCARD overflow (Alexey)
- bio splitting cleanups and fixes (Christoph)
- Deal with folios rather than pages, speeding up how the block layer
handles bigger IOs (Kundan)
- Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)
- Reduce zoned device overhead in ublk (Ming)
- Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)
- Fix regression in partition error pointer checking (Riyan)
- Add support for write zeroes and rotational status in nbd (Wouter)
- Add Yu Kuai as new BFQ maintainer. The scheduler has been
unmaintained for quite a while.
- Various sets of fixes for BFQ (Yu Kuai)
- Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
Yang)
* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
nvme-pci: qdepth 1 quirk
block: fix potential invalid pointer dereference in blk_add_partition
blk_iocost: make read-only static array vrate_adj_pct const
block: unpin user pages belonging to a folio at once
mm: release number of pages of a folio
block: introduce folio awareness and add a bigger size from folio
block: Added folio-ized version of bio_add_hw_page()
block, bfq: factor out a helper to split bfqq in bfq_init_rq()
block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
block, bfq: remove local variable 'split' in bfq_init_rq()
block, bfq: remove bfq_log_bfqg()
block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
block, bfq: fix procress reference leakage for bfqq in merge chain
block, bfq: fix uaf for accessing waker_bfqq after splitting
blk-throttle: support prioritized processing of metadata
blk-throttle: remove last_low_overflow_time
drbd: Add NULL check for net_conf to prevent dereference in state validation
nvme-tcp: fix link failure for TCP auth
blk-mq: add missing unplug trace event
mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
...
Merge tag 'erofs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, we add file-backed mount support, which has has been a
strong requirement for years. It is especially useful when there are
thousands of images running on the same host for containers and other
sandbox use cases, unlike OS image use cases.
Without file-backed mounts, it's hard for container runtimes to manage
and isolate so many unnecessary virtual block devices safely and
efficiently; therefore file-backed mounts are highly preferred. For
EROFS users, ComposeFS [1], containerd, and Android APEXes [2] will
directly benefit from it, and I've seen no risk in implementing it as
a completely immutable filesystem.
The previous experimental feature "EROFS over fscache" is now marked
as deprecated because:
- Fscache is no longer an independent subsystem and has been merged
into netfs, which was somewhat unexpected when it was proposed.
- New HSM "fanotify pre-content hooks" [3] will be landed upstream.
These hooks will replace "EROFS over fscache" in a simpler way, as
EROFS won't be bother with kernel caching anymore. Userspace
programs can also manage their own caching hierarchy more flexibly.
Once the HSM "fanotify pre-content hooks" land, I will remove the
fscache backend entirely as an internal dependency cleanup. More
background is given in the original patchset [4].
In addition to that, there are bugfixes and cleanups as usual.
Summary:
- Support file-backed mounts for containers and sandboxes
- Mark the experimental fscache backend as deprecated
- Handle overlapped pclusters caused by crafted images properly
- Fix a failure path which could cause infinite loops in
z_erofs_init_decompressor()
- Get rid of unnecessary NOFAILs
- Harmless on-disk hardening & minor cleanups"
* tag 'erofs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: reject inodes with negative i_size
erofs: restrict pcluster size limitations
erofs: allocate more short-lived pages from reserved pool first
erofs: sunset unneeded NOFAILs
erofs: simplify erofs_map_blocks_flatmode()
erofs: refactor read_inode calling convention
erofs: use kmemdup_nul in erofs_fill_symlink
erofs: mark experimental fscache backend deprecated
erofs: support compressed inodes for fileio
erofs: support unencoded inodes for fileio
erofs: add file-backed mount support
erofs: handle overlapped pclusters out of crafted images properly
erofs: fix error handling in z_erofs_init_decompressor
erofs: clean up erofs_register_sysfs()
erofs: fix incorrect symlink detection in fast symlink
Merge tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings mostly refactoring, cleanups, minor performance
optimizations and usual fixes. The folio API conversions are most
noticeable.
There's one less visible change that could have a high impact: the
extent lock scope for reads is reduced and no longer held for the entire
operation. In the buffered read case it's left to the page or inode lock;
some direct IO read synchronization is still needed.
The extent lock used to prevent deadlocks induced by page faults during
direct IO, so there was a 4K limitation on the requests, e.g. for io_uring.
In the future this will allow smoother integration with iomap, where
the extent read lock was a major obstacle.
User visible changes:
- the FSTRIM ioctl updates the processed range even after an error or
interruption
- the cleaner thread is woken up directly by the SYNC ioctl instead of
waking the transaction thread, which can add some delay before waking
up the cleaner; this can speed up cleaning of deleted subvolumes
- print an error message when opening a device fails, e.g. when it's
unexpectedly read-only
Core changes:
- improved extent map handling in various ways (locking, iteration, ...)
- new assertions and locking annotations
- raid-stripe-tree locking fixes
- use xarray for tracking dirty qgroup extents, switched from rb-tree
- turn the subpage test into a compile-time condition if possible (e.g.
on x86_64 with 4K pages); this allows skipping a lot of ifs and
removing dead code
- more preparatory work for compression in subpage mode
Cleanups and refactoring:
- folio API conversions, many simple cases where a page is passed, so
switch them to folios
- more subpage code refactoring, update page state bitmap processing
- introduce auto free for btrfs_path structure, use for the simple
cases"
* tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
btrfs: only unlock the to-be-submitted ranges inside a folio
btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock()
btrfs: BTRFS_PATH_AUTO_FREE in orphan.c
btrfs: use btrfs_path auto free in zoned.c
btrfs: DEFINE_FREE for struct btrfs_path
btrfs: remove btrfs_folio_end_all_writers()
btrfs: constify more pointer parameters
btrfs: rework BTRFS_I as macro to preserve parameter const
btrfs: add and use helper to verify the calling task has locked the inode
btrfs: always update fstrim_range on failure in FITRIM ioctl
btrfs: convert copy_inline_to_page() to use folio
btrfs: convert btrfs_decompress() to take a folio
btrfs: convert zstd_decompress() to take a folio
btrfs: convert lzo_decompress() to take a folio
btrfs: convert zlib_decompress() to take a folio
btrfs: convert try_release_extent_mapping() to take a folio
btrfs: convert try_release_extent_state() to take a folio
btrfs: convert submit_eb_page() to take a folio
btrfs: convert submit_eb_subpage() to take a folio
btrfs: convert read_key_bytes() to take a folio
...
Merge tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This contains the work to improve read/write performance for the new
netfs library.
The main performance enhancing changes are:
- Define a structure, struct folio_queue, and a new iterator type,
ITER_FOLIOQ, to hold a buffer as a replacement for ITER_XARRAY. See
that patch for questions about naming and form.
ITER_FOLIOQ is provided as a replacement for ITER_XARRAY. The
problem with an xarray is that accessing it requires the use of a
lock (typically the RCU read lock) - and this means that we can't
supply iterate_and_advance() with a step function that might sleep
(crypto for example) without having to drop the lock between pages.
ITER_FOLIOQ is the iterator for a chain of folio_queue structs,
where each folio_queue holds a small list of folios. A folio_queue
struct is a simpler structure than xarray and is not subject to
concurrent manipulation by the VM. folio_queue is used rather than
a bvec[] as it can form lists of indefinite size, adding to one end
and removing from the other on the fly (a simplified sketch of the
structure follows after this summary).
- Provide a copy_folio_from_iter() wrapper.
- Make cifs RDMA support ITER_FOLIOQ.
- Use folio queues in the write-side helpers instead of xarrays.
- Add a function to reset the iterator in a subrequest.
- Simplify the write-side helpers to use sheaves to skip gaps rather
than trying to work out where gaps are.
- In afs, make the read subrequests asynchronous, putting them into
work items to allow the next patch to do progressive
unlocking/reading.
- Overhaul the read-side helpers to improve performance.
- Fix the caching of a partial block at the end of a file.
- Allow a store to be cancelled.
Then some changes for cifs to make it use folio queues instead of
xarrays for crypto bufferage:
- Use raw iteration functions rather than manually coding iteration
when hashing data.
- Switch to using folio_queue for crypto buffers.
- Remove the xarray bits.
Make some adjustments to the /proc/fs/netfs/stats file such that:
- All the netfs stats lines begin 'Netfs:'; change this prefix to
something a bit more useful.
- Add a couple of stats counters to track the numbers of skips and
waits on the per-inode writeback serialisation lock to make it
easier to check for this as a source of performance loss.
Miscellaneous work:
- Ensure that the sb_writers lock is taken around
vfs_{set,remove}xattr() in the cachefiles code.
- Reduce the number of conditional branches in netfs_perform_write().
- Move the CIFS_INO_MODIFIED_ATTR flag to the netfs_inode struct and
remove cifs_post_modify().
- Move the max_len/max_nr_segs members from netfs_io_subrequest to
netfs_io_request as they're only needed for one subreq at a time.
- Add an 'unknown' source value for tracing purposes.
- Remove NETFS_COPY_TO_CACHE as it's no longer used.
- Set the request work function up front at allocation time.
- Use bh-disabling spinlocks for rreq->lock as cachefiles completion
may be run from block-filesystem DIO completion in softirq context.
- Remove fs/netfs/io.c"
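Regarding the folio_queue item above, a simplified sketch of the idea in C
(field and constant names here are made up for illustration and do not match
the exact include/linux/folio_queue.h definition):

  /* Illustrative sketch only: each segment carries a small fixed batch of
   * folios plus per-slot metadata, and segments chain into a list that can
   * grow at the producer end while being consumed from the other end. */
  #define FQ_SKETCH_BATCH 16

  struct folio;                                   /* opaque folio handle */

  struct folio_queue_sketch {
          struct folio *folios[FQ_SKETCH_BATCH];  /* small batch of folios */
          unsigned char orders[FQ_SKETCH_BATCH];  /* folio order (size) per slot */
          unsigned long marks;                    /* one state bit per slot */
          unsigned int  nr;                       /* slots currently in use */
          struct folio_queue_sketch *next;        /* towards the producer end */
          struct folio_queue_sketch *prev;        /* towards the consumer end */
  };

An ITER_FOLIOQ iterator then just walks the chain segment by segment, so no
lock has to be held across a step function that might sleep.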
* tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
docs: filesystems: corrected grammar of netfs page
cifs: Don't support ITER_XARRAY
cifs: Switch crypto buffer to use a folio_queue rather than an xarray
cifs: Use iterate_and_advance*() routines directly for hashing
netfs: Cancel dirty folios that have no storage destination
cachefiles, netfs: Fix write to partial block at EOF
netfs: Remove fs/netfs/io.c
netfs: Speed up buffered reading
afs: Make read subreqs async
netfs: Simplify the writeback code
netfs: Provide an iterator-reset function
netfs: Use new folio_queue data type and iterator instead of xarray iter
cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs
iov_iter: Provide copy_folio_from_iter()
mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios
netfs: Use bh-disabling spinlocks for rreq->lock
netfs: Set the request work function upon allocation
netfs: Remove NETFS_COPY_TO_CACHE
netfs: Reserve netfs_sreq_source 0 as unset/unknown
netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_stream
...
Merge tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"Recently, we added the ability to list mounts in other mount
namespaces and the ability to retrieve namespace file descriptors
without having to go through procfs by deriving them from pidfds.
This extends nsfs in two ways:
(1) Add the ability to retrieve information about a mount namespace
via NS_MNT_GET_INFO.
This will return the mount namespace id and the number of mounts
currently in the mount namespace. The number of mounts can be
used to size the buffer that needs to be used for listmount() and
is in general useful without having to actually iterate through
all the mounts.
The structure is extensible.
(2) Add the ability to iterate through all mount namespaces over
which the caller holds privilege returning the file descriptor
for the next or previous mount namespace.
To retrieve a mount namespace the caller must be privileged with
respect to its owning user namespace. This means that PID 1 on the host
can list all mounts in all mount namespaces or that a container
can list all mounts of its nested containers.
Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
namespace in one go.
(1) and (2) can be implemented for other namespace types easily.
Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs.
The commit message in 49224a345c ('Merge patch series "nsfs: iterate
through mount namespaces"') contains example code showing how to do this"
* tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nsfs: iterate through mount namespaces
file: add fput() cleanup helper
fs: add put_mnt_ns() cleanup helper
fs: allow mount namespace fd
Merge tag 'vfs-6.12.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull procfs updates from Christian Brauner:
"This contains the following changes for procfs:
- Add config options and parameters to block forcing memory writes.
This adds a Kconfig option and boot param to allow removing the
FOLL_FORCE flag from /proc/<pid>/mem write calls as this can be
used in various attacks.
The traditional forcing behavior is kept as default because it can
break GDB and some other use cases.
This is the simpler version that you had requested.
- Restrict overmounting of ephemeral entities.
It is currently possible to mount on top of various ephemeral
entities in procfs. This specifically includes magic links. To
recap, magic links are links of the form /proc/<pid>/fd/<nr>. They
serve as references to a target file and during path lookup they
cause a jump to the target path. Such magic links disappear if the
corresponding file descriptor is closed.
Currently it is possible to overmount such magic links. This is
mostly interesting for an attacker that wants to somehow trick a
process into e.g., reopening something that it didn't intend to
reopen or to hide a malicious file descriptor.
But also it risks leaking mounts for long-running processes. When
overmounting a magic link like above, the mount will not be
detached when the file descriptor is closed. Only the target
mountpoint will disappear, which makes it impossible to unmount
that mount afterwards. So the mount will
stick around until the process exits and the /proc/<pid>/ directory
is cleaned up during proc_flush_pid() when the dentries are pruned
and invalidated.
That in turn means it's possible for a program to accidentally leak
mounts, and it's also possible to make a task leak mounts without
its knowledge if the attacker just keeps overmounting things under
/proc/<pid>/fd/<nr>.
Disallow overmounting of such ephemeral entities.
- Cleanup the readdir method naming in some procfs file operations.
- Replace kmalloc() and strcpy() with a simple kmemdup() call"
* tag 'vfs-6.12.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
proc: fold kmalloc() + strcpy() into kmemdup()
proc: block mounting on top of /proc/<pid>/fdinfo/*
proc: block mounting on top of /proc/<pid>/fd/*
proc: block mounting on top of /proc/<pid>/map_files/*
proc: add proc_splice_unmountable()
proc: proc_readfdinfo() -> proc_fdinfo_iterate()
proc: proc_readfd() -> proc_fd_iterate()
proc: add config & param to block forcing mem writes
Merge tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fallocate updates from Christian Brauner:
"This contains work to try and cleanup some the fallocate mode
handling. Currently, it confusingly mixes operation modes and an
optional flag.
The work here tries to better define operation modes and optional
flags allowing the core and filesystem code to use switch statements
to switch on the operation mode"
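As a rough illustration of that pattern, a hedged sketch in C: the optional
FALLOC_FL_KEEP_SIZE flag is masked off and the remaining bits are treated as
a mutually exclusive operation mode (the mask and function names below are
assumptions, not necessarily the helpers the series introduces):

  #include <errno.h>
  #include <linux/falloc.h>                 /* FALLOC_FL_* definitions */

  #define FALLOC_MODE_MASK_SKETCH (~FALLOC_FL_KEEP_SIZE)

  static int dispatch_fallocate(int mode)
  {
          switch (mode & FALLOC_MODE_MASK_SKETCH) {
          case 0:                           /* plain allocation */
          case FALLOC_FL_UNSHARE_RANGE:
                  return 0;                 /* allocate/unshare path */
          case FALLOC_FL_PUNCH_HOLE:
          case FALLOC_FL_ZERO_RANGE:
          case FALLOC_FL_COLLAPSE_RANGE:
          case FALLOC_FL_INSERT_RANGE:
                  return 0;                 /* each mode gets its own helper */
          default:
                  return -EOPNOTSUPP;       /* mixed or unknown modes are rejected */
          }
  }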
* tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: refactor xfs_file_fallocate
xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_space
xfs: call xfs_flush_unmap_range from xfs_free_file_space
fs: sort out the fallocate mode vs flag mess
ext4: remove tracing for FALLOC_FL_NO_HIDE_STALE
block: remove checks for FALLOC_FL_NO_HIDE_STALE
Merge tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file updates from Christian Brauner:
"This is the work to cleanup and shrink struct file significantly.
Right now, (focusing on x86) struct file is 232 bytes. After this
series struct file will be 184 bytes aka 3 cacheline and a spare 8
bytes for future extensions at the end of the struct.
With struct file being as ubiquitous as it is this should make a
difference for file heavy workloads and allow further optimizations in
the future.
- struct fown_struct was embedded into struct file letting it take up
32 bytes in total when really it shouldn't even be embedded in
struct file in the first place. Instead, actual users of struct
fown_struct now allocate the struct on demand. This frees up 24
bytes.
- Move struct file_ra_state into the union containing the cleanup hooks
and move f_iocb_flags out of the union. This closes a 4 byte hole
we created earlier and brings struct file to 192 bytes, which means
struct file is 3 cachelines and we managed to shrink it by 40
bytes.
- Reorder struct file so that nothing crosses a cacheline.
I suspect that in the future we will end up reordering some members
to mitigate false sharing issues or just because someone does
actually provide really good perf data.
- Shrinking struct file to 192 bytes is only part of the work.
Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache
is created with SLAB_TYPESAFE_BY_RCU the free pointer must be
located outside of the object because the cache doesn't know what
part of the memory can safely be overwritten as it may be needed to
prevent object recycling.
That has the consequence that SLAB_TYPESAFE_BY_RCU may end up
adding a new cacheline.
So this also contains work to add a new kmem_cache_create_rcu()
function that allows the caller to specify an offset where the
freelist pointer is supposed to be placed. Thus avoiding the
implicit addition of a fourth cacheline.
- And finally this removes the f_version member in struct file.
The f_version member isn't particularly well-defined. It is mainly
used as a cookie to detect concurrent seeks when iterating
directories. But it is also abused by some subsystems for
completely unrelated things.
It is mostly a directory and filesystem specific thing that doesn't
really need to live in struct file and with its wonky semantics it
really lacks a specific function.
For pipes, f_version is (ab)used to defer poll notifications until
a write has happened. And struct pipe_inode_info is used by
multiple struct files in their ->private_data so there's no chance
of pushing that down into file->private_data without introducing
another pointer indirection.
But pipes don't rely on f_pos_lock so this adds a union into struct
file encompassing f_pos_lock and a pipe specific f_pipe member that
pipes can use. This union of course can be extended to other file
types and is similar to what we do in struct inode already"
* tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
fs: remove f_version
pipe: use f_pipe
fs: add f_pipe
ubifs: store cookie in private data
ufs: store cookie in private data
udf: store cookie in private data
proc: store cookie in private data
ocfs2: store cookie in private data
input: remove f_version abuse
ext4: store cookie in private data
ext2: store cookie in private data
affs: store cookie in private data
fs: add generic_llseek_cookie()
fs: use must_set_pos()
fs: add must_set_pos()
fs: add vfs_setpos_cookie()
s390: remove unused f_version
ceph: remove unused f_version
adi: remove unused f_version
mm: Removed @freeptr_offset to prevent doc warning
...
Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs folio updates from Christian Brauner:
"This contains work to port write_begin and write_end to rely on folios
for various filesystems.
This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs,
ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and
squashfs.
After this series lands a bunch of the filesystems in this list do not
mention struct page anymore"
* tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits)
Squashfs: Ensure all readahead pages have been used
Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index
Squashfs: Update squashfs_readpage_block() to not use page->index
Squashfs: Update squashfs_readahead() to not use page->index
Squashfs: Update page_actor to not use page->index
jffs2: Use a folio in jffs2_garbage_collect_dnode()
jffs2: Convert jffs2_do_readpage_nolock to take a folio
buffer: Convert __block_write_begin() to take a folio
ocfs2: Convert ocfs2_write_zero_page to use a folio
fs: Convert aops->write_begin to take a folio
fs: Convert aops->write_end to take a folio
vboxsf: Use a folio in vboxsf_write_end()
orangefs: Convert orangefs_write_begin() to use a folio
orangefs: Convert orangefs_write_end() to use a folio
jffs2: Convert jffs2_write_begin() to use a folio
jffs2: Convert jffs2_write_end() to use a folio
hostfs: Convert hostfs_write_end() to use a folio
fuse: Convert fuse_write_begin() to use a folio
fuse: Convert fuse_write_end() to use a folio
f2fs: Convert f2fs_write_begin() to use a folio
...
Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual pile of misc updates:
Features:
- Add F_CREATED_QUERY fcntl() that allows userspace to query whether
a file was actually created. Often userspace wants to know whether
an O_CREAT request did actually create a file without using
O_EXCL. The current logic is to first attempt to open the
file without O_CREAT | O_EXCL and, if ENOENT is returned, to try
again with both flags. If that succeeds all is well. If it
now reports EEXIST it retries.
That works fairly well but some corner cases make this more
involved. If this operates on a dangling symlink the first openat()
without O_CREAT | O_EXCL will return ENOENT but the second openat()
with O_CREAT | O_EXCL will fail with EEXIST.
The reason is that openat() without O_CREAT | O_EXCL follows the
symlink while O_CREAT | O_EXCL doesn't for security reasons. So
it's not something we can really change unless we add an explicit
opt-in via O_FOLLOW which seems really ugly.
All available workarounds are really nasty (fanotify, bpf lsm etc)
so add a simple fcntl() (a usage sketch follows after this summary).
- Try an opportunistic lookup for O_CREAT. Today, when opening a file
we'll typically do a fast lookup, but if O_CREAT is set, the kernel
always takes the exclusive inode lock. This was likely done with
the expectation that O_CREAT means that we always expect to do the
create, but that's often not the case. Many programs set O_CREAT
even in scenarios where the file already exists (see related
F_CREATED_QUERY patch motivation above).
The series contained in this PR rearranges the pathwalk-for-open
code to also attempt a fast_lookup in certain O_CREAT cases. If a
positive dentry is found, the inode_lock can be avoided altogether
and it can stay in rcuwalk mode for the last step_into.
- Expose the 64 bit mount id via name_to_handle_at()
Now that we provide a unique 64-bit mount ID interface in statx(2),
we can provide a race-free way for name_to_handle_at(2) to
provide a file handle and corresponding mount without needing to
worry about racing with /proc/mountinfo parsing or having to open a
file just to do statx(2).
While this is not necessary if you are using AT_EMPTY_PATH and
don't care about an extra statx(2) call, users that pass full paths
into name_to_handle_at(2) need to know which mount the file handle
comes from (to make sure they don't try to open_by_handle_at a file
handle from a different filesystem) and switching to AT_EMPTY_PATH
would require allocating a file for every name_to_handle_at(2) call
- Add a per dentry expire timeout to autofs
There are two fairly well known automounter map formats, the autofs
format and the amd format (more or less System V and Berkeley).
Some time ago Linux autofs added an amd map format parser that
implemented a fair amount of the amd functionality. This was done
within the autofs infrastructure and some functionality wasn't
implemented because it either didn't make sense or required extra
kernel changes. The idea was to restrict changes to be within the
existing autofs functionality as much as possible and leave changes
with a wider scope to be considered later.
One of these changes is implementing the amd options:
1) "unmount", expire this mount according to a timeout (same as
the current autofs default).
2) "nounmount", don't expire this mount (same as setting the
autofs timeout to 0 except only for this specific mount) .
3) "utimeout=<seconds>", expire this mount using the specified
timeout (again same as setting the autofs timeout but only for
this mount)
To implement these options per-dentry expire timeouts need to be
implemented for autofs indirect mounts. This is because all map
keys (mounts) for autofs indirect mounts use an expire timeout
stored in the autofs mount superblock info structure and all
indirect mounts use the same expire timeout.
Fixes:
- Fix missing fput for FSCONFIG_SET_FD in autofs
- Use param->file for FSCONFIG_SET_FD in coda
- Delete the 'fs/netfs' proc subtree when the netfs module exits
- Make sure that struct uid_gid_map fits into a single cacheline
- Don't flush in-flight wb switches for superblocks without cgroup
writeback
- Correct the idmapping mount example in the idmapping
documentation
- Fix a race between evice_inodes() and find_inode() and iput()
- Refine the show_inode_state() macro definition in writeback code
- Prevent dump_mapping() from accessing invalid dentry.d_name.name
- Show actual source for debugfs in /proc/mounts
- Annotate data-race of busy_poll_usecs in eventpoll
- Don't WARN for racy path_noexec check in exec code
- Handle OOM on mnt_warn_timestamp_expiry()
- Fix some spelling in the iomap design documentation
- Fix typo in procfs comment
- Fix typo in fs/namespace.c comment
Cleanups:
- Add the VFS git tree to the MAINTAINERS file
- Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
bit in struct file bringing us to 5 free f_mode bits
- Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
the wait mechanism
- Remove the unused path_put_init() helper
- Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
specific
- Replace the unsigned long i_state member with a u32 i_state member
in struct inode freeing up 4 bytes in struct inode. Instead of
using the bit based wait apis we're now using the var event apis
and using the individual bytes of the i_state member to wait on
state changes
- Explain how per-syscall AT_* flags should be allocated
- Use in_group_or_capable() helper to simplify the posix acl mode
update code
- Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code
- Remove a stale comment about d_rcu_to_refcount() as that function
doesn't exist anymore
- Add kernel documentation for lookup_fast()
- Don't re-zero eventpoll fields
- Remove outdated comment after close_fd()
- Fix imprecise wording in comment about the pipe filesystem
- Drop GFP_NOFAIL mode from alloc_page_buffers
- Fix missing blank line warnings and improve a struct declaration in
file_table
- Annotate struct poll_list with __counted_by()
- Remove the unused read parameter in percpu-rwsem
- Remove linux/prefetch.h include from direct-io code
- Use kmemdup_array instead of kmemdup for multiple allocation in
mnt_idmapping code
- Remove unused mnt_cursor_del() declaration
Performance tweaks:
- Dodge smp_mb in break_lease and break_deleg in the common case
- Only read fops once in fops_{get,put}()
- Use RCU in ilookup()
- Elide smp_mb in iversion handling in the common case
- Drop one lock trip in evict()"
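Regarding the F_CREATED_QUERY item above, a minimal userspace usage sketch in
C; the fallback numeric value is an assumption for pre-6.12 headers and the
path is just an example:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  #ifndef F_CREATED_QUERY
  #define F_CREATED_QUERY 1028      /* assumed uapi value; shipped with 6.12+ headers */
  #endif

  int main(void)
  {
          /* Open with O_CREAT but without O_EXCL, then ask the kernel
           * whether this particular open() actually created the file. */
          int fd = open("/tmp/f_created_query_demo", O_WRONLY | O_CREAT, 0644);
          int created;

          if (fd < 0)
                  return 1;
          created = fcntl(fd, F_CREATED_QUERY, 0);   /* negative: unsupported kernel */
          printf("file was %s by this open()\n",
                 created > 0 ? "created" : "pre-existing (or query unsupported)");
          close(fd);
          return 0;
  }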
* tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
uidgid: make sure we fit into one cacheline
proc: Fix typo in the comment
fs/pipe: Correct imprecise wording in comment
fhandle: expose u64 mount id to name_to_handle_at(2)
uapi: explain how per-syscall AT_* flags should be allocated
fs: drop GFP_NOFAIL mode from alloc_page_buffers
writeback: Refine the show_inode_state() macro definition
fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
netfs: Delete subtree of 'fs/netfs' when netfs module exits
fs: use LIST_HEAD() to simplify code
inode: make i_state a u32
inode: port __I_LRU_ISOLATING to var event
vfs: fix race between evice_inodes() and find_inode()&iput()
inode: port __I_NEW to var event
inode: port __I_SYNC to var event
fs: reorder i_state bits
fs: add i_state helpers
MAINTAINERS: add the VFS git tree
fs: s/__u32/u32/ for s_fsnotify_mask
...
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"The highlights are support for Arm's "Permission Overlay Extension"
using memory protection keys, support for running as a protected guest
on Android as well as perf support for a bunch of new interconnect
PMUs.
Summary:
ACPI:
- Enable PMCG erratum workaround for HiSilicon HIP10 and 11
platforms.
- Ensure arm64-specific IORT header is covered by MAINTAINERS.
CPU Errata:
- Enable workaround for hardware access/dirty issue on Ampere-1A
cores.
Memory management:
- Define PHYSMEM_END to fix a crash in the amdgpu driver.
- Avoid tripping over invalid kernel mappings on the kexec() path.
- Userspace support for the Permission Overlay Extension (POE) using
protection keys.
Perf and PMUs:
- Add support for the "fixed instruction counter" extension in the
CPU PMU architecture.
- Extend and fix the event encodings for Apple's M1 CPU PMU.
- Allow LSM hooks to decide on SPE permissions for physical
profiling.
- Add support for the CMN S3 and NI-700 PMUs.
Confidential Computing:
- Add support for booting an arm64 kernel as a protected guest under
Android's "Protected KVM" (pKVM) hypervisor.
Selftests:
- Fix vector length issues in the SVE/SME sigreturn tests
- Fix build warning in the ptrace tests.
Timers:
- Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
non-determinism arising from the architected counter.
Miscellaneous:
- Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
don't succeed.
- Minor fixes and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
perf: arm-ni: Fix an NULL vs IS_ERR() bug
arm64: hibernate: Fix warning for cast from restricted gfp_t
arm64: esr: Define ESR_ELx_EC_* constants as UL
arm64: pkeys: remove redundant WARN
perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
MAINTAINERS: List Arm interconnect PMUs as supported
perf: Add driver for Arm NI-700 interconnect PMU
dt-bindings/perf: Add Arm NI-700 PMU
perf/arm-cmn: Improve format attr printing
perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
arm64/mm: use lm_alias() with addresses passed to memblock_free()
mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
arm64: Expose the end of the linear map in PHYSMEM_END
arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
arm64/mm: Delete __init region from memblock.reserved
perf/arm-cmn: Support CMN S3
dt-bindings: perf: arm-cmn: Add CMN S3
perf/arm-cmn: Refactor DTC PMU register access
perf/arm-cmn: Make cycle counts less surprising
perf/arm-cmn: Improve build-time assertion
...
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD(). No functional impact.
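A minimal before/after sketch of the pattern (the list names are hypothetical):

  #include <linux/list.h>

  /* Before: the list head is declared and then initialized at runtime. */
  static struct list_head pending_reqs;

  static void demo_setup(void)
  {
          INIT_LIST_HEAD(&pending_reqs);    /* extra call, plus an ordering concern */
  }

  /* After: LIST_HEAD() declares and statically initializes the head in one
   * step, so the explicit INIT_LIST_HEAD() call is no longer needed. */
  static LIST_HEAD(pending_reqs_static);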
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
SFU, since its (first) version 3.0, supports AF_LOCAL sockets and stores them
on the filesystem as a system file with one zero byte. Add support for
detecting this SFU socket type to the cifs_sfu_type() function.
With this change cifs_sfu_type() correctly detects all special file types
created by SFU: fifo, socket, symlink, block and char.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
For debugging purposes it is a good idea to also show the detected SFU type
for fifos. A debug message is already printed for all other special types.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
SFU types IntxCHR and IntxBLK are 8 bytes with zero as the last byte. Make
this explicit in the memcpy and memset calls, so the zero byte is visible in
the code (and not hidden as a string's trailing nul byte).
It is important for the reader to see the last byte for block and char types
because it differs from the last byte of the symlink type (which is 0x01).
It is also important to show that the type is not a nul-terminated string,
but rather 8 bytes (with some printable bytes).
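For illustration only, a hedged sketch (not the in-tree cifs_sfu_type() code)
with the three tags written out byte by byte, showing that only the final
byte separates char/block devices from symlinks:

  #include <string.h>
  #include <sys/stat.h>

  /* Hypothetical helper: each tag is exactly 8 bytes; char/block end in
   * 0x00 while the symlink tag ends in 0x01, so all 8 bytes are compared
   * (there is no nul-terminated string handling here). */
  static mode_t sfu_tag_to_mode(const unsigned char tag[8])
  {
          static const unsigned char chr[8] = { 'I','n','t','x','C','H','R', 0x00 };
          static const unsigned char blk[8] = { 'I','n','t','x','B','L','K', 0x00 };
          static const unsigned char lnk[8] = { 'I','n','t','x','L','N','K', 0x01 };

          if (!memcmp(tag, chr, 8))
                  return S_IFCHR;
          if (!memcmp(tag, blk, 8))
                  return S_IFBLK;
          if (!memcmp(tag, lnk, 8))
                  return S_IFLNK;
          return 0;                         /* not a char/block/symlink tag */
  }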
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently, when the sfu mount option is specified, CIFS can recognize an SFU
symlink but is not able to read the symlink target location. The readlink()
syscall just returns that the operation is not supported.
Implement this missing functionality in the cifs_sfu_type() function: read
the target location of the SFU-style symlink, parse it and fill it into
fattr's cf_symlink_target member.
An SFU-style symlink is a file which has the system attribute set and whose
content is the buffer "IntxLNK\1" (the 8th byte is 0x01) followed by the
target location encoded in little-endian UCS-2/UTF-16. This format was
introduced in the Interix 3.0 subsystem, as part of Microsoft SFU 3.0, and
is also used by all later versions. Previous versions had no symlink support.
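A hedged userspace sketch of the on-disk layout just described (not the
in-kernel cifs parser); for brevity it only keeps the low byte of each UCS-2
unit, i.e. ASCII targets:

  #include <stdlib.h>
  #include <string.h>

  /* Layout: 8-byte tag "IntxLNK" + 0x01, then the target path encoded as
   * little-endian UCS-2. Returns a malloc'ed ASCII copy or NULL. */
  static char *sfu_symlink_target(const unsigned char *buf, size_t len)
  {
          static const unsigned char tag[8] = { 'I','n','t','x','L','N','K', 0x01 };
          size_t i, n;
          char *out;

          if (len < 8 || memcmp(buf, tag, 8))
                  return NULL;              /* not an SFU-style symlink */

          n = (len - 8) / 2;                /* number of UCS-2 code units */
          out = malloc(n + 1);
          if (!out)
                  return NULL;
          for (i = 0; i < n; i++)
                  out[i] = buf[8 + 2 * i];  /* low byte of each LE code unit */
          out[n] = '\0';
          return out;
  }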
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
SFU symlinks have an 8 byte prefix: "IntxLNK\1".
So also check the last (8th) byte, 0x01.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Using uninitialized value "bkt" when calling "kfree"
Fixes: 13b68d4499 ("smb: client: compress: LZ77 code improvements cleanup")
Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The dst pointer may not be initialized when calling kvfree(dst)
Fixes: 13b68d4499 ("smb: client: compress: LZ77 code improvements cleanup")
Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Check data compressibility with some heuristics (copied from
btrfs):
- should_compress() final decision is is_compressible(data)
- Cleanup compress/lz77.h leaving only lz77_compress() exposed:
- Move parts to compress/lz77.c, while removing the rest of it
because they were either unused, used only once, or
implemented wrong (thanks to David Howells for the help)
- Updated the compression parameters (still compatible with
Windows implementation) trading off ~20% compression ratio
for ~40% performance:
- min match len: 3 -> 4
- max distance: 8KiB -> 1KiB
- hash table type: u32 * -> u64 *
Known bugs:
This implementation currently works fine in general, but breaks with
some payloads used during testing. Investigation is ongoing, to be
fixed in a follow-up commit.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
On smb2_async_writev(), set CIFS_COMPRESS_REQ on request flags if
should_compress() returns true.
On smb_send_rqst() check the flags, and compress and send the request to
the server.
(*) If the compression fails with -EMSGSIZE (i.e. compressed size is >=
uncompressed size), the original uncompressed request is sent instead.
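A hedged sketch of that flow in C; the struct, flag and helper names below
are stand-ins rather than the actual cifs symbols:

  #include <errno.h>

  struct send_rqst { unsigned int flags; /* payload omitted */ };
  #define RQST_COMPRESS 0x1                 /* stand-in for CIFS_COMPRESS_REQ */

  /* Stand-in stubs for the real compress/transmit paths. */
  static int compress_and_send(struct send_rqst *r) { (void)r; return -EMSGSIZE; }
  static int send_uncompressed(struct send_rqst *r) { (void)r; return 0; }

  static int send_rqst(struct send_rqst *r)
  {
          if (r->flags & RQST_COMPRESS) {
                  int rc = compress_and_send(r);

                  /* -EMSGSIZE: compressed size >= uncompressed size, so
                   * fall through and transmit the original request. */
                  if (rc != -EMSGSIZE)
                          return rc;
          }
          return send_uncompressed(r);
  }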
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Move the SMB3.1.1 compression code into an experimental config option,
and fix the compress mount option. Implement the unchained LZ77
"plain" compression algorithm as per MS-XCA specification
section "2.3 Plain LZ77 Compression Algorithm Details".
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
cifs_dir_open() has been removed since
commit 737b758c96 ("[PATCH] cifs: character mapping of special
characters (part 3 of 3)"), and now it is useless, so remove it.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use the min() macro to simplify the function and improve
its readability.
Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use ERR_CAST() as it is designed for casting an error pointer to
another type.
This macro uses the __force and __must_check modifiers, which are used
to tell the compiler to check for errors where this macro is used.
Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Explained why the typo was not corrected.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
There are only 4 different definitions between the client and server:
- STATUS_SERVER_UNAVAILABLE: from client/smb2status.h
- STATUS_FILE_NOT_AVAILABLE: from client/smb2status.h
- STATUS_NO_PREAUTH_INTEGRITY_HASH_OVERLAP: from server/smbstatus.h
- STATUS_INVALID_LOCK_RANGE: from server/smbstatus.h
Rename client/smb2status.h to common/smb2status.h, and merge the
2 definitions from the server into the common header file.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In order to maintain the code more easily, move duplicate definitions
to new common header file.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Preparation for moving acl definitions to new common header file.
Use the following shell command to rename:
find fs/smb/client -type f -exec sed -i \
's/struct cifs_ace/struct smb_ace/g' {} +
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Preparation for moving acl definitions to new common header file.
Use the following shell command to rename:
find fs/smb/client -type f -exec sed -i \
's/struct cifs_acl/struct smb_acl/g' {} +
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Preparation for moving acl definitions to new common header file.
Use the following shell command to rename:
find fs/smb/client -type f -exec sed -i \
's/struct cifs_sid/struct smb_sid/g' {} +
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Preparation for moving acl definitions to new common header file.
Use the following shell command to rename:
find fs/smb/client -type f -exec sed -i \
's/struct cifs_ntsd/struct smb_ntsd/g' {} +
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Ray Zhang reported that ksmbd can not create a file if the parent filename is
caseless.
Y:\>mkdir A
Y:\>echo 123 >a\b.txt
The system cannot find the path specified.
Y:\>echo 123 >A\b.txt
This patch converts the name obtained by caseless lookup to the parent name.
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Ray Zhang <zhanglei002@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Some file systems may not provide dot (.) and dot-dot (..) as they are
optional in POSIX. ksmbd can misjudge emptiness of a directory in those
file systems, since it assumes there are always at least two entries:
dot and dot-dot.
Just don't count dot and dot-dot.
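A minimal sketch of the rule (the helper name is hypothetical, not the ksmbd
code):

  #include <stdbool.h>

  /* An entry only counts toward "this directory is non-empty" if it is
   * neither "." nor "..", since some filesystems never emit those. */
  static bool counts_as_real_entry(const char *name, int namlen)
  {
          if (namlen == 1 && name[0] == '.')
                  return false;
          if (namlen == 2 && name[0] == '.' && name[1] == '.')
                  return false;
          return true;
  }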
Cc: stable@vger.kernel.org # v6.1+
Signed-off-by: Hobin Woo <hobin.woo@samsung.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When sending an oplock break request, opinfo->conn is used, but a
freed ->conn can end up being used with multichannel.
This patch adds a reference count to the ksmbd_conn struct
so that it can be freed only when it is no longer used.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The Windows client writes with FILE_APPEND_DATA when using git.
ksmbd should allow writing with this flag.
Z:\test>git commit -m "test"
fatal: cannot update the ref 'HEAD': unable to append to
'.git/logs/HEAD': Bad file descriptor
Fixes: 0626e6641f ("cifsd: add server handler for central processing and tranport layers")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
We need to migrate data blocks even though it is full to secure space
for zoned device file pinning.
Fixes: 9703d69d9d ("f2fs: support file pinning for zoned devices")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 9651fcedf7 ("mm: add MAP_DROPPABLE for designating always
lazily freeable mappings") only adds VM_DROPPABLE for 64 bits
architectures.
In order to also use the getrandom vDSO implementation on powerpc/32,
use VM_ARCH_1 for VM_DROPPABLE on powerpc/32. This is possible because
VM_ARCH_1 is used for VM_SAO on powerpc and VM_SAO is only for
powerpc/64. It is used in combination with PROT_SAO in some parts of
code that are restricted to CONFIG_PPC64 through #ifdefs; it is
therefore possible to define VM_SAO for CONFIG_PPC64 only.
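A rough, hedged sketch of the resulting bit assignment (not the exact mm.h
change; the 64-bit VM_DROPPABLE definition is elided):

  /* VM_ARCH_1 is a single arch-specific VMA flag bit. VM_SAO only exists
   * on 64-bit powerpc, so on 32-bit powerpc the same bit can be reused
   * for VM_DROPPABLE, while 64-bit keeps VM_DROPPABLE on a high arch bit. */
  #if defined(CONFIG_PPC64)
  # define VM_SAO        VM_ARCH_1   /* Strong Access Ordering (PROT_SAO) */
  #elif defined(CONFIG_PPC32)
  # define VM_DROPPABLE  VM_ARCH_1   /* droppable mappings for the getrandom() vDSO */
  #endif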
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Fix the calculation of packet signatures by adding the offset into a page
in the read or write data payload when hashing the pages from it.
Fixes: 39bc58203f ("cifs: Add a function to Hash the contents of an iterator")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Negative i_size is never supported, although crafted images with inodes
having negative i_size will NOT lead to security issues in our current
codebase:
The following image can verify this (gzip+base64 encoded):
H4sICCmk4mYAA3Rlc3QuaW1nAGNgGAWjYBSMVPDo4dcH3jP2aTED2TwMKgxMUHHNJY/SQDQX
LxcDIw3tZwXit44MDNpQ/n8gQJZ/vxjijosPuSyZ0DUDgQqcZoKzVYFsDShbHeh6PT29ktTi
Eqz2g/y2pBFiLxDMh4lhs5+W4TAKRsEoGAWjYBSMglEwCkYBPQAAS2DbowAQAAA=
Explicitly mark such corrupted inodes as bad inodes.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240912083538.3011860-1-hsiangkao@linux.alibaba.com
Error out if the {en,de}coded size of a pcluster is unsupported:
Maximum supported encoded size (of a pcluster): 1 MiB
Maximum supported decoded size (of a pcluster): 12 MiB
Users can still choose to use supported large configurations (e.g.,
for archival purposes), but there may be performance penalties in
low-memory scenarios compared to smaller pclusters.
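A hedged sketch of the resulting validation in C (macro and helper names are
illustrative, not necessarily the in-tree ones):

  #include <errno.h>
  #include <stdint.h>

  #ifndef EFSCORRUPTED
  #define EFSCORRUPTED EUCLEAN                      /* kernel filesystems alias it so */
  #endif

  #define PCLUSTER_MAX_ENCODED_SIZE  (1024 * 1024)        /* 1 MiB  */
  #define PCLUSTER_MAX_DECODED_SIZE  (12 * 1024 * 1024)   /* 12 MiB */

  /* Reject pclusters whose compressed (encoded) or decompressed (decoded)
   * size exceeds the supported limits instead of trying to process them. */
  static int check_pcluster_sizes(uint64_t encoded, uint64_t decoded)
  {
          if (encoded > PCLUSTER_MAX_ENCODED_SIZE ||
              decoded > PCLUSTER_MAX_DECODED_SIZE)
                  return -EFSCORRUPTED;
          return 0;
  }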
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240912074156.2925394-1-hsiangkao@linux.alibaba.com
This patch aims to allocate bvpages and short-lived compressed pages
from the reserved pool first.
After applying this patch, there are three benefits.
1. It reduces the page allocation time.
The bvpages and short-lived compressed pages account for about 4% of
the pages allocated from the system in the multi-app launch benchmarks
[1]. It reduces the page allocation time accordingly and lowers the
likelihood of blockage by page allocation in low memory scenarios.
2. The pages in the reserved pool will be allocated on demand.
Currently, bvpages and short-lived compressed pages are short-lived
pages allocated from the system, and the pages in the reserved pool all
originate from short-lived pages. Consequently, the number of reserved
pool pages will increase to z_erofs_rsv_nrpages over time.
With this patch, all short-lived pages are allocated from the reserved
pool first, so the number of reserved pool pages will only increase when
there are not enough pages. Thus, even if z_erofs_rsv_nrpages is set to
a large number for specific reasons, the actual number of reserved pool
pages may remain low as per demand. In the multi-app launch benchmarks
[1], z_erofs_rsv_nrpages is set at 256, while the number of reserved
pool pages remains below 64.
3. When erofs cache decompression is disabled
(EROFS_ZIP_CACHE_DISABLED), all pages will *only* be allocated from
the reserved pool for erofs. This will significantly reduce the memory
pressure from erofs.
[1] For additional details on the multi-app launch benchmarks, please
refer to commit 0f6273ab46 ("erofs: add a reserved buffer pool for lz4
decompression").
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240906121110.3701889-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
With iterative development, our codebase can now deal with compressed
buffer misses properly if both in-place I/O and compressed buffer
allocation fail.
Note that if readahead fails (with non-uptodate folios), the original
request will then fall back to synchronous read, and `.read_folio()`
should return appropriate errnos; otherwise -EIO will be passed to
user space, which is unexpected.
To simplify rarely encountered failure paths, a mimic decompression
will just be used. Before that, failure reasons are recorded in
compressed_bvecs[] and they also act as placeholders to avoid in-place
pages. They will be parsed just before decompression and then passed
back to `.read_folio()`.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240905084732.2684515-1-hsiangkao@linux.alibaba.com
There's now no need to support ITER_XARRAY in cifs as netfslib hands down
ITER_FOLIOQ instead - and that's simpler to use with iterate_and_advance()
as it doesn't hold the RCU read lock over the step function.
This is part of the process of phasing out ITER_XARRAY.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-26-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Switch cifs from using an xarray to hold the transport crypto buffer to
using a folio_queue and use ITER_FOLIOQ rather than ITER_XARRAY.
This is part of the process of phasing out ITER_XARRAY.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-25-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace the bespoke cifs iterators of ITER_BVEC and ITER_KVEC to do hashing
with iterate_and_advance_kernel() - a variant on iterate_and_advance() that
only supports kernel-internal ITER_* types and not UBUF/IOVEC types.
The bespoke ITER_XARRAY is left because we don't really want to be calling
crypto_shash_update() under the RCU read lock for large amounts of data;
besides, ITER_XARRAY is going to be phased out.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-24-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Kafs wants to be able to cache the contents of directories (and symlinks),
but whilst these are downloaded from the server with the FS.FetchData RPC
op and similar, the same as for regular files, they can't be updated by
FS.StoreData, but rather have special operations (FS.MakeDir, etc.).
Now, rather than redownloading a directory's content after each change made
to that directory, kafs modifies the local blob. This blob can be saved
out to the cache, and since it's using netfslib, kafs just marks the folios
dirty and lets ->writepages() on the directory take care of it, as for a
regular file.
This is fine as long as there's a cache: although the upload stream is
disabled, there's a cache stream to drive the procedure. But if the cache
goes away in the meantime, suddenly there's no way to do any writes, the
code gets confused, complains "R=%x: No submit" to dmesg and leaves the
dirty folio hanging.
Fix this by just cancelling the store of the folio if neither stream is
active. (If there's no cache at the time of dirtying, we should just not
mark the folio dirty).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-23-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Because it uses DIO writes, cachefiles is unable to make a write to the
backing file if that write is not aligned to and sized according to the
backing file's DIO block alignment. This makes it tricky to handle a write
to the cache where the EOF on the network file is not correctly aligned.
To get around this, netfslib attempts to tell the driver it is calling how
much more data there is available beyond the EOF that it can use to pad the
write (netfslib preclears the part of the folio above the EOF). However,
it tries to tell the cache what the maximum length is, but doesn't
calculate this correctly; and, in any case, cachefiles actually ignores the
value and just skips the block.
Fix this by:
(1) Change the value passed to indicate the amount of extra data that can
be added to the operation (now ->submit_extendable_to). This is much
simpler to calculate as it's just the end of the folio minus the top
of the data within the folio - rather than having to account for data
spread over multiple folios.
(2) Make cachefiles add some of this data if the subrequest it is given
ends at the network file's i_size if the extra data is sufficient to
pad out to a whole block.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-22-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Improve the efficiency of buffered reads in a number of ways:
(1) Overhaul the algorithm in general so that it's a lot more compact and
split the read submission code between buffered and unbuffered
versions. The unbuffered version can be vastly simplified.
(2) Read-result collection is handed off to a work queue rather than being
done in the I/O thread. Multiple subrequests can be processed
simultaneously.
(3) When a subrequest is collected, any folios it fully spans are
collected and "spare" data on either side is donated to either the
previous or the next subrequest in the sequence.
Notes:
(*) Readahead expansion massively slows down fio, presumably because it
causes a load of extra allocations, both folio and xarray, up front
before RPC requests can be transmitted.
(*) RDMA with cifs does appear to work, both with SIW and RXE.
(*) PG_private_2-based reading and copy-to-cache is split out into its own
file and altered to use folio_queue. Note that the copy to the cache
now creates a new write transaction against the cache and adds the
folios to be copied into it. This allows it to use part of the
writeback I/O code.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Perform AFS read subrequests in a work item rather than in the calling
thread. For normal buffered reads, this will allow the calling thread to
copy data from the pagecache to the application at the same time as the
demarshalling thread is shovelling data from skbuffs into the pagecache.
This will also allow the RA mark to trigger a new read before we've
finished shovelling the data from the current one.
Note: This would be a bit safer if the FS.FetchData RPC ops returned the
metadata (including the data version number) before returning the data.
This would allow me to flush the pagecache before installing the new data.
In future, it may be possible to asynchronously flush the pagecache either
side of the region being read.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-19-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use the new folio_queue structures to simplify the writeback code. The
problem with referring to the i_pages xarray directly is that we may have
gaps in the sequence of folios we're writing from that we need to skip when
we're removing the writeback mark from the folios we're writing back from.
At the moment the code tries to deal with this by carefully tracking the
gaps in each writeback stream (eg. write to server and write to cache) and
divining when there's a gap that spans folios (something that's not helped
by folios not being a consistent size).
Instead, the folio_queue buffer contains pointers only to the folios we're
dealing with, has them in ascending order and indicates a gap by placing
non-consecutive folios next to each other. This makes it possible to
track where we need to clean up to by just keeping track of where we've
processed to on each stream and taking the minimum.
Note that the I/O iterator is always rounded up to the end of the folio,
even if that is beyond the EOF position, so that the cache can do DIO from
the page. The excess space is cleared, though mmapped writes clobber it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-18-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Make the netfs write-side routines use the new folio_queue struct to hold a
rolling buffer of folios, with the issuer adding folios at the tail and the
collector removing them from the head as they're processed instead of using
an xarray.
This will allow a subsequent patch to simplify the write collector.
The primary mark (as tested by folioq_is_marked()) is used to note if the
corresponding folio needs putting.
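As a rough illustration of that pattern (the surrounding variable names
here are assumptions, not the exact netfs code):

  /* Issuer side: append a folio to the tail of the rolling buffer and
   * mark the slot if the folio will need a folio_put() once collected. */
  slot = folioq_append(buffer_tail, folio);
  if (needs_put)
          folioq_mark(buffer_tail, slot);

  /* Collector side: drop the reference only for marked folios. */
  if (folioq_is_marked(folioq, slot))
          folio_put(folioq_folio(folioq, slot));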
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-16-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Make smb_extract_iter_to_rdma() extract page fragments from an ITER_FOLIOQ
iterator into RDMA SGEs.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-15-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that detecting concurrent seeks is done by the filesystems that
require it we can remove f_version and free up 8 bytes for future
extensions.
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-20-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pipes use f_version to defer poll notifications until a write has been
observed. Since multiple files refer to the same struct pipe_inode_info
in their ->private_data, moving it into there isn't feasible since we
would need to introduce an additional pointer indirection.
However, since pipes don't require f_pos_lock we placed a new f_pipe
member into a union with f_pos_lock that pipes can use. This is similar
to what we already do for struct inode where we have additional fields
per file type. This will allow us to fully remove f_version in the next
step.
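Sketched out, the layout change is roughly the following (a simplified
excerpt, not the full struct file definition):

  struct file {
          ...
          union {
                  /* regular files and directories */
                  struct mutex    f_pos_lock;
                  /* pipes: used instead of f_version */
                  u64             f_pipe;
          };
          ...
  };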
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-19-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Only regular files with FMODE_ATOMIC_POS and directories need
f_pos_lock. Place a new f_pipe member in a union with f_pos_lock
that they can use and make them stop abusing f_version in follow-up
patches.
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-18-6d3e4816aa7b@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
syzbot reports a f2fs bug as below:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 58 at kernel/rcu/sync.c:177 rcu_sync_dtor+0xcd/0x180 kernel/rcu/sync.c:177
CPU: 1 UID: 0 PID: 58 Comm: kworker/1:2 Not tainted 6.10.0-syzkaller-12562-g1722389b0d86 #0
Workqueue: events destroy_super_work
RIP: 0010:rcu_sync_dtor+0xcd/0x180 kernel/rcu/sync.c:177
Call Trace:
percpu_free_rwsem+0x41/0x80 kernel/locking/percpu-rwsem.c:42
destroy_super_work+0xec/0x130 fs/super.c:282
process_one_work kernel/workqueue.c:3231 [inline]
process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312
worker_thread+0x86d/0xd40 kernel/workqueue.c:3390
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
As Christian Brauner pointed out [1], the root cause is that f2fs sets
the SB_RDONLY flag in an internal function, rather than setting the flag
under the sb->s_umount semaphore via the remount procedure, so the race
condition below causes this bug:
- freeze_super()
- sb_wait_write(sb, SB_FREEZE_WRITE)
- sb_wait_write(sb, SB_FREEZE_PAGEFAULT)
- sb_wait_write(sb, SB_FREEZE_FS)
- f2fs_handle_critical_error
- sb->s_flags |= SB_RDONLY
- thaw_super
- thaw_super_locked
- sb_rdonly() is true, so it skips
sb_freeze_unlock(sb, SB_FREEZE_FS)
- deactivate_locked_super
Since f2fs has almost the same logic as ext4 [2] when handling a critical
error in the filesystem if it is mounted w/ the errors=remount-ro option:
- set CP_ERROR_FLAG flag which indicates filesystem is stopped
- record errors to superblock
- set SB_RDONLY flag
Once we set the CP_ERROR_FLAG flag, all writable interfaces can detect the
flag and stop any further updates on the filesystem. So, it is safe to not
set the SB_RDONLY flag; let's remove the logic and keep in line w/ ext4 [3].
[1] https://lore.kernel.org/all/20240729-himbeeren-funknetz-96e62f9c7aee@brauner
[2] https://lore.kernel.org/all/20240729132721.hxih6ehigadqf7wx@quack3
[3] https://lore.kernel.org/linux-ext4/20240805201241.27286-1-jack@suse.cz
Fixes: b62e71be21 ("f2fs: support errors=remount-ro|continue|panic mountoption")
Reported-by: syzbot+20d7e439f76bbbd863a7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000b90a8e061e21d12f@google.com/
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We need to introduce a valid block ratio threshold to avoid triggering
excessive GC for zoned devices. The initial value of it is 95%. So, F2FS
will stop the thread from initiating GC for sections having valid blocks
exceeding the ratio.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added control knobs for gc_no_zoned_gc_percent and
gc_boost_zoned_gc_percent.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Under low free section count, we need to use FG_GC instead of BG_GC to
recover free sections.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For the fine tuning of GC behavior, add reserved_segments sysfs node.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can control the scanning window granularity for GC migration. For
more frequent scanning and GC on zoned devices, we need a fine grained
control knob for it.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since we don't have any GC on the device side for zoned devices, we need
more aggressive BG GC. So, tune the parameters for that.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch addresses the problem that when using LFS mode, unused blocks
may occur in f2fs_map_blocks() during block allocation for dio writes.
If a new section is allocated during block allocation, it will not be
included in the map struct by map_is_mergeable() if the LBA of the
allocated block is not contiguous. However, the block already allocated
in this process will remain unused due to the LFS mode.
This patch avoids the possibility of unused blocks by escaping
f2fs_map_blocks() when allocating the last block in a section.
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Some f2fs ioctl interfaces like f2fs_ioc_set_pin_file(),
f2fs_move_file_range(), and f2fs_defragment_range() miss checking the
atomic_write status, which may cause a potential race issue;
fix it.
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
syzbot reports a f2fs bug as below:
kernel BUG at fs/f2fs/inode.c:896!
RIP: 0010:f2fs_evict_inode+0x1598/0x15c0 fs/f2fs/inode.c:896
Call Trace:
evict+0x532/0x950 fs/inode.c:704
dispose_list fs/inode.c:747 [inline]
evict_inodes+0x5f9/0x690 fs/inode.c:797
generic_shutdown_super+0x9d/0x2d0 fs/super.c:627
kill_block_super+0x44/0x90 fs/super.c:1696
kill_f2fs_super+0x344/0x690 fs/f2fs/super.c:4898
deactivate_locked_super+0xc4/0x130 fs/super.c:473
cleanup_mnt+0x41f/0x4b0 fs/namespace.c:1373
task_work_run+0x24f/0x310 kernel/task_work.c:228
ptrace_notify+0x2d2/0x380 kernel/signal.c:2402
ptrace_report_syscall include/linux/ptrace.h:415 [inline]
ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
syscall_exit_work+0xc6/0x190 kernel/entry/common.c:173
syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
syscall_exit_to_user_mode+0x279/0x370 kernel/entry/common.c:218
do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0010:f2fs_evict_inode+0x1598/0x15c0 fs/f2fs/inode.c:896
Online repair on a corrupted directory in f2fs_lookup() can generate
dirty data/meta while racing w/ readonly remount; it may leave a dirty
inode after the filesystem becomes readonly. However, checkpoint()
skips flushing dirty inodes in readonly mode, resulting in the panic
above.
Let's get rid of online repair in f2fs_lookup(), and leave the work
to fsck.f2fs.
Fixes: 510022a858 ("f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries")
Reported-by: syzbot+ebea2790904673d7c618@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000a7b20f061ff2d56a@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Keep the atomic file clean while updating and mark it dirty during commit
in order to avoid the unnecessary and excessive inode updates introduced
by the previous fix.
Fixes: 4bf7832234 ("f2fs: mark inode dirty for FI_ATOMIC_COMMITTED flag")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, where the page
dirty/writeback/unlock are all handled at a different time, not at
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
    0       16K      32K      48K      64K
    |/|     |////////|        |/|
      |                         |
      4K                        52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 immediately, without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have its sectors unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors need to be submitted and thus
need to be properly unlocked later.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_folio_unlock_writer() is already calling
btrfs_folio_end_writer_lock() to do the heavy lifting work, the only
difference is the missing 0 writer check.
Thus there is no need to keep two different functions, move the 0 writer
check into btrfs_folio_end_writer_lock(), and remove
btrfs_folio_unlock_writer().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All cleanup paths lead to btrfs_path_free so path can be defined with
the automatic freeing callback in the following functions:
- btrfs_insert_orphan_item()
- btrfs_del_orphan_item()
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All cleanup paths lead to btrfs_path_free so path can be defined with
the automatic freeing callback in the following functions:
- calculate_emulated_zone_size()
- calculate_alloc_pointer()
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a DEFINE_FREE for struct btrfs_path. This defines a function that
can be called using the __free attribute. Define a macro
BTRFS_PATH_AUTO_FREE to make the declaration of an auto freeing path
very clear.
The intended use is to define the auto free of path in cases where the
path is allocated somewhere at the beginning and freed either on all
error paths or at the end of the function.
  int func() {
          BTRFS_PATH_AUTO_FREE(path);

          if (...)
                  return -ERROR;

          path = alloc_path();
          ...
          if (...)
                  return -ERROR;
          ...
          return 0;
  }
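For reference, a minimal sketch of the two definitions (the exact form in
btrfs may differ slightly):

  DEFINE_FREE(btrfs_free_path, struct btrfs_path *, btrfs_free_path(_T))

  #define BTRFS_PATH_AUTO_FREE(path_name)                               \
          struct btrfs_path *path_name __free(btrfs_free_path) = NULL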
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_folio_end_all_writers() is only utilized in
extent_writepage() as a way to unlock the whole subpage range (for both
successful submission and error handling).
Meanwhile we have a similar function, btrfs_folio_end_writer_lock().
The difference is, btrfs_folio_end_writer_lock() expects a range that is
a subset of the already locked range.
This limit on btrfs_folio_end_writer_lock() is a little overkill,
preventing it from being utilized for error paths.
So here we enhance btrfs_folio_end_writer_lock() to accept a superset of
the locked range, and only end the locked subset.
This means we can replace btrfs_folio_end_all_writers() with
btrfs_folio_end_writer_lock() instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Continue adding const to parameters. This is for clarity and a minor
addition to safety. There are some minor effects in the assembly code
and .ko size measured on a release config.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently BTRFS_I is a static inline function that takes a const inode
and returns a btrfs inode, dropping the 'const' qualifier. This can break
assumptions of the compiler though it seems there's no real case.
To make the parameter and return type consistent regarding const we can
use container_of_const(), which preserves it. However this would not
check the parameter type. To fix that use the same _Generic construct
but implement only the two expected types.
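Conceptually the result looks something like the following sketch (not
necessarily the exact final macro):

  #define BTRFS_I(_inode)                                               \
          _Generic(_inode,                                              \
                   const struct inode *: container_of_const(_inode, struct btrfs_inode, vfs_inode), \
                   struct inode *:       container_of_const(_inode, struct btrfs_inode, vfs_inode))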
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few places that check if we have the inode locked by doing:
ASSERT(inode_is_locked(vfs_inode));
This actually proved to be useful several times as if assertions are
enabled (and by default they are in many distros) it immediately triggers
a crash which is impossible for users to miss.
However that doesn't check if the lock is held by the calling task, so
the check passes if some other task locked the inode.
Using one of the lockdep functions to check the lock is held, like
lockdep_assert_held() for example, does check that the calling task
holds the lock, and if that's not the case it produces a warning and
stack trace in dmesg. However, despite the misleading "assert" in the
name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just
a warning and splat in dmesg, which can easily go unnoticed by users
who may have lockdep enabled.
So add a helper that does the ASSERT() and calls lockdep_assert_held()
immediately after, and use it everywhere we check the inode is locked.
This way, if the lock is held by some other task, we get the warning
in dmesg, which is caught by fstests; it's very helpful during development
and may also be occasionally noticed by users with lockdep enabled.
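A minimal sketch of such a helper (the actual helper name is an
assumption here):

  static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
  {
          /* Crash via ASSERT() if the inode is not locked at all. */
          ASSERT(inode_is_locked(&inode->vfs_inode));
          /* Warn and dump a stack trace if it is locked, but not by us. */
          lockdep_assert_held(&inode->vfs_inode.i_rwsem);
  }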
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even in case of failure we could've discarded some data and userspace
should be made aware of it, so copy fstrim_range to userspace
regardless.
Also make sure to update the trimmed bytes amount even if
btrfs_trim_free_extents fails.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, find_or_create_page() is a compatibility API and can
be replaced with __filemap_get_folio(). Some interfaces have been converted
to use folio before, so the conversion operation from page can be
eliminated here.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can be
directly used in folio without converting to page.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
There is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And page_to_inode() can be replaced with folio_to_inode() now.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use folio_pos() instead of page_offset(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use folio_pos() instead of page_offset(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Moreover, use kmap_local_folio() instead of kmap_local_page(),
which is more consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And use folio_pos instead of page_offset, which is more
consistent with folio usage. At the same time, folio_test_private() can
handle folio directly without converting from page to folio first.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Use folio_pos instead of page_offset, which is more
consistent with folio usage.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Now clear_page_extent_mapped() can deal with a folio
directly, so change its name to clear_folio_extent_mapped().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the btrfs compression path is not really subpage compatible,
everything is still done in page units.
That's fine for regular sector size and the subpage routine, as even for
the subpage routine compression is only enabled if the whole range is page
aligned, so reading the page cache in page units is totally fine.
However in preparation for the future subpage perfect compression
support, we need to change the compression routine to properly handle a
subpage range.
This patch prepares both zlib and zstd to only read the subpage
range for compression.
Lzo is already doing subpage aware read, as lzo's on-disk format is
already sectorsize dependent.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only two differences between the two functions:
- btrfs_orig_bbio_end_io() does extra error propagation
This is mostly to allow tolerance for write errors.
- btrfs_orig_bbio_end_io() does extra pending_ios check
This check can handle both the original bio and the cloned one.
(All accounting happens in the original one).
This makes btrfs_orig_bbio_end_io() a much safer call.
In fact we already had a double freeing error due to usage of
btrfs_bio_end_io() in the error path of btrfs_submit_chunk().
So just move the whole content of btrfs_orig_bbio_end_io() into
btrfs_bio_end_io().
For normal paths this brings no change, because they are already calling
btrfs_orig_bbio_end_io() in the first place.
For error paths (not only inside bio.c but also external callers), this
change will introduce extra checks, especially for external callers, as
they will error out without submitting the btrfs bio.
But considering it's already in the error path, such slower but much
safer checks are still an overall win.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Historically we've held the extent lock throughout the entire read.
There have been a few reasons for this, but it's mostly just caused us
problems. For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock. This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.
On the buffered side we are protected by the page lock: as long as we're
reading, things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.
On the direct side we have the dio extent lock, which acts much like the
way the extent lock worked previously to this patch, however just for
direct reads. This protects direct reads from concurrent direct writes,
while we're protected from buffered writes via the inode lock.
Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation. Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map. This portion was contributed by Goldwyn.
Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we hold the extent lock for the entire duration of a read.
This isn't really necessary in the buffered case, we're protected by the
page lock, however it's necessary for O_DIRECT.
For O_DIRECT reads, if we only locked the extent for the part where we
get the extent, we could potentially race with an O_DIRECT write in the
same region. This isn't really a problem, unless the read is delayed so
much that the write does the COW, unpins the old extent, and some other
application re-allocates the extent before the read is actually able to
be submitted. At that point at best we'd have a checksum mismatch, but
at worst we could read data that doesn't belong to us.
To address this potential race we need to make sure we don't have
overlapping, concurrent direct io reads and writes.
To accomplish this use the new EXTENT_DIO_LOCKED bit in the direct IO
case in the same spot as the current extent lock. The writes will take
this while they're creating the ordered extent, which is also used to
make sure concurrent buffered reads or concurrent direct reads are not
allowed to occur, and drop it after the ordered extent is taken. For
reads it will act as the current read behavior for the EXTENT_LOCKED
bit, we set it when we're starting the read, we clear it in the end_io
to allow other direct writes to continue.
This still has the drawback of disallowing concurrent overlapping direct
reads from occurring, but that exists with the current extent locking.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to support dropping the extent lock during a read we need a way
to make sure that direct reads and direct writes for overlapping ranges
are protected from each other. To accomplish this introduce another
lock bit specifically for direct io. Subsequent patches will utilize
this to protect direct IO operations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Defrag ioctl passes readahead from the file, but autodefrag does not
have a file so the readahead state is allocated when needed.
The autodefrag loop in cleaner thread iterates over inodes so we can
simply provide an on-stack readahead state and will not need to allocate
it in btrfs_defrag_file(). The size is 32 bytes which is acceptable.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller, inode_should_defrag(), that passes NULL to
btrfs_add_inode_defrag() so we can drop it and simplify the code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The potential memory allocation failure is not a fatal error, skipping
autodefrag is fine and the caller inode_should_defrag() does not care
about the errors. Further writes can attempt to add the inode back to
the defragmentation list again.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient. Each time it needs to locate the first node, remove it,
and potentially rebalance the tree until it's done. This approach does
allow a conditional reschedule, though.
For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
convenient but we can't reschedule and restart iteration because some of
the tree nodes would be already freed.
The cleanup operation is kmem_cache_free() which will likely take the
fast path for most objects so rescheduling should not be necessary.
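The resulting cleanup loop is roughly the following sketch (assuming the
existing defrag structures):

  struct inode_defrag *defrag, *next;

  rbtree_postorder_for_each_entry_safe(defrag, next,
                                       &fs_info->defrag_inodes, rb_node)
          kmem_cache_free(btrfs_inode_defrag_cachep, defrag);

  fs_info->defrag_inodes = RB_ROOT;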
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Also update the misleading comment, the passed item is not freed, that's
what the caller does.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A comparator function does not change its parameters, make them const.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previous patch freed the function name btrfs_submit_bio() so we can use
it for a helper that submits struct bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function name is a bit misleading as it submits the btrfs_bio
(bbio), rename it so we can use btrfs_submit_bio() when an actual bio is
submitted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member btrfs_fs_info::subpage_info stores the cached bitmap start
position inside the merged bitmap.
However in reality there is only one thing depending on the sectorsize,
bitmap_nr_bits, which records the number of sectors that fit inside a
page.
The sub-bitmaps have a fixed order, thus it's just a quick
multiplication to calculate the start position of each sub-bitmap.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.
Then callers check @nr_ret value to determine if we need to manually
clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() called thus
PAGECACHE_TAG_DIRTY tag will not be cleared.
Remove this parameter by:
- Moving the btrfs_folio_clear_writeback() call into
__extent_writepage_io()
So that if we didn't submit any IO, then manually call
btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
the page is no longer dirty.
- Use a bool to record if we have submitted any sector
Instead of an int.
- Use subpage compatible helpers to end folio writeback.
This brings no change to the behavior, just for the sake of consistency.
As for the call site inside __extent_writepage(), we're always called
for the whole page, so the existing full page helper
folio_(start|end)_writeback() is totally fine.
For the call site inside extent_write_locked_range(), although we can
have subpage range, folio_start_writeback() will only clear
PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
will still be dirty if there is any subpage dirty range.
Only when the last dirty subpage sector is cleared, the
folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.
So no matter if we call the full page or subpage helper, the result
is still the same, then just use the subpage helpers for consistency.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a few obvious grammar mistakes: a -> an, then -> than.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use xarray to track dirty extents to reduce the size of the struct
btrfs_qgroup_extent_record from 64 bytes to 40 bytes. The xarray is
more cache line friendly, it also reduces the complexity of insertion
and search code compared to rb tree.
Another change introduced is about error handling. Before this patch,
the result of btrfs_qgroup_trace_extent_nolock() is always a success. In
this patch, because this function calls xa_store(), which can fail, we
mark the qgroup as inconsistent if an error happens and then free the
preallocated memory. Also, we preallocate memory before spin_lock(); if
memory preallocation fails, the error handling is the same as the
existing code.
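A rough sketch of the preallocate-then-store pattern described above
(field and helper names are illustrative, not the exact btrfs code):

  record = kzalloc(sizeof(*record), GFP_NOFS);    /* preallocate outside the lock */
  if (!record)
          return -ENOMEM;

  spin_lock(&delayed_refs->lock);
  ret = xa_err(xa_store(&delayed_refs->dirty_extents, index, record, GFP_ATOMIC));
  spin_unlock(&delayed_refs->lock);
  if (ret) {
          /* Insertion failed: mark qgroup inconsistent, drop the record. */
          btrfs_qgroup_mark_inconsistent(fs_info);
          kfree(record);
          return ret;
  }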
Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Clean up resources using goto to get rid of repeated code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for the non-subpage case, as we cleared the
page dirty flag before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such a case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previously
we only needed to do one extent_map lookup per dirty range, but now it
will be one extent_map lookup per sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
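As a sketch of the flow (variable and helper names follow the description
above rather than the literal kernel code):

  unsigned long range_bitmap = 0;
  unsigned int bit;

  /* Non-subpage case: the default dirty bitmap is simply 1. The subpage
   * case would instead extract the dirty bitmap for this folio. */
  bitmap_set(&range_bitmap, 0, nr_sectors);

  bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap,
             &range_bitmap, nr_sectors);

  for_each_set_bit(bit, &bio_ctrl->submit_bitmap, nr_sectors) {
          ret = submit_one_sector(inode, folio, bit, bio_ctrl, i_size);
          if (ret < 0)
                  return ret;
  }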
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for
debug") an internal macro GET_SUBPAGE_BITMAP() is introduced to grab the
bitmap of each attribute.
But that commit is using bitmap_cut() which will do the left shift of
the larger bitmap, causing incorrect values.
Thankfully this bitmap_cut() is only called for debug usage, and so far
it's not yet causing problems.
Fix it to use bitmap_read() to only grab the desired sub-bitmap.
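In other words (an illustrative fragment, not the literal macro):

  /* bitmap_cut() also left-shifts the remainder of the source bitmap,
   * corrupting the merged bitmap we only want to dump for debugging.
   * bitmap_read() returns up to BITS_PER_LONG bits starting at @start
   * and leaves the source untouched: */
  val = bitmap_read(subpage->bitmaps, start_bit, nbits);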
Fixes: 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for debug")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're calling btrfs_num_copies() before btrfs_get_chunk_map() in
btrfs_map_block(). But btrfs_num_copies() itself does a chunk map lookup
to be able to calculate the number of copies.
So split out the code getting the number of copies from btrfs_num_copies()
into a helper called btrfs_chunk_map_num_copies() and directly call it
from btrfs_map_block() and btrfs_num_copies().
This saves us one rbtree lookup per btrfs_map_block() invocation.
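Schematically, the split looks like this (signatures assumed, copy-count
details omitted):

  /* Works directly on an already-looked-up chunk map. */
  int btrfs_chunk_map_num_copies(const struct btrfs_chunk_map *map);

  int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
  {
          struct btrfs_chunk_map *map;
          int ret;

          map = btrfs_get_chunk_map(fs_info, logical, len);
          if (IS_ERR(map))
                  return 1;

          ret = btrfs_chunk_map_num_copies(map);
          btrfs_free_chunk_map(map);
          return ret;
  }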
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The BTRFS_IOC_SYNC ioctl wants to wake up the cleaner kthread so that it
does any pending work (subvolume deletion, delayed iputs, etc), however
it is waking up the transaction kthread, which in turn wakes up the
cleaner. Since we don't have any transaction to commit, as any ongoing
transaction was already committed when it called btrfs_sync_fs() and
the goal is just to wake up the cleaner thread, directly wake up the
cleaner instead of the transaction kthread.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs only supports sectorsize 4K, 8K, 16K, 32K, 64K for now, thus for
systems with 4K page size, there is no way the fs is subpage (sectorsize
< PAGE_SIZE).
So here we define btrfs_is_subpage() differently according to the
PAGE_SIZE:
- PAGE_SIZE > 4K
We may hit real subpage cases, define btrfs_is_subpage() as a regular
function and do the usual checks.
- PAGE_SIZE == 4K (no smaller PAGE_SIZE support AFAIK)
There is no way the fs is subpage, so just define btrfs_is_subpage()
as an inline function which always returns false.
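A minimal sketch of the split definition (the parameter list here is an
assumption):

  #if PAGE_SIZE > SZ_4K
  /* Real subpage cases are possible, do the usual sectorsize checks. */
  bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info, struct folio *folio);
  #else
  static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
                                      struct folio *folio)
  {
          return false;
  }
  #endif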
This saves about 7K bytes for x86_64 debug builds:
text data bss dec hex filename
Before: 1484452 168693 25776 1678921 199e49 fs/btrfs/btrfs.ko
After: 1476605 168445 25776 1670826 197eaa fs/btrfs/btrfs.ko
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that RAID stripe-tree lookup failures are not treated as a fatal issue
any more, change the RAID stripe-tree lookup error message to debug level.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Set rst_search_commit_root in the btrfs_io_stripe we're passing to
btrfs_map_block() in case we're doing data relocation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename 'btrfs_io_stripe::is_scrub' to 'rst_search_commit_root'. While
'is_scrub' describes the state of the io_stripe (it is a stripe submitted
by scrub) it does not describe the purpose, namely looking at the commit
root when searching RAID stripe-tree entries.
Renaming the member to rst_search_commit_root describes this purpose.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This just creates unnecessary noise and doesn't provide any insights into
debugging RAID stripe-tree related issues.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a comment to document the complicated locked_page unlock logic in
cow_file_range_inline. The specifically tricky part is that a caller
just up the stack converts ret == 0 to ret == 1 and then another
caller far up the callstack handles ret == 1 as a success, AND returns
without cleanup in that case, both of which "feel" unnatural and led to
the original bug.
Try to document that somewhat specific callstack logic here to explain
the weird un-setting of locked_folio on success.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When iterating the chunk maps when a device replace finishes we are doing
a full rbtree search for each chunk map, which is not the most efficient
thing to do, wasting CPU time. As we are holding a write lock on the tree
during the whole iteration, we can simply start from the first node in the
tree and then move to the next chunk map by doing a rb_next() call - the
only exception is when we need to reschedule, in which case we have to do
a full rbtree search since we dropped the write lock and the tree may have
changed (chunk maps may have been removed and the tree got rebalanced).
So just do that.
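A sketch of the resulting iteration (structure, field and helper names
here are illustrative only):

  struct rb_node *node;

  write_lock(&map_tree->lock);
  node = rb_first(&map_tree->root);
  while (node) {
          struct btrfs_chunk_map *map;
          u64 next_start;

          map = rb_entry(node, struct btrfs_chunk_map, rb_node);
          /* ... repoint stripes from the source to the target device ... */

          if (!need_resched()) {
                  node = rb_next(node);
                  continue;
          }

          /* Drop the lock to reschedule; the tree may get rebalanced, so
           * fall back to a full search for the next chunk map. */
          next_start = map->start + map->chunk_len;
          write_unlock(&map_tree->lock);
          cond_resched();
          write_lock(&map_tree->lock);
          node = find_first_chunk_map_node(map_tree, next_start); /* hypothetical */
  }
  write_unlock(&map_tree->lock);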
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the end of a device replace we must go over all the chunk maps and
update their stripes to point to the target device instead of the source
device. We iterate over the chunk maps while holding a write lock and
we never reschedule, which can result in monopolizing a CPU for too long
and blocking readers for too long (it's a rw lock, non-blocking).
So improve on this by rescheduling if necessary. This is safe because at
this point we are holding the chunk mutex, which means no new chunks can
be allocated and therefore we don't risk missing a new chunk map that
covers a range behind the last one we processed before rescheduling.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of getting a page and using that to clear dirty for io, use the
folio helper and use the appropriate folio functions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only use a page to copy in the data for the inline extent. Use a
folio for this instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a lot of functions here that use folios, update the
function to use __filemap_get_folio instead of find_get_page and then
use the folio directly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently this already uses a folio for most things, update it to take a
folio and update all the page usage with the corresponding folio usage.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a folio some in this function, replace all page usage
with the folio and update the function to take the folio as an argument.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_get_extent takes a folio, update __get_extent_map to
take a folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this into read_inline_extent, change it to take a folio and
update the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a page, use a folio instead, take a folio as an
argument, and update the callers appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update uncompress_inline to take a folio and update its usage
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now the fixup creator and consumer use folios, change this to use a
folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of a page, use a folio for btrfs_writepage_cow_fixup. We
already have a folio at the only caller, and the fixup worker uses
folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function heavily messes with pages, instead update it to use a
folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This mostly uses folios already, update it to take a folio and update
the rest of the function to use the folio instead of the page.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing in the page for ->locked_page, make it hold a
locked_folio and then update the users of async_chunk to act
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that every function that btrfs_run_delalloc_range calls takes a
folio, update it to take a folio and update the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This just passes the page into the compressed machinery to keep track of
the locked page. Update this to take a folio and convert it to a page
where appropriate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_cleanup_ordered_extents is operating mostly with folios,
update it to use a folio instead of a page, and the update the function
and the callers as appropriate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We walk through pages in this function and clear ordered, and the
function for this uses folios. Update the function to use a folio for
this whole operation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now all of the functions that use locked_page in run_delalloc_nocow take
a folio, update it to take a folio and update the caller.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With this we can pass the folio directly into cow_file_range().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert this to take a folio and pass it into all of the various cleanup
functions. Update the callers to pass in a folio instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we want the folio in this function, convert it to take a folio
directly and use that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass the folio into extent_write_locked_range, go ahead and take a
folio to pass along, and update the callers to pass in a folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This mostly uses folios, convert it to take a folio instead and update
the callers to pass in the folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of taking the locked page, take the locked folio so we can pass
that into __process_folios_contig.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that this mostly uses folios, update it to take folios, use the
folios that are passed in, and rename from process_one_page =>
process_one_folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This operates mostly on folios, update it to take a folio for the locked
folio instead of the page, rename from __process_pages_contig =>
__process_folios_contig.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of the callers have a folio at this point, update
__unlock_for_delalloc to take a folio so that it's consistent with its
callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also rename lock_delalloc_pages => lock_delalloc_folios in the process,
now that it exclusively works on folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing in a page for locked_page, pass in the folio instead.
We only use the folio itself to validate some range assumptions, and
then pass it into other functions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a folio heavily in this function, pass the folio in
directly and use it everywhere, only passing the page down to functions
that do not take a folio yet.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only need a folio now, make it take a folio as an argument and update
all of the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callers and callees of this now all use folios; update it to take a
folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass in a folio instead, and use a folio instead of a page.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have a folio that we're using in btrfs_page_mkwrite, update
the rest of the function to use folio everywhere else. This will make
it easier on Willy when he drops page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Willy is going to get rid of page->index, and add_ra_bio_pages uses
page->index. Make his life easier by converting add_ra_bio_pages to use
folios so that we are no longer using page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've gotten most of the helpers updated to only take a folio,
update __extent_writepage to only deal in folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using pages for everything, find a folio and use that. This
makes things a bit cleaner as a lot of the functions calls here all take
folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage_io uses page everywhere, but a lot of these functions
take a folio. Convert it to use the folio based helpers, and then
change it to take a folio as an argument and update its callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Willy wants to get rid of page->index; convert the writepage
tracepoint to take a folio so we can do folio->index instead of
page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the callers and helpers mostly use folio, convert
btrfs_do_readpage to take a folio, and rename it to btrfs_do_read_folio.
Update all of the page stuff to use the folio based helpers instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callers of this helper are going to be converted to using a folio,
so adjust submit_extent_page to become submit_extent_folio and update it
to use all the relevant folio helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This already uses a folio internally, change it to take a folio as an
argument instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this helper function to set the page range uptodate once we're
done reading it, as well as run fsverity against it. Half of these
functions already take a folio, just rename this to end_folio_read and
then rework it to take a folio instead, and update everything
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're using the page for everything here. Convert this to use
the folio helpers instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're the only user of readahead_page_batch(). Convert
btrfs_readahead() to use the folio based helpers to do readahead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ENHANCEMENT]
When mounting a btrfs filesystem, the filesystem opens the block device,
and if this fails, there is no message about it. Print a message about
it to help debugging.
[TEST]
I have a btrfs filesystem on three block devices, one of which is
write-protected, so regular mounts fail, but there is no message in
dmesg.
/dev/vdb normal
/dev/vdc write protected
/dev/vdd normal
Before patch:
$ sudo mount /dev/vdb /mnt/
mount: mount(2) failed: no such file or directory
$ sudo dmesg # Show only messages about missing block devices
....
[ 352.947196] BTRFS error (device vdb): devid 2 uuid 4ee2c625-a3b2-4fe0-b411-756b23e08533 missing
....
After patch:
$ sudo mount /dev/vdb /mnt/
mount: mount(2) failed: no such file or directory
$ sudo dmesg # Show bdev_file_open_by_path failed.
....
[ 352.944328] BTRFS error: failed to open device for path /dev/vdc with flags 0x3: -13
[ 352.947196] BTRFS error (device vdb): missing devid 2 uuid 4ee2c625-a3b2-4fe0-b411-756b23e08533
....
Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions btrfs_uuid_scan_kthread() and btrfs_create_uuid_tree() are for
UUID tree rescan and creation; they are not suitable for volumes.[ch].
Move them to uuid-tree.[ch] instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At extent_map_block_end() we are calling the inline functions
extent_map_block_start() and extent_map_block_len() multiple times, which
results in expanding their code multiple times, increasing the compiled
code size and repeating the computations those functions do.
Improve this by caching their results in local variables.
The size of the module before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1755770 163800 16920 1936490 1d8c6a fs/btrfs/btrfs.ko
And after this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1755656 163800 16920 1936376 1d8bf8 fs/btrfs/btrfs.ko
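For illustration, a hedged sketch of the resulting helper (simplified; the
exact btrfs code may differ slightly):
```
/* Sketch: call each inline helper once and cache the result in a local,
 * instead of expanding extent_map_block_start()/extent_map_block_len()
 * several times. */
static inline u64 extent_map_block_end(const struct extent_map *em)
{
	const u64 block_start = extent_map_block_start(em);
	const u64 block_len = extent_map_block_len(em);

	if (block_start + block_len < block_start)
		return (u64)-1;

	return block_start + block_len;
}
```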
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_delete_raid_extent() was written under the assumption that its
call chain always passes a start, length tuple that matches a single
extent. But btrfs_delete_raid_extent() is called by
do_free_extent_accounting(), which in turn is called by
__btrfs_free_extent(), and this call chain passes in a start address
and a length that can possibly match multiple on-disk extents.
To make this possible, we have to adjust the start and length of each
btree node lookup, so that we do not delete beyond the requested range.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update a stripe extent in case of an already existing logical address,
but with different physical addresses and/or device id instead of
bailing out with EEXIST.
This can happen, for example, in case of a device replace operation, where
data extents get rewritten to a new disk.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge prerequisites from the vfs git tree for the following series that
introduces kmem_cache_args. The vfs.file branch includes the addition of
kmem_cache_create_rcu() which was needed in vfs for the filp cache
optimization. The following series refactors this code.
iomap_file_buffered_write_punch_delalloc can only return errors if either
the ->punch callback returned an error, or if someone changed the API of
mapping_seek_hole_data to return a negative error code that is not
-ENXIO.
As the only instance of ->punch never returns an error, and such an error
would be fatal anyway, remove the entire error propagation and don't
return an error code from iomap_file_buffered_write_punch_delalloc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-6-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
XFS will need to look at the flags in the iomap structure, so pass it
down all the way to the callback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-5-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
To fix short write error handling, we'll need to figure out what operation
iomap_file_buffered_write_punch_delalloc is called for. Pass the flags
argument on to it, and reorder the argument list to match that of
->iomap_end so that the compiler only has to add the new punch argument
to the end of it instead of reshuffling the registers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-4-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently iomap_unshare_iter relies on the IOMAP_F_SHARED flag to detect
blocks to unshare. This is reasonable, but IOMAP_F_SHARED is also useful
for the file system to do internal bookkeeping for out of place writes.
XFS used to do that, until it got removed in commit 72a048c105
("xfs: only set IOMAP_F_SHARED when providing a srcmap to a write")
because unshare would incorrectly unshare such blocks.
Add an extra safeguard by checking the explicitly provided srcmap instead
of the fallback to the iomap for valid data, as that easily catches the
case where we'd just copy from the same place we'd write to, allowing us
to reinstate setting IOMAP_F_SHARED for all XFS writes that go to the
COW fork.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-3-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When direct I/O completion invalidates the page cache, it holds neither the
i_rwsem nor the invalidate_lock, so it can race with
iomap_write_delalloc_release. If the search for the end of the region that
contains data returns the start offset, we hit such a race and just need to
look for the end of the newly created hole instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240910043949.3481298-2-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Get rid of redundant variables (nblocks, offset) and a dead branch
(!tailendpacking).
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240905030339.1474396-1-hongzhen@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Refactor the iop binding behavior out of erofs_fill_symlink
and move erofs_buf into erofs_read_inode, so that erofs_fill_inode
only deals with inode operation bindings and can be decoupled from
metabuf operations. This results in better calling conventions.
Note that after this patch, we no longer need erofs_buf and ofs as
parameters when calling erofs_read_inode, as all the data operations
are now included in it.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20240425222847.GN2118490@ZenIV/
Signed-off-by: Yiyang Wu <toolmanp@tlmp.cc>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240902093412.509083-1-toolmanp@tlmp.cc
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Although fscache is still described as "General Filesystem Caching" for
network filesystems and other things such as ISO9660 filesystems, it has
actually become a part of netfslib recently, which was unexpected at the
time when "EROFS over fscache" proposed (2021) since EROFS is entirely a
disk filesystem and the dependency is redundant.
Mark it deprecated and it will be removed after "fanotify pre-content
hooks" lands, which will provide the same functionality for EROFS.
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240830032840.3783206-4-hsiangkao@linux.alibaba.com
Use pseudo bios just like the previous fscache approach since
merged bio_vecs can be filled properly with unique interfaces.
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240830032840.3783206-3-hsiangkao@linux.alibaba.com
Since EROFS only needs to handle read requests in simple contexts,
just directly use vfs_iocb_iter_read() for data I/Os.
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240905093031.2745929-1-hsiangkao@linux.alibaba.com
It actually has been around for years: For containers and other sandbox
use cases, there will be thousands (and even more) of authenticated
(sub)images running on the same host, unlike OS images.
Of course, all scenarios can use the same EROFS on-disk format, but
bdev-backed mounts just work well for OS images since golden data is
dumped into real block devices. However, it's somewhat hard for
container runtimes to manage and isolate so many unnecessary virtual
block devices safely and efficiently [1]: they just look like a burden
to orchestrators and file-backed mounts are preferred indeed. There
were already enough attempts such as Incremental FS, the original
ComposeFS and PuzzleFS acting in the same way for immutable fses. As
for current EROFS users, ComposeFS, containerd and Android APEXs will
be directly benefited from it.
On the other hand, previous experimental feature "erofs over fscache"
was once also intended to provide a similar solution (inspired by
Incremental FS discussion [2]), but the following facts show file-backed
mounts will be a better approach:
- Fscache infrastructure has recently been moved into the new Netfslib,
which is really an unexpected dependency for EROFS, although it
originally claimed "it could be used for caching other things such as
ISO9660 filesystems too." [3]
- It takes an unexpectedly long time to upstream Fscache/Cachefiles
enhancements. For example, the failover feature took more than
one year, and the daemonless feature is still far behind now;
- Ongoing HSM "fanotify pre-content hooks" [4] together with this will
perfectly supersede "erofs over fscache" in a simpler way since
developers (mainly containerd folks) could leverage their existing
caching mechanism entirely in userspace instead of strictly following
the predefined in-kernel caching tree hierarchy.
After "fanotify pre-content hooks" lands upstream to provide the same
functionality, "erofs over fscache" will be removed then (as an EROFS
internal improvement and EROFS will not have to bother with on-demand
fetching and/or caching improvements anymore.)
[1] https://github.com/containers/storage/pull/2039
[2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
[3] https://docs.kernel.org/filesystems/caching/fscache.html
[4] https://lore.kernel.org/r/cover.1723670362.git.josef@toxicpanda.com
Closes: https://github.com/containers/composefs/issues/144
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240830032840.3783206-1-hsiangkao@linux.alibaba.com
syzbot reported a task hang issue due to a deadlock case where it is
waiting for the folio lock of a cached folio that will be used for
cache I/Os.
After looking into the crafted fuzzed image, I found it's formed with
several overlapped big pclusters as below:
Ext: logical offset | length : physical offset | length
0: 0.. 16384 | 16384 : 151552.. 167936 | 16384
1: 16384.. 32768 | 16384 : 155648.. 172032 | 16384
2: 32768.. 49152 | 16384 : 537223168.. 537239552 | 16384
...
Here, extent 0/1 are physically overlapped although it's entirely
_impossible_ for normal filesystem images generated by mkfs.
First, managed folios containing compressed data will be marked as
up-to-date and then unlocked immediately (unlike in-place folios) when
compressed I/Os are complete. If physical blocks are not submitted in
the incremental order, there should be separate BIOs to avoid dependency
issues. However, the current code mis-arranges z_erofs_fill_bio_vec()
and BIO submission which causes unexpected BIO waits.
Second, managed folios will be connected to their own pclusters for
efficient inter-queries. However, this is somewhat hard to implement
easily if overlapped big pclusters exist. Again, these only appear in
fuzzed images so let's simply fall back to temporary short-lived pages
for correctness.
Additionally, it justifies that referenced managed folios cannot be
truncated for now and reverts part of commit 2080ca1ed3 ("erofs: tidy
up `struct z_erofs_bvec`") for simplicity although it shouldn't be any
difference.
Reported-by: syzbot+4fc98ed414ae63d1ada2@syzkaller.appspotmail.com
Reported-by: syzbot+de04e06b28cfecf2281c@syzkaller.appspotmail.com
Reported-by: syzbot+c8c8238b394be4a1087d@syzkaller.appspotmail.com
Tested-by: syzbot+4fc98ed414ae63d1ada2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000002fda01061e334873@google.com
Fixes: 8e6c8fa9f2 ("erofs: enable big pcluster feature")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240910070847.3356592-1-hsiangkao@linux.alibaba.com
Return 0 instead of the sizeof(ocfs2_global_disk_dqinfo) value that
.quota_read returns in the normal case. Also clean up mlog to make the
code more readable.
Link: https://lkml.kernel.org/r/20240904071004.2067695-2-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function nilfs_btree_check_delete(), which checks whether degeneration
to direct mapping occurs before deleting a b-tree entry, causes memory
access outside the block buffer when retrieving the maximum key if the
root node has no entries.
This does not usually happen because b-tree mappings with 0 child nodes
are never created by mkfs.nilfs2 or nilfs2 itself. However, it can happen
if the b-tree root node read from a device is configured that way, so fix
this potential issue by adding a check for that case.
Link: https://lkml.kernel.org/r/20240904081401.16682-4-konishi.ryusuke@gmail.com
Fixes: 17c76b0104 ("nilfs2: B-tree based block mapping")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Due to the nature of b-trees, nilfs2 itself and admin tools such as
mkfs.nilfs2 will never create an intermediate b-tree node block with 0
child nodes, nor will they delete (key, pointer)-entries that would result
in such a state. However, it is possible that a b-tree node block is
corrupted on the backing device and is read with 0 child nodes.
Because operation is not guaranteed if the number of child nodes is 0 for
intermediate node blocks other than the root node, modify
nilfs_btree_node_broken(), which performs sanity checks when reading a
b-tree node block, so that such cases will be judged as metadata
corruption.
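As a rough, hedged illustration of the kind of check being added (a generic
sketch with hypothetical names, not the actual nilfs2 code):
```
/* Sketch: an intermediate node block read from disk that claims zero
 * children is treated as metadata corruption; only leaf-level nodes may
 * legitimately be empty. */
static bool btree_intermediate_node_empty(int level, int nchildren)
{
	return level > 0 && nchildren == 0;	/* corrupted if true */
}
```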
Link: https://lkml.kernel.org/r/20240904081401.16682-3-konishi.ryusuke@gmail.com
Fixes: 17c76b0104 ("nilfs2: B-tree based block mapping")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "nilfs2: fix potential issues with empty b-tree nodes".
This series addresses three potential issues with empty b-tree nodes that
can occur with corrupted filesystem images, including one recently
discovered by syzbot.
This patch (of 3):
If a b-tree is broken on the device, and the b-tree height is greater than
2 (the level of the root node is greater than 1) even if the number of
child nodes of the b-tree root is 0, a NULL pointer dereference occurs in
nilfs_btree_prepare_insert(), which is called from nilfs_btree_insert().
This is because, when the number of child nodes of the b-tree root is 0,
nilfs_btree_do_lookup() does not set the block buffer head in any of
path[x].bp_bh, leaving it as the initial value of NULL, but if the level
of the b-tree root node is greater than 1, nilfs_btree_get_nonroot_node(),
which accesses the buffer memory of path[x].bp_bh, is called.
Fix this issue by adding a check to nilfs_btree_root_broken(), which
performs sanity checks when reading the root node from the device, to
detect this inconsistency.
Thanks to Lizhi Xu for trying to solve the bug and clarifying the cause
early on.
Link: https://lkml.kernel.org/r/20240904081401.16682-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20240902084101.138971-1-lizhi.xu@windriver.com
Link: https://lkml.kernel.org/r/20240904081401.16682-2-konishi.ryusuke@gmail.com
Fixes: 17c76b0104 ("nilfs2: B-tree based block mapping")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+9bff4c7b992038a7409f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9bff4c7b992038a7409f
Cc: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When strict percpu address space checks are enabled, the current direct
casts between the percpu address space and the generic address space fail
to compile on x86_64 with:
decompressor_multi_percpu.c: In function `squashfs_decompressor_create':
decompressor_multi_percpu.c:49:16: error: cast to generic address space pointer from disjoint `__seg_gs' address space pointer
decompressor_multi_percpu.c: In function `squashfs_decompressor_destroy':
decompressor_multi_percpu.c:64:25: error: cast to `__seg_gs' address space pointer from disjoint generic address space pointer
decompressor_multi_percpu.c: In function `squashfs_decompress':
decompressor_multi_percpu.c:82:25: error: cast to `__seg_gs' address space pointer from disjoint generic address space pointer
Add intermediate casts to unsigned long, as advised in [1] and [2].
Side note: sparse still requires __force when casting from the percpu
address space, although the documentation [2] allows casts to unsigned
long without __force attribute.
Found by GCC's named address space checks.
There were no changes in the resulting object file.
[1] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html#x86-Named-Address-Spaces
[2] https://sparse.docs.kernel.org/en/latest/annotations.html#address-space-name
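As a hedged sketch of the resulting pattern (hypothetical variable names;
the real patch touches the three squashfs functions listed in the errors
above):
```
/* Sketch: a __percpu pointer cannot be cast directly to a generic pointer
 * under strict named-address-space checks; cast through unsigned long. */
void __percpu *percpu = alloc_percpu(struct squashfs_stream);

/* void *generic = (void *)percpu;               rejected by the checks */
void *generic = (void *)(unsigned long)percpu;	/* accepted */
```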
Link: https://lkml.kernel.org/r/20240830091104.13049-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Take the end of a file write into consideration when deciding whether or
not to use huge pages for tmpfs files when the tmpfs filesystem is mounted
with huge=within_size.
This allows large writes that append to the end of a file to automatically
use large pages.
With fio doing 4MB sequential writes without fallocate to a 16GB tmpfs
file, the numbers without THP or with huge=always stay the same, but the
performance with huge=within_size now matches that of huge=always.
huge          before       after
4kB pages     1560 MB/s    1560 MB/s
within_size   1560 MB/s    4720 MB/s
always        4720 MB/s    4720 MB/s
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20240903111928.7171e60c@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2_global_read_info() will initialize and schedule dqi_sync_work at the
end. If an error occurs after successfully reading the global quota, it
will trigger the following warning with CONFIG_DEBUG_OBJECTS_* enabled:
ODEBUG: free active (active state 0) object: 00000000d8b0ce28 object type: timer_list hint: qsync_work_fn+0x0/0x16c
This reports that there is an active delayed work when freeing oinfo in
error handling, so cancel dqi_sync_work first. BTW, return status instead
of -1 when .read_file_info fails.
Link: https://syzkaller.appspot.com/bug?extid=f7af59df5d6b25f0febd
Link: https://lkml.kernel.org/r/20240904071004.2067695-1-joseph.qi@linux.alibaba.com
Fixes: 171bf93ce1 ("ocfs2: Periodic quota syncing")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Reported-by: syzbot+f7af59df5d6b25f0febd@syzkaller.appspotmail.com
Tested-by: syzbot+f7af59df5d6b25f0febd@syzkaller.appspotmail.com
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When doing cleanup, if flags does not contain OCFS2_BH_READAHEAD, it may
trigger a NULL pointer dereference in the following
ocfs2_set_buffer_uptodate() if bh is NULL.
Link: https://lkml.kernel.org/r/20240902023636.1843422-3-joseph.qi@linux.alibaba.com
Fixes: cf76c78595 ("ocfs2: don't put and assigning null to bh allocated outside")
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: Heming Zhao <heming.zhao@suse.com>
Suggested-by: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org> [4.20+]
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Misc fixes for ocfs2_read_blocks", v5.
This series contains 2 fixes for ocfs2_read_blocks(). The first patch
fixes the issue reported by syzbot, which detects bad unlock balance in
ocfs2_read_blocks(). The second patch fixes an issue reported by Heming
Zhao when reviewing the above fix.
This patch (of 2):
There was already a lock release before exiting, so remove the redundant unlock.
Link: https://lkml.kernel.org/r/20240902023636.1843422-1-joseph.qi@linux.alibaba.com
Link: https://lkml.kernel.org/r/20240902023636.1843422-2-joseph.qi@linux.alibaba.com
Fixes: cf76c78595 ("ocfs2: don't put and assigning null to bh allocated outside")
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: syzbot+ab134185af9ef88dfed5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ab134185af9ef88dfed5
Tested-by: syzbot+ab134185af9ef88dfed5@syzkaller.appspotmail.com
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org> [4.20+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During the mounting process, if journal_reset() fails because the journal
is too short, jbd2_journal_load() fails with a NULL j_sb_buffer.
Subsequently, ocfs2_journal_shutdown() calls
jbd2_journal_flush()->jbd2_cleanup_journal_tail()->
__jbd2_update_log_tail()->jbd2_journal_update_sb_log_tail()
->lock_buffer(journal->j_sb_buffer), resulting in a null-pointer
dereference error.
To resolve this issue, we should check the JBD2_LOADED flag to ensure the
journal was properly loaded. Additionally, use journal instead of
osb->journal directly to simplify the code.
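A hedged sketch of the guard described above (simplified; surrounding ocfs2
shutdown code elided):
```
/* Sketch: only go through jbd2 teardown paths that touch j_sb_buffer if
 * the journal was actually loaded; after a failed jbd2_journal_load(),
 * j_sb_buffer may be NULL. */
journal_t *journal = osb->journal->j_journal;

if (journal->j_flags & JBD2_LOADED)
	jbd2_journal_flush(journal, 0);
```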
Link: https://syzkaller.appspot.com/bug?extid=05b9b39d8bdfe1a0861f
Link: https://lkml.kernel.org/r/20240902030844.422725-1-sunjunchao2870@gmail.com
Fixes: f6f50e28f0 ("jbd2: Fail to load a journal if it is too short")
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Reported-by: syzbot+05b9b39d8bdfe1a0861f@syzkaller.appspotmail.com
Suggested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- fix ca->io_ref usage; analogous to previous patch doing that for main
discard path
- cond_resched() in __journal_keys_sort(), cutting down on "hung task"
warnings when journal is big
- rest of basic BCH_SB_MEMBER_INVALID support
- and the critical one: don't delete open files in online fsck, this was
causing the "dirent points to inode that doesn't point back"
inconsistencies some users were seeing
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmbe/BoACgkQE6szbY3K
bnbGuRAApGK3NNIlvMQkldLmDgHgeN/vuYHT4dx7XBquwS/R1T84WffIJklVOahD
KdrXhGsLH63kjqck3ngCe3DFDD7Fhirmx9syVHhLAkzGFkDEYGQuIzeDXIn+XOMe
kUTMhNgtSJL/eEc8zGmvyPIjtTrwoih2V0EeC1aW7h4tWtIsMk3q45aMX3yTlyqI
FnMmKFjzqnOIVjT8nBLuzDP97FG4w2foSuNiZWTYo7FLi8IhPrp95tr58NqM4jvd
99U5I3aOFHe9WcCDT1vgr0P5dmmSEwIKBCfvIlA8fbVlnZzjJqEjSKh+C4878KFv
FP51FOY/ZQVRE/+p8AQ82N1Zc3OTZ2488X6ajDt0Ir5EHMMmqiEXz5Zx9/7mmdta
egmiVX5OAVHgWR61xzTa6LKnGIjT0XE/lYJT9kc8iox9BBduQEx+iZ8OeRDAeObW
048K3jBhXST+hK91lbgj7/lvj3IWabPSyfPyzpe46aejS3N7b79bEvKanD7dH5Dy
KhdGuCKv2PXvlYbxI3rLPGUeL3InIe8TjvYa2ryl5qICSKhHjk7+8tvLeGWIXI55
rDglrYqw3s1IiGeg4QpKAB4YIeQfZn3g1WbfEs/H5GnoA7UDnQw9IkJb0/S39bEw
8OVYh52+USafMceqhwxbI4dfX7RcI00JBcVCZO5hcVu77MQA8G4=
=6PAk
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-09-09' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- fix ca->io_ref usage; analogous to previous patch doing that for main
discard path
- cond_resched() in __journal_keys_sort(), cutting down on "hung task"
warnings when journal is big
- rest of basic BCH_SB_MEMBER_INVALID support
- and the critical one: don't delete open files in online fsck, this
was causing the "dirent points to inode that doesn't point back"
inconsistencies some users were seeing
* tag 'bcachefs-2024-09-09' of git://evilpiepirate.org/bcachefs:
bcachefs: Don't delete open files in online fsck
bcachefs: fix btree_key_cache sysfs knob
bcachefs: More BCH_SB_MEMBER_INVALID support
bcachefs: Simplify bch2_bkey_drop_ptrs()
bcachefs: Add a cond_resched() to __journal_keys_sort()
bcachefs: Fix ca->io_ref usage
If we get a failure at the first decompressor init (i = 0),
the cleanup while loop could enter an infinite loop due to the wrong while
check. Check the value of i now to see if we need any cleanup at all.
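A hedged, generic sketch of the unwind pattern (hypothetical helper names,
not the exact erofs code):
```
/* Sketch: if init fails at index i, unwind only the i entries that were
 * successfully initialized; "while (i--)" correctly does nothing when
 * the failure happened at i == 0. */
int i, err = 0;

for (i = 0; i < nr_decompressors; i++) {
	err = init_decompressor(i);		/* hypothetical helper */
	if (err)
		break;
}
if (err)
	while (i--)
		exit_decompressor(i);		/* hypothetical helper */
```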
Fixes: 5a7cce827e ("erofs: refine z_erofs_{init,exit}_subsystem()")
Reported-by: liujinbao1 <liujinbao1@xiaomi.com>
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240905060027.2388893-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
After commit 684b290abc ("erofs: add support for
FS_IOC_GETFSSYSFSPATH"), `sb->s_sysfs_name` is now valid.
Just use it to get rid of duplicated logic.
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240828095232.571946-1-hsiangkao@linux.alibaba.com
Fast symlink can be used if the on-disk symlink data is stored
in the same block as the on-disk inode, so we don’t need to trigger
another I/O for symlink data. However, fs corruption could currently be
reported _incorrectly_ if inode xattrs are too large.
In fact, these should be valid images although they cannot be handled as
fast symlinks.
Many thanks to Colin for reporting this!
Reported-by: Colin Walters <walters@verbum.org>
Reported-by: https://honggfuzz.dev/
Link: https://lore.kernel.org/r/bb2dd430-7de0-47da-ae5b-82ab2dd4d945@app.fastmail.com
Fixes: 431339ba90 ("staging: erofs: add inode operations")
[ Note that it's a runtime misbehavior instead of a security issue. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240909031911.1174718-1-hsiangkao@linux.alibaba.com
The fcntl F_SETOWN command sets the process that handles SIGIO/SIGURG
for the related file descriptor. Before this change, the
file_set_fowner LSM hook was always called, ignoring the VFS logic which
may not actually change the process that handles SIGIO (e.g. TUN, TTY,
dnotify), nor update the related UID/EUID.
Moreover, because security_file_set_fowner() was called without lock
(e.g. f_owner.lock), concurrent F_SETOWN commands could result in a race
condition and inconsistent LSM states (e.g. SELinux's fown_sid) compared
to struct fown_struct's UID/EUID.
This change makes sure the LSM states are always in sync with the VFS
state by moving the security_file_set_fowner() call close to the
UID/EUID updates and using the same f_owner.lock.
Rename f_modown() to __f_setown() to simplify code.
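A hedged sketch of the resulting locking pattern (simplified; release of the
previous owner and other details elided):
```
/* Sketch: update the owner fields and call the LSM hook under the same
 * f_owner.lock write lock, so LSM state cannot race with the VFS state. */
write_lock_irq(&filp->f_owner.lock);
filp->f_owner.pid = get_pid(pid);
filp->f_owner.pid_type = type;
filp->f_owner.uid = current_uid();
filp->f_owner.euid = current_euid();
security_file_set_fowner(filp);
write_unlock_irq(&filp->f_owner.lock);
```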
Cc: stable@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jann Horn <jannh@google.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Add the __counted_by compiler attribute to the flexible array members
devs to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Increment nr_devs before adding a new device to the devs array and
adjust the array indexes accordingly. Add a helper macro for adding a
new device.
In bch2_journal_read(), explicitly set nr_devs to 0.
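A hedged illustration of the attribute (hypothetical struct, not the actual
bcachefs definition):
```
/* Sketch: __counted_by tells the compiler which member holds the element
 * count, so CONFIG_UBSAN_BOUNDS / CONFIG_FORTIFY_SOURCE can bounds-check
 * accesses to the flexible array. */
struct dev_list {
	unsigned int	nr_devs;
	unsigned int	devs[] __counted_by(nr_devs);
};

/* The count must be bumped before the new slot is written, e.g.: */
list->nr_devs++;
list->devs[list->nr_devs - 1] = new_dev;
```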
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We enable idmapped mounts for bcachefs. Here, we just pass down
the user_namespace argument from the VFS methods to the relevant
helpers.
The idmap test in bcachefs is as follows:
```
1. losetup /dev/loop1 bcachefs.img
2. ./bcachefs format /dev/loop1
3. mount -t bcachefs /dev/loop1 /mnt/bcachefs/
4. ./mount-idmapped --map-mount b:0:1000:1 /mnt/bcachefs /mnt/idmapped1/
ll /mnt/bcachefs
total 2
drwx------. 2 root root 0 Jun 14 14:10 lost+found
-rw-r--r--. 1 root root 1945 Jun 14 14:12 profile
ll /mnt/idmapped1/
total 2
drwx------. 2 1000 1000 0 Jun 14 14:10 lost+found
-rw-r--r--. 1 1000 1000 1945 Jun 14 14:12 profile
```
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add the __counted_by compiler attribute to the flexible array member
x_name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use jiffies macros instead of using jiffies directly to handle wraparound.
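A hedged sketch of the general idea (hypothetical timeout handling, not the
actual bcachefs code):
```
/* Sketch: raw comparisons of jiffies break across wraparound; the
 * time_after()/time_before() macros handle wraparound correctly. */
unsigned long timeout = jiffies + msecs_to_jiffies(100);

/* if (jiffies > timeout)           wrong across a jiffies wraparound */
if (time_after(jiffies, timeout))	/* correct */
	handle_timeout();		/* hypothetical */
```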
Signed-off-by: Chen Yufan <chenyufan@vivo.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bset_fix_lookup_table is too complicated to be easily understood,
and the comment "l now > where" there is also incorrect when where ==
t->end_offset. This patch therefore refactors the function; the idea is
that when where >= rw_aux_tree(b, t)[t->size - 1].offset, we don't need
to adjust the rw aux tree.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We rely on trans->locked to know if a trans has nodes locked, for
assertions about deadlocks; there can't be more than one locked trans
in the same process.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
folio_has_private() is an attractive nuisance; filesystem authors
generally don't realise that it actually checks two flags (one of which
is never set by bcachefs). There's no need to check the private flag at
all; for folios owned by bcachefs, we know that folio->private is NULL
when the private flag is clear and non-NULL when the private flag is set.
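A hedged sketch of the simplification (hypothetical state type and release
helper):
```
/* Sketch: for folios owned by the filesystem, folio->private is non-NULL
 * exactly when the private flag is set, so checking the pointer directly
 * replaces folio_has_private(), which also tests an unrelated flag. */
struct fs_folio_state *s = folio->private;	/* hypothetical type */

if (s)			/* instead of: if (folio_has_private(folio)) */
	release_folio_state(folio, s);		/* hypothetical helper */
```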
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's really not needed: the only locks used here are the btree cache
lock, which we drop for GFP_WAIT allocations, and btree node locks - but
we also drop those for GFP_WAIT allocations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use helper functions to make code more readable.
Similar to commit a5488f2983 ("fs: simplify ->listxattr() implementation")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove struct nop_posix_acl_{access,default} for the bcachefs filesystem,
which doesn't depend on the xattr handler in its inode->i_op->listxattr()
method in any way. There's nothing more to do than to simply remove the
handler. It's been effectively unused ever since we introduced the new
posix acl api. See [1] for details.
Link [1]: https://patchwork.kernel.org/project/linux-fsdevel/cover/20230125-fs-acl-remove-generic-xattr-handlers-v3-0-f760cc58967d@kernel.org/
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
After reducing the search range when building the aux tree, the prev array
stuff is no longer useful, so remove it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When the search key's mantissa is larger than the node i's, we know that
the search key is larger than the first key of the cacheline corresponding
to node i, so that when we are calculating the mantissa of right side
nodes of node i, the left side of the search range can be the first key
of node i. Once the search range is minimized, the mantissa we are
calculating can have more useful bits, thus reducing the slow path
comparisons. Besides, we can now remove all the prev array stuff.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch replaces the open-coded extra computation with eytzinger1_extra.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This logic is no longer useful since commit
3ce8b463e3 ("bcachefs: kill bset_tree->max_key"), so remove it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The idx parameter of bkey_mantissa_bits_dropped is unused, remove it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The idx parameter of bkey_mantissa became unused since commit
b904a79918 ("bcachefs: Go back to 16 bit mantissa bkey floats"),
so remove it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the macro definition of bkey_crc_next, five parameters
were accepted, but only four of them were used. Let's remove
the unused one.
The patch has only passed compilation tests, but it should be fine.
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The macro allocate_dropping_locks accepts a parameter _trans,
but it was not used; instead the variable trans was used directly,
which may be a local variable inside a function that calls the macro.
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The macro allocate_dropping_locks_errocode accepts a parameter _trans,
but it was not used; instead the variable trans was used directly,
which may be a local variable inside a function that calls the macro.
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The macro bch2_kthread_wait_event_ioclock_timeout is no longer used;
let's remove it.
The patch has passed compilation tests.
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Same as the recent change for __bch2_read(); also, kill the now
unnecessary btree_trans_too_many_iters() calls.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a file is unlinked but still open, we don't want online fsck to
delete it - or fun inconsistencies will happen.
https://github.com/koverstreet/bcachefs/issues/727
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
After commit 230e9fc286 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9 ("kmemcg:
account for certain kmem allocations to memcg")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add the __counted_by compiler attribute to the flexible array member
bucket to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When building for a 32-bit architecture, for which 'size_t' is
'unsigned int', there is a compiler warning due to use of '%lu':
In file included from fs/bcachefs/vstructs.h:5,
from fs/bcachefs/bcachefs_format.h:80,
from fs/bcachefs/bcachefs.h:207,
from fs/bcachefs/btree_key_cache.c:3:
fs/bcachefs/btree_key_cache.c: In function 'bch2_btree_key_cache_to_text':
fs/bcachefs/btree_key_cache.c:795:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
795 | prt_printf(out, "pending:\t%lu\r\n", per_cpu_sum(bc->nr_pending));
| ^~~~~~~~~~~~~~~~~~~
fs/bcachefs/util.h:78:63: note: in definition of macro 'prt_printf'
78 | #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__)
| ^~~~~~~~~~~
fs/bcachefs/btree_key_cache.c:795:38: note: format string is defined here
795 | prt_printf(out, "pending:\t%lu\r\n", per_cpu_sum(bc->nr_pending));
| ~~^
| |
| long unsigned int
| %u
cc1: all warnings being treated as errors
Use the proper specifier, '%zu', to resolve the warning.
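A minimal standalone example of the portability issue, independent of the
bcachefs code:
```
#include <stdio.h>
#include <stddef.h>

int main(void)
{
	size_t pending = 42;

	/* "%lu" only happens to work where size_t is unsigned long;
	 * "%zu" is the portable conversion specifier for size_t. */
	printf("pending:\t%zu\n", pending);
	return 0;
}
```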
Fixes: e447e49977 ("bcachefs: key cache can now allocate from pending")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_trans objects can hold the btree_trans_barrier srcu read lock for
an extended amount of time (they shouldn't, but it's difficult to
guarantee).
The srcu barrier blocks memory reclaim, so to avoid too many stranded
key cache items, this uses the new pending_rcu_items to allocate from
pending items - like we did before, but now without a global lock on the
key cache.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Generic data structure for explicitly tracking pending RCU items,
allowing items to be dequeued (i.e. allocate from items pending
freeing). Works with conventional RCU and SRCU, and possibly other RCU
flavors in the future, meaning this can serve as a more generic
replacement for SLAB_TYPESAFE_BY_RCU.
Pending items are tracked in radix trees; if memory allocation fails, we
fall back to linked lists.
An rcu_pending is initialized with a callback, which is invoked when
pending items' grace periods have expired. Two types of callback
processing are handled specially:
- RCU_PENDING_KVFREE_FN
New backend for kvfree_rcu(). Slightly faster, and eliminates the
synchronize_rcu() slowpath in kvfree_rcu_mightsleep() - instead, an
rcu_head is allocated if we don't have one and can't use the radix
tree
TODO:
- add a shrinker (as in the existing kvfree_rcu implementation) so that
memory reclaim can free expired objects if callback processing isn't
keeping up, and to expedite a grace period if we're under memory
pressure and too much memory is stranded by RCU
- add a counter for amount of memory pending
- RCU_PENDING_CALL_RCU_FN
Accelerated backend for call_rcu() - pending callbacks are tracked in
a radix tree to eliminate linked list overhead.
These are intended to serve as replacement backends for kvfree_rcu() and
call_rcu(); they may be of interest to other uses (e.g.
SLAB_TYPESAFE_BY_RCU users).
Note:
Internally, we're using a single rearming call_rcu() callback for
notifications from the core RCU subsystem when objects are ready to be
processed.
Ideally we would be getting a callback every time a grace period
completes for which we have objects, but that would require multiple
rcu_heads in flight, and since the number of gp sequence numbers with
uncompleted callbacks is not bounded, we can't do that yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't call __wait_on_freeing_inode() with btree locks held; we're
waiting on another thread that's in evict(), and before it clears that
bit it needs to write that inode to flush timestamps - deadlock.
Fixing this involves a fair amount of re-jiggering to plumb a new
transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The standard VFS inode hash table suffers from painful lock contention -
this is long overdue.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs is switching to an rhashtable for vfs inodes instead of the
standard inode.c hashtable, so we need this exported - or rather, a static
inline makes more sense for a single atomic_inc().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Bcachefs often uses this function to divide by nanosecond times - which
can easily cause problems when cast to u32. For example, `cat
/sys/fs/bcachefs/*/internal/rebalance_status` would return invalid data
in the `duration waited` field because dividing by the number of
nanoseconds in a minute requires the divisor parameter to be u64.
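A hedged sketch of the distinction between the two kernel division helpers
(elapsed_ns is a hypothetical u64 value):
```
/* Sketch: div_u64() takes a u32 divisor, so a divisor larger than U32_MAX
 * (such as the number of nanoseconds in a minute) is silently truncated;
 * div64_u64() takes a full u64 divisor. */
u64 ns_per_minute = 60ULL * NSEC_PER_SEC;	/* 60,000,000,000 > U32_MAX */

u64 bad  = div_u64(elapsed_ns, ns_per_minute);		/* divisor truncated */
u64 good = div64_u64(elapsed_ns, ns_per_minute);	/* correct */
```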
Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
cat /sys/fs/bcachefs/*/internal/rebalance_status
waiting
io wait duration: 13.5 GiB
io wait remaining: 627 MiB
duration waited: 1392 m
duration waited was increasing at a rate of about 14 times the expected
rate.
div_u64 takes a u32 divisor, but u->nsecs (from time_units[]) can be
bigger than u32.
Signed-off-by: Feiko Nanninga <feiko.nanninga@fnanninga.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes two problems in the handling of negative times:
• rem is signed, but the rem * c->sb.nsec_per_time_unit operation
produced a bogus unsigned result, because s32 * u32 = u32.
• The timespec was not normalized (it could contain more than a
billion nanoseconds).
For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).
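A minimal standalone example of the sign-promotion problem described above:
```
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int32_t rem = -3;		/* signed remainder */
	uint32_t nsec_per_unit = 250;

	/* s32 * u32 is evaluated as u32: the negative value wraps around. */
	uint32_t bogus = rem * nsec_per_unit;
	/* Widening to a signed 64-bit type first preserves the sign. */
	int64_t ok = (int64_t)rem * nsec_per_unit;

	printf("bogus = %u, ok = %lld\n", bogus, (long long)ok);
	return 0;
}
```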
Cc: stable@vger.kernel.org
Fixes: 595c1e9bab ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd@e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Co-developed-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bkey_drop_ptrs() had some complicated machinery for avoiding
O(n^2) when dropping multiple pointers - but when n is only going to be
~4, it's not worth it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Without this, we'd potentially sort multiple times without a
cond_resched(), leading to hung task warnings on larger systems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
ca->io_ref does not protect against the filesystem going away,
c->write_ref does. Much like
0b50b7313e bcachefs: Fix refcounting in discard path
the other async paths need fixing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Store the cookie to detect concurrent seeks on directories in
file->private_data.
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-11-6d3e4816aa7b@kernel.org
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is similar to generic_file_llseek() but allows the caller to
specify a cookie that will be updated to indicate that a seek happened.
Callers requiring that information in their readdir implementations can
use that.
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-8-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Make generic_file_llseek_size() use must_set_pos(). We'll use
must_set_pos() in another helper in a minute. Remove __maybe_unused
from must_set_pos() now that it's used.
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-7-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new must_set_pos() helper. We will use it in follow-up patches.
Temporarily mark it as unused. This is only done to keep the diff small
and reviewable.
Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-6-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The word 'deferences' here is confusing.
Maybe the comment wants to say that because show_fd_locks() does not
dereference the files pointer, using the stale value of the files
pointer is safe.
Correctly spelled comments make it easier for the reader to understand
the code.
Replace 'deferences' with 'dereferences' and 'inialized' with
'initialized' in the comments.
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Link: https://lore.kernel.org/r/20240909063353.2246419-1-yanzhen@vivo.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
There are several comments all over the place which use a wrong singular
form of jiffies.
Replace 'jiffie' with 'jiffy'. No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
Some overlayfs features require permission to read/write trusted.*
xattrs. These include redirect_dir, verity, metacopy, and data-only
layers. This patch adds additional validations at mount time to stop
overlays from mounting in certain cases where the resulting mount would
not function according to the user's expectations because they lack
permission to access trusted.* xattrs (for example, not global root.)
Similar checks in ovl_make_workdir() that disable features instead of
failing are still relevant and used in cases where the resulting mount
can still work "reasonably well." Generally, if the feature was enabled
through kernel config or module option, any mount that worked before
will still work the same; this applies to redirect_dir and metacopy. The
user must explicitly request these features in order to generate a mount
failure. Verity and data-only layers on the other hand must be explicitly
requested and have no "reasonable" disabled or degraded alternative, so
mounts attempting either always fail.
"lower data-only dirs require metacopy support" moved down in case
userxattr is set, which disables metacopy.
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Mike Baynton <mike@mbaynton.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmbbVKkACgkQiiy9cAdy
T1FTUgv8C/Qek0abESCC9AEvKUiAGwabOcdvKQnpCjI3eLQVmwGIHXXPdnkgxJmL
gUQm4CBj6jWw5OfhBw2BTvnVz9YahQC8Xbg0XfLomaggD8NxVFnQyiWyyjPJtIiQ
JRhOqV82Ko2NFMpouwfNTLPLMBpjNp6IrvkAY2bH5vUzPmoC/aU+eQMVXMqTFalD
Q+vV2cFBcMsTTsRFCMG0er8114A1XvyG4IKr/95bTDjn/wnOVX9sUGrMbNXuoCsj
yzMAkBoc60k2PjGoYMIQJsVDFryz7TpF7wyS2Oo5EkqzR/GKcIYGxTn0AznVhs83
5mAPXgyqpxg3wAsIVAs+vj0Jo2/cfpWuLb9pR5kt3lNA5EH7D1DNzXcHSe8GPvC6
iwrFI0RnR59HbDh1UGOSoVZv/W9cwmam6WG5HpS7YcRYocZqZyv+XjxUTlj2r+nV
12v9nnAWkH2Ub6kf3WHPzeXS3L6mvucody8b01UUL+j8hqWKN67sbXzH0Y2Nv0tv
KFgbJCSk
=CntT
-----END PGP SIGNATURE-----
Merge tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix potential mount hang
- fix retry problem in two types of compound operations
- important netfs integration fix in SMB1 read paths
- fix potential uninitialized zero point of inode
- minor patch to improve debugging for potential crediting problems
* tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
netfs, cifs: Improve some debugging bits
cifs: Fix SMB1 readv/writev callback in the same way as SMB2/3
cifs: Fix zero_point init on inode initialisation
smb: client: fix double put of @cfile in smb2_set_path_size()
smb: client: fix double put of @cfile in smb2_rename_path()
smb: client: fix hang in wait_for_response() for negproto