- submit_bh() can never return an error, so change it to return void,
and remove the unused checks from its callers
- fix I_DIRTY_TIME handling so it will be set even if the inode
already has I_DIRTY_INODE
Performance:
- Always enable i_version counter (as btrfs and xfs already do).
Remove some uneeded i_version bumps to avoid unnecessary nfs cache
invalidations.
- Wake up journal waters in FIFO order, to avoid some journal users
from not getting a journal handle for an unfairly long time.
- In ext4_write_begin() allocate any necessary buffer heads before
starting the journal handle.
- Don't try to prefetch the block allocation bitmaps for a read-only
file system.
Bug Fixes:
- Fix a number of fast commit bugs, including resources leaks and out
of bound references in various error handling paths and/or if the fast
commit log is corrupted.
- Avoid stopping the online resize early when expanding a file system
which is less than 16TiB to a size greater than 16TiB.
- Fix apparent metadata corruption caused by a race with a metadata
buffer head getting migrated while it was trying to be read.
- Mark the lazy initialization thread freezable to prevent suspend
failures.
- Other miscellaneous bug fixes.
Cleanups:
- Break up the incredibly long ext4_full_super() function by
refactoring to move code into more understandable, smaller
functions.
- Remove the deprecated (and ignored) noacl and nouser_attr mount
option.
- Factor out some common code in fast commit handling.
- Other miscellaneous cleanups.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmM8/2gACgkQ8vlZVpUN
gaPohAf9GDMUq3QIYoWLlJ+ygJhL0xQGPfC6sypMjHaUO5GSo+1+sAMU3JBftxUS
LrgTtmzSKzwp9PyOHNs+mswUzhLZivKVCLMmOznQUZS228GSVKProhN1LPL4UP2Q
Ks8i1M5XTWS+mtJ5J5Mw6jRHxcjfT6ynyJKPnIWKTwXyeru1WSJ2PWqtWQD4EZkE
lImECy0jX/zlK02s0jDYbNIbXIvI/TTYi7wT8o1ouLCAXMDv5gJRc5TXCVtX8i59
/Pl9rGG/+IWTnYT/aQ668S2g0Cz6Wyv2EkmiPUW0Y8NoLaaouBYZoC2hDujiv+l1
ucEI14TEQ+DojJTdChrtwKqgZfqDOw==
=xoLC
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"The first two changes involve files outside of fs/ext4:
- submit_bh() can never return an error, so change it to return void,
and remove the unused checks from its callers
- fix I_DIRTY_TIME handling so it will be set even if the inode
already has I_DIRTY_INODE
Performance:
- Always enable i_version counter (as btrfs and xfs already do).
Remove some uneeded i_version bumps to avoid unnecessary nfs cache
invalidations
- Wake up journal waiters in FIFO order, to avoid some journal users
from not getting a journal handle for an unfairly long time
- In ext4_write_begin() allocate any necessary buffer heads before
starting the journal handle
- Don't try to prefetch the block allocation bitmaps for a read-only
file system
Bug Fixes:
- Fix a number of fast commit bugs, including resources leaks and out
of bound references in various error handling paths and/or if the
fast commit log is corrupted
- Avoid stopping the online resize early when expanding a file system
which is less than 16TiB to a size greater than 16TiB
- Fix apparent metadata corruption caused by a race with a metadata
buffer head getting migrated while it was trying to be read
- Mark the lazy initialization thread freezable to prevent suspend
failures
- Other miscellaneous bug fixes
Cleanups:
- Break up the incredibly long ext4_full_super() function by
refactoring to move code into more understandable, smaller
functions
- Remove the deprecated (and ignored) noacl and nouser_attr mount
option
- Factor out some common code in fast commit handling
- Other miscellaneous cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (53 commits)
ext4: fix potential out of bound read in ext4_fc_replay_scan()
ext4: factor out ext4_fc_get_tl()
ext4: introduce EXT4_FC_TAG_BASE_LEN helper
ext4: factor out ext4_free_ext_path()
ext4: remove unnecessary drop path references in mext_check_coverage()
ext4: update 'state->fc_regions_size' after successful memory allocation
ext4: fix potential memory leak in ext4_fc_record_regions()
ext4: fix potential memory leak in ext4_fc_record_modified_inode()
ext4: remove redundant checking in ext4_ioctl_checkpoint
jbd2: add miss release buffer head in fc_do_one_pass()
ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts()
ext4: remove useless local variable 'blocksize'
ext4: unify the ext4 super block loading operation
ext4: factor out ext4_journal_data_mode_check()
ext4: factor out ext4_load_and_init_journal()
ext4: factor out ext4_group_desc_init() and ext4_group_desc_free()
ext4: factor out ext4_geometry_check()
ext4: factor out ext4_check_feature_compatibility()
ext4: factor out ext4_init_metadata_csum()
ext4: factor out ext4_encoding_init()
...
ext4 currently updates the i_version counter when the atime is updated
during a read. This is less than ideal as it can cause unnecessary cache
invalidations with NFSv4 and unnecessary remeasurements for IMA.
The increment in ext4_mark_iloc_dirty is also problematic since it can
corrupt the i_version counter for ea_inodes. We aren't bumping the file
times in ext4_mark_iloc_dirty, so changing the i_version there seems
wrong, and is the cause of both problems.
Remove that callsite and add increments to the setattr, setxattr and
ioctl codepaths, at the same times that we update the ctime. The
i_version bump that already happens during timestamp updates should take
care of the rest.
In ext4_move_extents, increment the i_version on both inodes, and also
add in missing ctime updates.
[ Some minor updates since we've already enabled the i_version counter
unconditionally already via another patch series. -- TYT ]
Cc: stable@kernel.org
Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20220908172448.208585-3-jlayton@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In our product environment, we encounter some jbd hung waiting handles to
stop while several writters were doing memory reclaim for buffer head
allocation in delay alloc write path. Ext4 do buffer head allocation with
holding transaction handle which may be blocked too long if the reclaim
works not so smooth. According to our bcc trace, the reclaim time in
buffer head allocation can reach 258s and the jbd transaction commit also
take almost the same time meanwhile. Except for these extreme cases,
we often see several seconds delays for cgroup memory reclaim on our
servers. This is more likely to happen considering docker environment.
One thing to note, the allocation of buffer heads is as often as page
allocation or more often when blocksize less than page size. Just like
page cache allocation, we should also place the buffer head allocation
before startting the handle.
Cc: stable@kernel.org
Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Link: https://lore.kernel.org/r/20220903012429.22555-1-hanjinke.666@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The original i_version implementation was pretty expensive, requiring a
log flush on every change. Because of this, it was gated behind a mount
option (implemented via the MS_I_VERSION mountoption flag).
Commit ae5e165d85 (fs: new API for handling inode->i_version) made the
i_version flag much less expensive, so there is no longer a performance
penalty from enabling it. xfs and btrfs already enable it
unconditionally when the on-disk format can support it.
Have ext4 ignore the SB_I_VERSION flag, and just enable it
unconditionally. While we're in here, mark the i_version mount
option Opt_removed.
[ Removed leftover bits of i_version from ext4_apply_options() since it
now can't ever be set in ctx->mask_s_flags -- lczerner ]
Cc: stable@kernel.org
Cc: Dave Chinner <david@fromorbit.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220824160349.39664-3-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ea_inodes are using i_version for storing part of the reference count so
we really need to leave it alone.
The problem can be reproduced by xfstest ext4/026 when iversion is
enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
inodes in ext4_mark_iloc_dirty().
Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220824160349.39664-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add support for STATX_DIOALIGN to ext4, so that direct I/O alignment
restrictions are exposed to userspace in a generic way.
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220827065851.135710-5-ebiggers@kernel.org
superblock and improved the performance of the online resizing of file
systems with bigalloc enabled. Fixed a lot of bugs, in particular for
the inline data feature, potential races when creating and deleting
inodes with shared extended attribute blocks, and the handling
directory blocks which are corrupted.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmLrRpEACgkQ8vlZVpUN
gaN2OAf/a2lxhZ6DvkTfWjni0BG6i2ajd11lT93N+we0wWk9f0VdNR2JUdTum0HL
UtwP48km+FuqkbDrODlbSky5V+IZhd90ihLyPbPKSU/52c7d6IxNOCz2Fxq981j2
Ik6QgdegvCaUDHmluJfYYcS5Pa97HXtSb6VVi1RAvHHFbYDSChObs76ZQWBmhsSh
Mo84mFGS7BDIVNVkg4PBMx4b3iFvKfE1AUdfA5dhB4GXHgDA+77GByw+RjdQ6Dh/
W0l5AVAXbK7BYSVX6Cg41WUMYOBu58Hrh/CHL1DWv3khvjgxLqM7ERAFOISVI3Ax
vCXPXfjpbTFElUQuOw4m33vixaFU+A==
=xTsM
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Add new ioctls to set and get the file system UUID in the ext4
superblock and improved the performance of the online resizing of file
systems with bigalloc enabled.
Fixed a lot of bugs, in particular for the inline data feature,
potential races when creating and deleting inodes with shared extended
attribute blocks, and the handling of directory blocks which are
corrupted"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (37 commits)
ext4: add ioctls to get/set the ext4 superblock uuid
ext4: avoid resizing to a partial cluster size
ext4: reduce computation of overhead during resize
jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal aborted
ext4: block range must be validated before use in ext4_mb_clear_bb()
mbcache: automatically delete entries from cache on freeing
mbcache: Remove mb_cache_entry_delete()
ext2: avoid deleting xattr block that is being reused
ext2: unindent codeblock in ext2_xattr_set()
ext2: factor our freeing of xattr block reference
ext4: fix race when reusing xattr blocks
ext4: unindent codeblock in ext4_xattr_block_set()
ext4: remove EA inode entry from mbcache on inode eviction
mbcache: add functions to delete entry if unused
mbcache: don't reclaim used entries
ext4: make sure ext4_append() always allocates new block
ext4: check if directory block is within i_size
ext4: reflect mb_optimize_scan value in options file
ext4: avoid remove directory when directory is corrupted
ext4: aligned '*' in comments
...
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into their
own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
/58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
=DgUi
-----END PGP SIGNATURE-----
Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
Currently we remove EA inode from mbcache as soon as its xattr refcount
drops to zero. However there can be pending attempts to reuse the inode
and thus refcount handling code has to handle the situation when
refcount increases from zero anyway. So save some work and just keep EA
inode in mbcache until it is getting evicted. At that moment we are sure
following iget() of EA inode will fail anyway (or wait for eviction to
finish and load things from the disk again) and so removing mbcache
entry at that moment is fine and simplifies the code a bit.
CC: stable@vger.kernel.org
Fixes: 82939d7999 ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220712105436.32204-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Use the EXT4_INODE_HAS_XATTR_SPACE macro to more accurately
determine whether the inode have xattr space.
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220616021358.2504451-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A race can occur in the unlikely event ext4 is unable to allocate a
physical cluster for a delayed allocation in a bigalloc file system
during writeback. Failure to allocate a cluster forces error recovery
that includes a call to mpage_release_unused_pages(). That function
removes any corresponding delayed allocated blocks from the extent
status tree. If a new delayed write is in progress on the same cluster
simultaneously, resulting in the addition of an new extent containing
one or more blocks in that cluster to the extent status tree, delayed
block accounting can be thrown off if that delayed write then encounters
a similar cluster allocation failure during future writeback.
Write lock the i_data_sem in mpage_release_unused_pages() to fix this
problem. Ext4's block/cluster accounting code for bigalloc relies on
i_data_sem for mutual exclusion, as is found in the delayed write path,
and the locking in mpage_release_unused_pages() is missing.
Cc: stable@kernel.org
Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20220615160530.1928801-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We use jbd_debug() in some places in ext4. It seems a bit strange to use
jbd2 debugging output function for ext4 code. Also these days
ext4_debug() uses dynamic printk so each debug message can be enabled /
disabled on its own so the time when it made some sense to have these
combined (to allow easier common selecting of messages to report) has
passed. Just convert all jbd_debug() uses in ext4 to ext4_debug().
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
declarations to buffer.h and switch all filesystems that have wired
them up.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The called functions all use pages, so just convert back to a page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
If the folio is large, it may overlap the beginning or end of the
unused range. If it does, we need to avoid invalidating it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Now that we introduced new infrastructure to increase the type safety
for filesystems supporting idmapped mounts port the first part of the
vfs over to them.
This ports the attribute changes codepaths to rely on the new better
helpers using a dedicated type.
Before this change we used to take a shortcut and place the actual
values that would be written to inode->i_{g,u}id into struct iattr. This
had the advantage that we moved idmappings mostly out of the picture
early on but it made reasoning about changes more difficult than it
should be.
The filesystem was never explicitly told that it dealt with an idmapped
mount. The transition to the value that needed to be stored in
inode->i_{g,u}id appeared way too early and increased the probability of
bugs in various codepaths.
We know place the same value in struct iattr no matter if this is an
idmapped mount or not. The vfs will only deal with type safe
vfs{g,u}id_t. This makes it massively safer to perform permission checks
as the type will tell us what checks we need to perform and what helpers
we need to use.
Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
inode->i_{g,u}id since they are different types. Instead they need to
use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
vfs{g,u}id into the filesystem.
The other nice effect is that filesystems like overlayfs don't need to
care about idmappings explicitly anymore and can simply set up struct
iattr accordingly directly.
Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Port the is_quota_modification() and dqout_transfer() helper to type
safe vfs{g,u}id_t. Since these helpers are only called by a few
filesystems don't introduce a new helper but simply extend the existing
helpers to pass down the mount's idmapping.
Note, that this is a non-functional change, i.e. nothing will have
happened here or at the end of this series to how quota are done! This
a change necessary because we will at the end of this series make
ownership changes easier to reason about by keeping the original value
in struct iattr for both non-idmapped and idmapped mounts.
For now we always pass the initial idmapping which makes the idmapping
functions these helpers call nops.
This is done because we currently always pass the actual value to be
written to i_{g,u}id via struct iattr. While this allowed us to treat
the {g,u}id values in struct iattr as values that can be directly
written to inode->i_{g,u}id it also increases the potential for
confusion for filesystems.
Now that we are have dedicated types to prevent this confusion we will
ultimately only map the value from the idmapped mount into a filesystem
value that can be written to inode->i_{g,u}id when the filesystem
actually updates the inode. So pass down the initial idmapping until we
finished that conversion at which point we pass down the mount's
idmapping.
Since struct iattr uses an anonymous union with overlapping types as
supported by the C standard, filesystems that haven't converted to
ia_vfs{g,u}id won't see any difference and things will continue to work
as before. In other words, no functional changes intended with this
change.
Link: https://lore.kernel.org/r/20220621141454.2914719-7-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Earlier we introduced new helpers to abstract ownership update and
remove code duplication. This converts all filesystems supporting
idmapped mounts to make use of these new helpers.
For now we always pass the initial idmapping which makes the idmapping
functions these helpers call nops.
This is done because we currently always pass the actual value to be
written to i_{g,u}id via struct iattr. While this allowed us to treat
the {g,u}id values in struct iattr as values that can be directly
written to inode->i_{g,u}id it also increases the potential for
confusion for filesystems.
Now that we are have dedicated types to prevent this confusion we will
ultimately only map the value from the idmapped mount into a filesystem
value that can be written to inode->i_{g,u}id when the filesystem
actually updates the inode. So pass down the initial idmapping until we
finished that conversion at which point we pass down the mount's
idmapping.
No functional changes intended.
Link: https://lore.kernel.org/r/20220621141454.2914719-6-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
When delayed allocation is disabled (either through mount option or
because we are running low on free space), ext4_write_begin() allocates
blocks with EXT4_GET_BLOCKS_IO_CREATE_EXT flag. With this flag extent
merging is disabled and since ext4_write_begin() is called for each page
separately, we end up with a *lot* of 1 block extents in the extent tree
and following writeback is writing 1 block at a time which results in
very poor write throughput (4 MB/s instead of 200 MB/s). These days when
ext4_get_block_unwritten() is used only by ext4_write_begin(),
ext4_page_mkwrite() and inline data conversion, we can safely allow
extent merging to happen from these paths since following writeback will
happen on different boundaries anyway. So use
EXT4_GET_BLOCKS_CREATE_UNRIT_EXT instead which restores the performance.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220520111402.4252-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first argument
like ->read_folio
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp
gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O
7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj
xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d
nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26
bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX
WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A==
=djLv
-----END PGP SIGNATURE-----
Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache
Pull page cache updates from Matthew Wilcox:
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first
argument like ->read_folio
* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
nilfs2: Fix some kernel-doc comments
Appoint myself page cache maintainer
fs: Remove aops->freepage
secretmem: Convert to free_folio
nfs: Convert to free_folio
orangefs: Convert to free_folio
fs: Add free_folio address space operation
fs: Convert drop_buffers() to use a folio
fs: Change try_to_free_buffers() to take a folio
jbd2: Convert release_buffer_page() to use a folio
jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
reiserfs: Convert release_buffer_page() to use a folio
fs: Remove last vestiges of releasepage
ubifs: Convert to release_folio
reiserfs: Convert to release_folio
orangefs: Convert to release_folio
ocfs2: Convert to release_folio
nilfs2: Remove comment about releasepage
nfs: Convert to release_folio
jfs: Convert to release_folio
...
Fix following includecheck warning:
./fs/ext4/inode.c: linux/dax.h is included more than once.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220504225025.44753-1-yang.lee@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Hulk Robot reported a BUG_ON:
==================================================================
EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0,
block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters
kernel BUG at fs/ext4/ext4_jbd2.c:53!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1
RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline]
RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116
[...]
Call Trace:
ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795
generic_perform_write+0x279/0x3c0 mm/filemap.c:3344
ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270
ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520
do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732
do_iter_write+0x107/0x430 fs/read_write.c:861
vfs_writev fs/read_write.c:934 [inline]
do_pwritev+0x1e5/0x380 fs/read_write.c:1031
[...]
==================================================================
Above issue may happen as follows:
cpu1 cpu2
__________________________|__________________________
do_pwritev
vfs_writev
do_iter_write
ext4_file_write_iter
ext4_buffered_write_iter
generic_perform_write
ext4_da_write_begin
vfs_fallocate
ext4_fallocate
ext4_convert_inline_data
ext4_convert_inline_data_nolock
ext4_destroy_inline_data_nolock
clear EXT4_STATE_MAY_INLINE_DATA
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_regular_allocator
ext4_mb_good_group_nolock
ext4_mb_init_group
ext4_mb_init_cache
ext4_mb_generate_buddy --> error
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_restore_inline_data
set EXT4_STATE_MAY_INLINE_DATA
ext4_block_write_begin
ext4_da_write_end
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_write_inline_data_end
handle=NULL
ext4_journal_stop(handle)
__ext4_journal_stop
ext4_put_nojournal(handle)
ref_cnt = (unsigned long)handle
BUG_ON(ref_cnt == 0) ---> BUG_ON
The lock held by ext4_convert_inline_data is xattr_sem, but the lock
held by generic_perform_write is i_rwsem. Therefore, the two locks can
be concurrent.
To solve above issue, we add inode_lock() for ext4_convert_inline_data().
At the same time, move ext4_convert_inline_data() in front of
ext4_punch_hole(), remove similar handling from ext4_punch_hole().
Fixes: 0c8d414f16 ("ext4: let fallocate handle inline data correctly")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Symlink's external data block is one kind of metadata block, and now
that almost all ext4 metadata block's page cache (e.g. directory blocks,
quota blocks...) belongs to bdev backing inode except the symlink. It
is essentially worked in data=journal mode like other regular file's
data block because probably in order to make it simple for generic VFS
code handling symlinks or some other historical reasons, but the logic
of creating external data block in ext4_symlink() is complicated. and it
also make things confused if user do not want to let the filesystem
worked in data=journal mode. This patch convert the final exceptional
case and make things clean, move the mapping of the symlink's external
data block to bdev like any other metadata block does.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-3-yi.zhang@huawei.com
Current ext4_getblk() might sleep if some resources are not valid or
could be race with a concurrent extents modifing procedure. So we
cannot call ext4_getblk() and ext4_map_blocks() to get map blocks in
the atomic context in some fast path (e.g. the upcoming procedure of
getting symlink external block in the RCU context), even if the map
extents have already been check and cached.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-2-yi.zhang@huawei.com
All but two of the callers already have a folio; pass a folio into
try_to_free_buffers(). This removes the last user of cancel_dirty_page()
so remove that wrapper function too.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Also convert it to return a bool since it's called from release_folio().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
The use of folios should be pushed deeper into ext4 from here.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
There are no more aop flags left, so remove the parameter.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There are no more aop flags left, so remove the parameter.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Instead of setting AOP_FLAG_NOFS, use memalloc_nofs_save() and
memalloc_nofs_restore() to prevent GFP_FS allocations recursing
into the filesystem with a journal already started.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Instead of setting AOP_FLAG_NOFS, use memalloc_nofs_save() and
memalloc_nofs_restore() to prevent GFP_FS allocations recursing
into the filesystem with a journal already started.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
injection testing. Change ext4's fallocate to update consistently
drop set[ug]id bits when an fallocate operation might possibly change
the user-visible contents of a file. Also, improve handling of
potentially invalid values in the the s_overhead_cluster superblock
field to avoid ext4 returning a negative number of free blocks.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmJinf8ACgkQ8vlZVpUN
gaOHgQf+MKgUZgteYogLzoP3mF1kSycOGawk4wZ3QHOLz7AvsV2p9J8BWihbS/EK
dBydfXbTMvCUrjWmpqb5dHECRzxdfxOJ0SPJtibc8DZaJc9ImNFmgSp9kyJ3uRaN
cPGO6Lz2RXpdumVMPPLwzUJdVyrLi0K6I1NYSocxKgribePzd+xil8S9zRZj8Bpe
RaeH0EytcRj2CI5qs5mI/mOPBAMsZeczd3HInI3gyCgP2I4ZOfsADne3APx57mcI
IGKf77nvIwMHeKel3MGYfFPitEs5cZpHUhHplCMtgFsO8H0IR93tqnlaCvTM7VAZ
Slamgl7pfcXFcLZP+pm0QL/82ub7iw==
=FIds
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix some syzbot-detected bugs, as well as other bugs found by I/O
injection testing.
Change ext4's fallocate to consistently drop set[ug]id bits when an
fallocate operation might possibly change the user-visible contents of
a file.
Also, improve handling of potentially invalid values in the the
s_overhead_cluster superblock field to avoid ext4 returning a negative
number of free blocks"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: fix a potential race while discarding reserved buffers after an abort
ext4: update the cached overhead value in the superblock
ext4: force overhead calculation if the s_overhead_cluster makes no sense
ext4: fix overhead calculation to account for the reserved gdt blocks
ext4, doc: fix incorrect h_reserved size
ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
ext4: fix use-after-free in ext4_search_dir
ext4: fix bug_on in start_this_handle during umount filesystem
ext4: fix symlink file size not match to file content
ext4: fix fallocate to use file_modified to update permissions consistently
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files. This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero, collapse, insert range). Because the call can be used to change
file contents, we should treat it like we do any other modification to a
file -- update the mtime, and drop set[ug]id privileges/capabilities.
The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20220308185043.GA117678@magnolia
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
This should use the new folio_buffers() instead of page_has_buffers().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Primarily this series converts some of the address_space operations
to take a folio instead of a page.
->is_partially_uptodate() takes a folio instead of a page and changes the
type of the 'from' and 'count' arguments to make it obvious they're bytes.
->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
->launder_page() becomes ->launder_folio()
->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
=cOiq
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
more bug fixes and clean ups in the ext4 fast_commit feature (most
notably, in the tracepoints). In the jbd2 layer, the t_handle_lock
spinlock has been removed, with the last place where it was actually
needed replaced with an atomic cmpxchg.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEOrBXt+eNlFyMVZH70292m8EYBPAFAmI5QPsACgkQ0292m8EY
BPD99Q//ZO7L0XUicNEsuxvtXDj4GBWg/R62JXcHsIetIRhCpM4+L2vS3kZVXCBh
Pd+QMYsfAhuXFK3yK5MfvAb+XJ2e0iVNaw9ke2ueAsrGidlKWOstbMWWqEa4Dcr2
age1DdrRB5hXIFxOg+Qz8bPMpP9XfxEiMqT2K7PgB+R/Z2chBbwPJ/G1VmdOMNGe
FzxZgvXFY5EIJgjqj09ioD5pUSlAj5sgI4y/Dcv4liW11kvgBxg7Mm8ZouoPRHa3
gZF5mj/40izhD8gYPmZJtHFFYuqBBLsrmJnjUF2Y2yvZt5I14j4w/5OfcjuJqWn6
t2mlnk8/o6ZHzyuUV0/8/x35rZBlwqMq9ep6L9pN07XVB0/xOMG7a59pWOElhFGp
sG/ELhP251ZE+glWC9/X1VjnhM+agduiWAUY0wF0X62LM3sw6sVO3q6hsI1EKunN
O5VlvzPXNTZn9bcKzXzpvbxP9vF6s4exf32rMxd1rfSptE6myavGwss7MlB1HM7v
xiM/A4AZT/qlUH9eOJxF1aGcswpxnFwTwHKZqBVW32iyG9mbqRT2lKajx4WstowU
pdI4X7YfF7GhI5tMFR5S4Fa9FiOivO4EEc6z/db753kVuxSjwtyDd8ip56ZuJUhU
LqpwZO2bh+UhR0vRO0pQH/ZqCzCfCaJTGtBwAvJUwo3g/zRTmoc=
=naSC
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Fix some bugs in converting ext4 to use the new mount API, as well as
more bug fixes and clean ups in the ext4 fast_commit feature (most
notably, in the tracepoints).
In the jbd2 layer, the t_handle_lock spinlock has been removed, with
the last place where it was actually needed replaced with an atomic
cmpxchg"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (35 commits)
ext4: fix kernel doc warnings
ext4: fix remaining two trace events to use same printk convention
ext4: add commit tid info in ext4_fc_commit_start/stop trace events
ext4: add commit_tid info in jbd debug log
ext4: add transaction tid info in fc_track events
ext4: add new trace event in ext4_fc_cleanup
ext4: return early for non-eligible fast_commit track events
ext4: do not call FC trace event in ext4_fc_commit() if FS does not support FC
ext4: convert ext4_fc_track_dentry type events to use event class
ext4: fix ext4_fc_stats trace point
ext4: remove unused enum EXT4_FC_COMMIT_FAILED
ext4: warn when dirtying page w/o buffers in data=journal mode
doc: fixed a typo in ext4 documentation
ext4: make mb_optimize_scan performance mount option work with extents
ext4: make mb_optimize_scan option work with set/unset mount cmd
ext4: don't BUG if someone dirty pages without asking ext4 first
ext4: remove redundant assignment to variable split_flag1
ext4: fix underflow in ext4_max_bitmap_size()
ext4: fix ext4_mb_clear_bb() kernel-doc comment
ext4: fix fs corruption when tring to remove a non-empty directory with IO error
...
This is a mechanical change.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Convert all callers; mostly this is just changing the aops to point
at it, but a few implementations need a little more work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Extensive changes, but fairly mechanical.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
We used to have to use noop_invalidatepage() to prevent
block_invalidatepage() from being called, but that behaviour is now gone.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Remove special-casing of a NULL invalidatepage, since there is no
more block_invalidatepage.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Instead of calling ->invalidatepage directly, use folio_invalidate().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Recently I've got a report of BUG_ON trigerring during transaction
commit in ext4_journalled_writepage_callback() because we've spotted a
dirty page without buffers. Add WARN_ON_ONCE to
ext4_journalled_set_page_dirty() to catch the problematic condition
earlier where we have better chance of understanding which code path is
creating dirty data without preparing the page properly. Also update the
comment with current information when we are at it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220310101832.5645-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance. A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].
This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).
So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed. More discussion and background can be found in the
thread starting at [2].
[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com
Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.
Therefore, make ext4 support DIO on files that are using inline
encryption. Since ext4 uses iomap for DIO, and fscrypt support was
already added to iomap DIO, this just requires two small changes:
- Let DIO proceed when supported, by checking fscrypt_dio_supported()
instead of assuming that encrypted files never support DIO.
- In ext4_iomap_begin(), use fscrypt_limit_io_blocks() to limit the
length of the mapping in the rare case where a DUN discontiguity
occurs in the middle of an extent. The iomap DIO implementation
requires this, since it assumes that it can submit a bio covering (up
to) the whole mapping, without checking fscrypt constraints itself.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220128233940.79464-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>