Fuzzers like to scribble over sb_bsize_shift but in reality it's very
unlikely that this field would be corrupted on its own. Nevertheless it
should be checked to avoid the possibility of messy mount errors due to
bad calculations. It's always a fixed value based on the block size so
we can just check that it's the expected value.
Tested with:
mkfs.gfs2 -O -p lock_nolock /dev/vdb
for i in 0 -1 64 65 32 33; do
gfs2_edit -p sb field sb_bsize_shift $i /dev/vdb
mount /dev/vdb /mnt/test && umount /mnt/test
done
Before this patch we get a withdraw after
[ 76.413681] gfs2: fsid=loop0.0: fatal: invalid metadata block
[ 76.413681] bh = 19 (type: exp=5, found=4)
[ 76.413681] function = gfs2_meta_buffer, file = fs/gfs2/meta_io.c, line = 492
and with UBSAN configured we also get complaints like
[ 76.373395] UBSAN: shift-out-of-bounds in fs/gfs2/ops_fstype.c:295:19
[ 76.373815] shift exponent 4294967287 is too large for 64-bit type 'long unsigned int'
After the patch, these complaints don't appear, mount fails immediately
and we get an explanation in dmesg.
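As a rough userspace illustration of the idea (not the kernel patch
itself), the check can be sketched as below. The struct and function
names are hypothetical stand-ins, and the sketch assumes the shift is
valid only when it equals log2 of a power-of-two block size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two superblock fields involved. */
struct toy_sb {
	uint32_t bsize;        /* block size in bytes */
	uint32_t bsize_shift;  /* should equal log2(bsize) */
};

/* floor(log2(x)) for x > 0; never shifts by an untrusted amount. */
static unsigned int toy_ilog2(uint32_t x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

/* True only if bsize is a power of two and the shift matches it exactly. */
bool toy_sb_shift_ok(const struct toy_sb *sb)
{
	if (sb->bsize == 0 || (sb->bsize & (sb->bsize - 1)) != 0)
		return false;
	return sb->bsize_shift == toy_ilog2(sb->bsize);
}
```

Because the comparison is against a value derived from the block size,
any scribbled shift is rejected before it is ever used as a shift
exponent.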
Reported-by: syzbot+dcf33a7aae997956fe06@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that read I/O is always submitted when the buffer is already
locked, so stop using it. We also switch to the new bh_readahead()
helper for the readahead path.
Link: https://lkml.kernel.org/r/20220901133505.2510834-5-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch from strlcpy to strscpy and make sure that @count is the size of
the smaller of the source and destination buffers. This prevents
reading beyond the end of the source buffer when the source string isn't
null terminated.
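The bounding rule can be sketched in plain C as follows. This is an
illustration only: copy_bounded() is a hypothetical helper, not the
kernel's strscpy(), and the -1 return on truncation is a convention
chosen for this sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy a possibly unterminated string while reading
 * at most min(dst_size, src_size) bytes of src. The result is always
 * NUL-terminated when that minimum is non-zero. Returns the copied
 * length, or -1 if the string had to be truncated or nothing was copied.
 */
long copy_bounded(char *dst, size_t dst_size, const char *src, size_t src_size)
{
	size_t count = dst_size < src_size ? dst_size : src_size;
	const char *nul;
	size_t len;

	if (count == 0)
		return -1;

	/* memchr() inspects at most count bytes, so src is never overrun. */
	nul = memchr(src, '\0', count);
	if (nul) {
		len = (size_t)(nul - src);
		memcpy(dst, src, len + 1);
		return (long)len;
	}

	/* No terminator within count bytes: keep count - 1 bytes and terminate. */
	len = count - 1;
	memcpy(dst, src, len);
	dst[len] = '\0';
	return -1;
}
```

With a 4-byte source that lacks a terminator and a larger destination,
only the source's 4 bytes are examined, which is the property the
commit is after.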
Found by a modified version of syzkaller.
Suggested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
There are a couple of places in function do_xmote where normal processing
is circumvented due to withdraws in progress. However, since we bypass
most of do_xmote() we bypass telling dlm to lock the dlm lock, which
means dlm will never respond with a completion callback. Since the
completion callback ordinarily clears GLF_LOCK, this patch changes
function do_xmote to handle those situations more gracefully so the
file system may be unmounted after withdraw.
A very similar situation happens with the GLF_DEMOTE_IN_PROGRESS flag,
which is cleared by function finish_xmote(). Since the withdraw causes
us to skip the majority of do_xmote, it therefore also skips the call
to finish_xmote() so the DEMOTE_IN_PROGRESS flag needs to be cleared
manually.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a withdraw occurs, ordinary (not system) glocks may not be granted
anymore. Later, when the file system is unmounted, gfs2_gl_hash_clear()
tries to clear out all the glocks, but these un-grantable pending
waiters prevent some glocks from being freed. So the unmount hangs, at
least for its ten-minute timeout period.
This patch takes measures to remove any pending waiters from
the glocks that will never be granted. This allows the unmount to
proceed in a reasonable period of time.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a gfs2 file system is withdrawn it does iput on its journal to
allow recovery from another cluster node. If it's unable to get a
replacement inode for whatever reason, the journal descriptor would
still be pointing at the evicted inode. So when unmount clears out the
list of journals, it would do a second iput referencing the pointer.
To avoid this, set the inode pointer to NULL.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, delete_work_func() would check for the GLF_DEMOTE
flag on the iopen glock and if set, it would perform special processing.
However, there was a race whereby the GLF_DEMOTE flag could be set by
another process after the check. Then when it called
gfs2_lookup_by_inum() which calls gfs2_inode_lookup(), it tried to lock
the iopen glock in SH mode, but the GLF_DEMOTE flag prevented the
request from being granted. But the iopen glock could never be demoted
because that happens when the inode is evicted, and the evict was never
completed because of the failed lookup.
To fix that, change function gfs2_inode_lookup() so that when
GFS2_BLKST_UNLINKED inodes are searched, it uses the LM_FLAG_TRY flag
for the iopen glock. If the locking request fails, fail
gfs2_inode_lookup() with -EAGAIN so that delete_work_func() can retry
the operation later.
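The try-then-retry pattern described above can be modelled with a toy
example. Everything here (toy_glock, toy_lock_try() and friends) is
hypothetical and only mimics the control flow; it is not gfs2 code.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of a lock that cannot be granted while a demote is pending. */
struct toy_glock {
	bool demote_pending;
	bool held;
};

/* "Try" semantics: fail immediately with -EAGAIN instead of waiting. */
int toy_lock_try(struct toy_glock *gl)
{
	if (gl->demote_pending)
		return -EAGAIN;
	gl->held = true;
	return 0;
}

/* The lookup propagates -EAGAIN rather than blocking on the lock. */
int toy_lookup_unlinked(struct toy_glock *iopen)
{
	int err = toy_lock_try(iopen);

	if (err)
		return err;
	/* The real lookup work would happen here. */
	return 0;
}

/* Returns true when the work is done, false when it should be requeued. */
bool toy_delete_work(struct toy_glock *iopen)
{
	return toy_lookup_unlinked(iopen) != -EAGAIN;
}
```

The caller never sleeps on a lock that cannot be granted; it simply
tries again once the pending demote has had a chance to complete.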
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The DLM_LSFL_FS flag is set in lockspaces created directly
for a kernel user, as opposed to those lockspaces created
for user space applications. The user space libdlm allowed
this flag to be set for lockspaces created from user space,
but then used by a kernel user. No kernel user has ever
used this method, so remove the ability to do it.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
filldir_t instances (directory iterator callbacks) used to return 0 for
"OK, keep going" or -E... for "stop". Note that it's *NOT* how the
error values are reported - the rules for those are callback-dependent
and ->iterate{,_shared}() instances only care about zero vs. non-zero
(look at emit_dir() and friends).
So let's just return bool ("should we keep going?") - it's less confusing
that way. The choice between "true means keep going" and "true means
stop" is bikesheddable; we have two groups of callbacks -
do something for everything in directory, until we run into problem
and
find an entry in directory and do something to it.
The former tended to use 0/-E... conventions - -E<something> on failure.
The latter tended to use 0/1, 1 being "stop, we are done".
The callers treated anything non-zero as "stop", ignoring which
non-zero value they got.
"true means stop" would be more natural for the second group; "true
means keep going" - for the first one. I tried both variants and
the things like
if allocation failed
something = -ENOMEM;
return true;
just looked unnatural and asking for trouble.
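A small userspace sketch of the "true means keep going" convention,
using made-up names (toy_actor_t, toy_iterate(), toy_find_actor())
rather than the real filldir_t machinery:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Callback contract: return true to keep going, false to stop. */
typedef bool (*toy_actor_t)(void *ctx, const char *name);

/* Feed entries to the actor until it asks to stop; return how many it saw. */
size_t toy_iterate(const char *const *names, size_t n,
		   toy_actor_t actor, void *ctx)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!actor(ctx, names[i]))
			return i + 1;
	return n;
}

/* A "find one entry" style callback: stops as soon as it has a match. */
struct toy_find {
	const char *wanted;
	bool found;
};

bool toy_find_actor(void *ctx, const char *name)
{
	struct toy_find *f = ctx;

	if (strcmp(name, f->wanted) != 0)
		return true;	/* not ours, keep going */
	f->found = true;
	return false;		/* done, stop the walk */
}
```

Any error or result the callback wants to report travels through its
context structure, not through the return value.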
[folded suggestion from Matthew Wilcox <willy@infradead.org>]
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Remove iomap_writepage and all callers, since the mm apparently never
called the zonefs or gfs2 writepage functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more iomap updates from Darrick Wong:
"In the past 10 days or so I've not heard any ZOMG STOP style
complaints about removing ->writepage support from gfs2 or zonefs, so
here's the pull request removing them (and the underlying fs iomap
support) from the kernel:
- Remove iomap_writepage and all callers, since the mm apparently
never called the zonefs or gfs2 writepage functions"
* tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: remove iomap_writepage
zonefs: remove ->writepage
gfs2: remove ->writepage
gfs2: stop using generic_writepages in gfs2_ail1_start_one
* more new_sync_{read,write}() speedups - ITER_UBUF introduction
* ITER_PIPE cleanups
* unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
* making ITER_PIPE take high-order pages without splitting them
* handling copy_page_from_iter() for high-order pages properly
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more iov_iter updates from Al Viro:
- more new_sync_{read,write}() speedups - ITER_UBUF introduction
- ITER_PIPE cleanups
- unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
- making ITER_PIPE take high-order pages without splitting them
- handling copy_page_from_iter() for high-order pages properly
* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
fix copy_page_from_iter() for compound destinations
hugetlbfs: copy_page_to_iter() can deal with compound pages
copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
expand those iov_iter_advance()...
pipe_get_pages(): switch to append_pipe()
get rid of non-advancing variants
ceph: switch the last caller of iov_iter_get_pages_alloc()
9p: convert to advancing variant of iov_iter_get_pages_alloc()
af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: saner helper for page array allocation
fold __pipe_get_pages() into pipe_get_pages()
ITER_XARRAY: don't open-code DIV_ROUND_UP()
unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
unify xarray_get_pages() and xarray_get_pages_alloc()
unify pipe_get_pages() and pipe_get_pages_alloc()
iov_iter_get_pages(): sanity-check arguments
iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
...
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.
We are going to expose the things like ->write_iter() et al. to those
in subsequent commits.
New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.
DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.
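The relationship between the predicates can be pictured with a
simplified model. The enum and struct below are toy stand-ins, not the
kernel's struct iov_iter, and only the flavours needed for the example
are listed.

```c
#include <stdbool.h>

/* Toy subset of iterator flavours; names are illustrative only. */
enum toy_iter_type {
	TOY_ITER_IOVEC,
	TOY_ITER_KVEC,
	TOY_ITER_BVEC,
	TOY_ITER_PIPE,
	TOY_ITER_UBUF
};

struct toy_iter {
	enum toy_iter_type type;
};

bool toy_iter_is_iovec(const struct toy_iter *i)
{
	return i->type == TOY_ITER_IOVEC;
}

bool toy_iter_is_ubuf(const struct toy_iter *i)
{
	return i->type == TOY_ITER_UBUF;
}

/* True for both userland-backed flavours: IOVEC and UBUF. */
bool toy_user_backed_iter(const struct toy_iter *i)
{
	return toy_iter_is_iovec(i) || toy_iter_is_ubuf(i);
}
```

Code that asks "is this user memory?" wants the combined predicate;
code that goes on to walk an iovec array still has to check for the
IOVEC flavour specifically.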
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Instantiate glocks outside of the glock state engine, in the context of
the process taking the glock. This moves unnecessary complexity out
of the core glock code. Clean up the instantiate logic to be more
sensible.
- In gfs2_glock_async_wait(), cancel pending locking request upon
failure. Make sure all glocks are left in a consistent state.
- Various other minor cleanups and fixes.
Merge tag 'gfs2-v5.19-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Instantiate glocks outside of the glock state engine, in the context
of the process taking the glock. This moves unnecessary complexity
out of the core glock code. Clean up the instantiate logic to be more
sensible.
- In gfs2_glock_async_wait(), cancel pending locking request upon
failure. Make sure all glocks are left in a consistent state.
- Various other minor cleanups and fixes.
* tag 'gfs2-v5.19-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: List traversal in do_promote is safe
gfs2: do_promote glock holder stealing fix
gfs2: Use better variable name
gfs2: Make go_instantiate take a glock
gfs2: Add new go_held glock operation
gfs2: Revert 'Fix "truncate in progress" hang'
gfs2: Instantiate glocks ouside of glock state engine
gfs2: Fix up gfs2_glock_async_wait
gfs2: Minor gfs2_glock_nq_m cleanup
gfs2: Fix spelling mistake in comment
gfs2: Rewrap overlong comment in do_promote
gfs2: Remove redundant NULL check before kfree
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into their
own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
There is nothing iomap-specific about iomap_migratepage(), and it fits
a pattern used by several other filesystems, so move it to mm/migrate.c,
convert it to be filemap_migrate_folio() and convert the iomap filesystems
to use it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Use folio_put_refs() to perform only one atomic operation instead of two.
The other changes are straightforward conversions from page APIs to
their folio equivalents.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
->writepage is only used for single page writeback from memory reclaim,
and not called at all for cgroup writeback. Follow the lead of XFS
and remove ->writepage and rely entirely on ->writepages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Use filemap_fdatawrite_wbc instead of generic_writepages in
gfs2_ail1_start_one so that the function can also cope with address_space
operations that only implement ->writepages and to properly account
for cgroup writeback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags. Combine the first two
gfs2_submit_bhs() arguments into a single argument.
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-54-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both submit_bh() and ll_rw_block() accept a request operation type and
request flags as their first two arguments. Micro-optimize these two
functions by combining these first two arguments into a single argument.
This patch does not change the behavior of any of the modified code.
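One way to picture a combined "operation plus flags" argument is a
single integer with the operation in the low bits and the flags above
it. The layout, type and macro names below are assumptions made for
this sketch, not the kernel's definitions.

```c
#include <stdint.h>

/* Assumed layout: low 8 bits hold the operation, the rest hold flags. */
#define TOY_OP_BITS	8
#define TOY_OP_MASK	((UINT32_C(1) << TOY_OP_BITS) - 1)

enum toy_req_op {
	TOY_OP_READ = 0,
	TOY_OP_WRITE = 1
};

#define TOY_FLAG_SYNC	(UINT32_C(1) << 8)
#define TOY_FLAG_META	(UINT32_C(1) << 9)

typedef uint32_t toy_opf_t;

/* Pack an operation and its flags into one value. */
toy_opf_t toy_opf_make(enum toy_req_op op, uint32_t flags)
{
	return (toy_opf_t)op | (flags & ~TOY_OP_MASK);
}

/* Recover the operation from the packed value. */
enum toy_req_op toy_opf_op(toy_opf_t opf)
{
	return (enum toy_req_op)(opf & TOY_OP_MASK);
}

/* Recover the flags from the packed value. */
uint32_t toy_opf_flags(toy_opf_t opf)
{
	return opf & ~TOY_OP_MASK;
}
```

A caller that used to pass (op, flags) as two arguments passes the
single packed value, and the callee splits it again only where needed.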
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Song Liu <song@kernel.org> (for the md changes)
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.
This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to construct a name.
In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.
The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40
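The naming scheme can be reproduced in userspace with a short helper.
toy_shrinker_name() and toy_name_has_instance() are hypothetical names
for this sketch; they only show how the documented format maps onto a
format string.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<subsystem>-<shrinker_type>[:<instance>]-<id>" into buf.
 * Pass instance == NULL when there is no instance part.
 * Returns what snprintf() returns.
 */
int toy_shrinker_name(char *buf, size_t size, const char *subsystem,
		      const char *type, const char *instance, int id)
{
	if (instance)
		return snprintf(buf, size, "%s-%s:%s-%d",
				subsystem, type, instance, id);
	return snprintf(buf, size, "%s-%s-%d", subsystem, type, id);
}

/* A name carries an instance part if it contains a ':' separator. */
bool toy_name_has_instance(const char *name)
{
	return strchr(name, ':') != NULL;
}
```

For example, subsystem "sb", type "xfs", instance "vda1" and id 36
give the "sb-xfs:vda1-36" entry seen in the listing above.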
[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In do_promote(), we're never removing the current entry from the list
and so the list traversal is actually safe. Switch back to
list_for_each_entry().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In do_promote(), when the glock had no strong holders, we were
accidentally calling demote_incompat_holders() with new_gh == NULL, so
no weak holders were considered incompatible. Instead, the new holder
should have been passed in.
For doing that, the HIF_HOLDER flag needs to be set in new_gh to prevent
may_grant() from complaining. This means that the new holder will now
be recognized as a current holder, so skip over it explicitly in
demote_incompat_holders() to prevent it from being dequeued.
To further clarify things, we can now rename new_gh to current_gh in
demote_incompat_holders(); after all, the HIF_HOLDER flag is already set,
which means the new holder is already a current holder.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In do_promote() and add_to_queue(), use current_gh as the variable name
for the first strong holder we could find: this matches the variable
name in may_grant(), and more clearly indicates that we're interested in
one (any) of the current strong holders.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Make go_instantiate take a glock instead of a glock holder as its argument:
this handler is supposed to instantiate the object associated with the glock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Right now, inode_go_instantiate() contains functionality that relates to
how a glock is held rather than the glock itself, like waiting for
pending direct I/O to complete and completing interrupted truncates.
This code is meant to be run each time a holder is acquired, but
go_instantiate is actually only called once, when the glock is
instantiated.
To fix that, introduce a new go_held glock operation that is called each
time a glock holder is acquired. Move the holder specific code in
inode_go_instantiate() over to inode_go_held().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Now that interrupted truncates are completed in the context of the
process taking the glock, there is no need for the glock state engine to
delegate that task to gfs2_quotad or for quotad to perform those
truncates anymore. Get rid of the obsolete associated infrastructure.
Reverts commit 813e0c46c9 ("GFS2: Fix "truncate in progress" hang").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Instantiate glocks outside of the glock state engine: there is no real
reason for instantiating them inside the glock state engine; it only
complicates the code.
Instead, instantiate them in gfs2_glock_wait() and gfs2_glock_async_wait()
using the new gfs2_glock_holder_ready() helper. On top of that, the only
other place that acquires a glock without using gfs2_glock_wait() or
gfs2_glock_async_wait() is gfs2_upgrade_iopen_glock(), so call
gfs2_glock_holder_ready() there as well.
If a dinode has a pending truncate, the glock-specific instantiate function
for inodes wakes up the truncate function in the quota daemon. Waiting for
the completion of the truncate was previously done by the glock state
engine, but we now need to wait in inode_go_instantiate().
This also means that gfs2_instantiate() will now no longer return any
"special" error codes.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since commit 1fc05c8d84 ("gfs2: cancel timed-out glock requests"), a
pending locking request can be canceled by calling gfs2_glock_dq() on
the pending holder. In gfs2_glock_async_wait(), when we time out, use
that to cancel the remaining locking requests and dequeue the locking
requests already granted. That's simpler as well as more efficient than
waiting for all locking requests to eventually be granted and dequeuing
them then.
In addition, gfs2_glock_async_wait() promises that by the time the
function completes, all glocks are either granted or dequeued, but the
implementation doesn't keep that promise if individual locking requests
fail. Fix that as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add the GL_NOPID flag for the remaining glock holders which are not
associated with the current process.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a GL_NOPID flag to indicate that once a glock holder has been acquired, it
won't be associated with the current process anymore. This is useful for iopen
and flock glocks which are associated with open files, as well as journal glock
holders and similar which are associated with the filesystem.
Once GL_NOPID is used for all applicable glocks (see the next patches),
processes will no longer be falsely reported as holding glocks which they are
not actually holding in the glocks dump file. Unlike before, when a process is
reported as having "(ended)", this will indicate an actual bug.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Include flock glocks in the "glockfd" debugfs file. Those are similar to the
iopen glocks; while an open file is holding an flock, it is holding the file's
flock glock.
We cannot take f_fl_mutex in gfs2_glockfd_seq_show_flock() or else dumping the
"glockfd" file would block on flock operations. Instead, use the file->f_lock
spin lock to protect the f_fl_gh.gh_gl glock pointer.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a process has a gfs2 file open, the file is keeping a reference on the
underlying gfs2 inode, and the inode is keeping the inode's iopen glock held in
shared mode. In other words, the process depends on the iopen glock of each
open gfs2 file. Expose those dependencies in a new "glockfd" debugfs file.
The new debugfs file contains one line for each gfs2 file descriptor,
specifying the tgid, file descriptor number, and glock name, e.g.,
1601 6 5/816d
This list is compiled by iterating all tasks on the system using find_ge_pid(),
and all file descriptors of each task using task_lookup_next_fd_rcu(). To make
that work from gfs2, export those two functions.
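For anyone post-processing the file, a line in that format can be
produced and parsed as sketched below. The helper names are made up,
and the sketch assumes the glock name is "<type>/<number>" with the
number printed in hexadecimal, as the sample line suggests.

```c
#include <stdio.h>

/* Format one "tgid fd type/number" line; returns what snprintf() returns. */
int toy_glockfd_format(char *buf, size_t size, int tgid, int fd,
		       unsigned int gl_type, unsigned long long gl_number)
{
	return snprintf(buf, size, "%d %d %u/%llx",
			tgid, fd, gl_type, gl_number);
}

/* Parse one line back into its four fields; returns 0 on success, -1 otherwise. */
int toy_glockfd_parse(const char *line, int *tgid, int *fd,
		      unsigned int *gl_type, unsigned long long *gl_number)
{
	if (sscanf(line, "%d %d %u/%llx", tgid, fd, gl_type, gl_number) != 4)
		return -1;
	return 0;
}
```

Parsing the sample line yields tgid 1601, file descriptor 6, glock
type 5 and glock number 0x816d.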
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add state and flags arguments to gfs2_rlist_alloc() to make it somewhat more
obvious which state and flags an rlist uses. With that, stop knocking off
flags in gfs2_glock_nq_m() and its nq_m_sync() helper that are never set in the
first place.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rewrap the comment to keep the line length below 80 characters.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
kfree on NULL pointer is a no-op.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first argument
like ->read_folio
Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache
Pull page cache updates from Matthew Wilcox:
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first
argument like ->read_folio
* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
nilfs2: Fix some kernel-doc comments
Appoint myself page cache maintainer
fs: Remove aops->freepage
secretmem: Convert to free_folio
nfs: Convert to free_folio
orangefs: Convert to free_folio
fs: Add free_folio address space operation
fs: Convert drop_buffers() to use a folio
fs: Change try_to_free_buffers() to take a folio
jbd2: Convert release_buffer_page() to use a folio
jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
reiserfs: Convert release_buffer_page() to use a folio
fs: Remove last vestiges of releasepage
ubifs: Convert to release_folio
reiserfs: Convert to release_folio
orangefs: Convert to release_folio
ocfs2: Convert to release_folio
nilfs2: Remove comment about releasepage
nfs: Convert to release_folio
jfs: Convert to release_folio
...
Merge tag 'gfs2-v5.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Clean up the allocation of glocks that have an address space attached
- Quota locking fix and quota iomap conversion
- Fix the FITRIM error reporting
- Some list iterator cleanups
* tag 'gfs2-v5.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Convert function bh_get to use iomap
gfs2: use i_lock spin_lock for inode qadata
gfs2: Return more useful errors from gfs2_rgrp_send_discards()
gfs2: Use container_of() for gfs2_glock(aspace)
gfs2: Explain some direct I/O oddities
gfs2: replace 'found' with dedicated list iterator variable
Merge tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features:
- subpage:
- support for PAGE_SIZE > 4K (previously only 64K)
- make it work with raid56
- repair super block num_devices automatically if it does not match
the number of device items
- defrag can convert inline extents to regular extents, up to now
inline files were skipped but the setting of mount option
max_inline could affect the decision logic
- zoned:
- minimal accepted zone size is explicitly set to 4MiB
- make zone reclaim less aggressive and don't reclaim if there are
enough free zones
- add per-profile sysfs tunable of the reclaim threshold
- allow automatic block group reclaim for non-zoned filesystems, with
sysfs tunables
- tree-checker: new check, compare extent buffer owner against owner
rootid
Performance:
- avoid blocking on space reservation when doing nowait direct io
writes (+7% throughput for reads and writes)
- NOCOW write throughput improvement due to refined locking (+3%)
- send: reduce pressure to page cache by dropping extent pages right
after they're processed
Core:
- convert all radix trees to xarray
- add iterators for b-tree node items
- support printk message index
- use bulk page allocation for extent buffers
- switch to bio_alloc API, use on-stack bios where convenient, other
bio cleanups
- use rw lock for block groups to favor concurrent reads
- simplify workqueues, don't allocate high priority threads for all
normal queues as we need only one
- refactor scrub, process chunks based on their constraints and
similarity
- allocate direct io structures on stack and pass around only
pointers, avoids allocation and reduces potential error handling
Fixes:
- fix count of reserved transaction items for various inode
operations
- fix deadlock between concurrent dio writes when low on free data
space
- fix a few cases when zones need to be finished
VFS, iomap:
- add helper to check if sb write has started (usable for assertions)
- new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io"
* tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits)
btrfs: zoned: introduce a minimal zone size 4M and reject mount
btrfs: allow defrag to convert inline extents to regular extents
btrfs: add "0x" prefix for unsupported optional features
btrfs: do not account twice for inode ref when reserving metadata units
btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer
btrfs: send: avoid trashing the page cache
btrfs: send: keep the current inode open while processing it
btrfs: allocate the btrfs_dio_private as part of the iomap dio bio
btrfs: move struct btrfs_dio_private to inode.c
btrfs: remove the disk_bytenr in struct btrfs_dio_private
btrfs: allocate dio_data on stack
iomap: add per-iomap_iter private data
iomap: allow the file system to provide a bio_set for direct I/O
btrfs: add a btrfs_dio_rw wrapper
btrfs: zoned: zone finish unused block group
btrfs: zoned: properly finish block group on metadata write
btrfs: zoned: finish block group when there are no more allocatable bytes left
btrfs: zoned: consolidate zone finish functions
btrfs: zoned: introduce btrfs_zoned_bg_is_full
btrfs: improve error reporting in lookup_inline_extent_backref
...
Before this patch, function bh_get used block_map to figure out the
block it needed to read in from the quota_change file. This patch
changes it to use iomap directly to make it more efficient.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, functions gfs2_qa_get and _put used the i_rw_mutex to
prevent simultaneous access to its i_qadata. But i_rw_mutex is now used
for many other things, including iomap_begin and end, which causes a
conflict according to lockdep. We cannot just remove the lock since
simultaneous opens (gfs2_open -> gfs2_open_common -> gfs2_qa_get) can
then stomp on each other's values for i_qadata.
This patch solves the conflict by using the i_lock spin_lock in the inode
to prevent simultaneous access.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The bug that 27ca8273f ("gfs2: Make sure FITRIM minlen is rounded up to
fs block size") fixes was a little confusing as the user saw
"Input/output error" which masked the -EINVAL that sb_issue_discard()
returned.
sb_issue_discard() can fail for various reasons, so we should return its
return value from gfs2_rgrp_send_discards() to avoid all errors being
reported as IO errors.
This improves error reporting for FITRIM and makes no difference to the
-o discard code path because the return value from
gfs2_rgrp_send_discards() gets thrown away in that case (and the option
switches off). Presumably that's why it was ok to just return -EIO in
the past, before FITRIM was implemented.
Tested with xfstests.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Clang's structure layout randomization feature gets upset when it sees
struct address_space (which is randomized) cast to struct gfs2_glock.
This is due to seeing the mapping pointer as being treated as an array
of gfs2_glock, rather than "something else, before struct address_space":
In file included from fs/gfs2/acl.c:23:
fs/gfs2/meta_io.h:44:12: error: casting from randomized structure pointer type 'struct address_space *' to 'struct gfs2_glock *'
return (((struct gfs2_glock *)mapping) - 1)->gl_name.ln_sbd;
^
Replace the instances of open-coded pointer math with container_of()
usage, and update the allocator to match.
Some cleanups and conversion of gfs2_glock_get() and
gfs2_glock_dealloc() by Andreas.
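The difference between the two approaches can be sketched in userspace as follows. This is only an illustration under stated assumptions: the struct bodies are stand-ins for the kernel types, the wrapper layout and the mapping_to_glock() helper are hypothetical, and container_of() is re-defined locally rather than taken from the kernel headers.

```c
#include <stddef.h>

/* Stand-ins for the kernel types; the fields are illustrative only. */
struct address_space { int dummy; };
struct gfs2_glock { int gl_name; };

/* Assumed wrapper: a glock followed by its address space. */
struct gfs2_glock_aspace {
	struct gfs2_glock glock;
	struct address_space mapping;
};

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Recover the glock from its mapping by naming the containing
 * structure, instead of casting the mapping pointer to a glock
 * pointer and subtracting one.
 */
static struct gfs2_glock *mapping_to_glock(struct address_space *mapping)
{
	struct gfs2_glock_aspace *gla =
		container_of(mapping, struct gfs2_glock_aspace, mapping);

	return &gla->glock;
}
```

Because the offset comes from offsetof() on a named member, the lookup does not rely on the two objects being adjacent in memory or on their relative order.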
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202205041550.naKxwCBj-lkp@intel.com
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Bill Wendling <morbo@google.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"Here are the core block changes for 5.19. This contains:
- blk-throttle accounting fix (Laibin)
- Series removing redundant assignments (Michal)
- Expose bio cache via the bio_set, so that DM can use it (Mike)
- Finish off the bio allocation interface cleanups by dealing with
the weirdest member of the family. bio_kmalloc combines a kmalloc
for the bio and bio_vecs with a hidden bio_init call and magic
cleanup semantics (Christoph)
- Clean up the block layer API so that APIs consumed by file systems
are (almost) only struct block_device based, so that file systems
don't have to poke into block layer internals like the
request_queue (Christoph)
- Clean up the blk_execute_rq* API (Christoph)
- Clean up various loose ends in the blk-cgroup code to make it easier
to follow in preparation of reworking the blkcg assignment for bios
(Christoph)
- Fix use-after-free issues in BFQ when processes with merged queues
get moved to different cgroups (Jan)
- BFQ fixes (Jan)
- Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming,
Wolfgang, me)"
* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
blk-mq: fix typo in comment
bfq: Remove bfq_requeue_request_body()
bfq: Remove superfluous conversion from RQ_BIC()
bfq: Allow current waker to defend against a tentative one
bfq: Relax waker detection for shared queues
blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
blk-throttle: Set BIO_THROTTLED when bio has been throttled
blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
blk-cgroup: always terminate io.stat lines
block, bfq: make bfq_has_work() more accurate
block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
block: cleanup the VM accounting in submit_bio
block: Fix the bio.bi_opf comment
block: reorder the REQ_ flags
blk-iocost: combine local_stat and desc_stat to stat
block: improve the error message from bio_check_eod
block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
block: remove superfluous calls to blkcg_bio_issue_init
kthread: unexport kthread_blkcg
blk-cgroup: cleanup blkcg_maybe_throttle_current
...
Allow the file system to keep state for all iterations. For now only
wire it up for direct I/O as there is an immediate need for it there.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To move the list iterator variable into the list_for_each_entry_*()
macro in the future, the list iterator variable must not be used after
the loop body.
To *never* use the list iterator variable after the loop, it was
concluded to use a separate iterator variable instead of a
found boolean [1].
This removes the need for a found variable: simply checking whether the
separate variable was set determines whether the break/goto was hit.
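The pattern can be sketched outside the kernel as follows. This is an illustrative example only: it uses a plain NULL-terminated singly linked list with hypothetical names instead of the kernel's list_for_each_entry() macros, so it shows the shape of the change rather than the patched code.

```c
#include <stddef.h>

struct item {
	int key;
	struct item *next;
};

/*
 * Return the first item with the given key, or NULL. The loop cursor
 * "pos" is never used after the loop; "found" serves both as the result
 * and as the indication that the break was hit.
 */
static struct item *find_item(struct item *head, int key)
{
	struct item *found = NULL;
	struct item *pos;

	for (pos = head; pos; pos = pos->next) {
		if (pos->key == key) {
			found = pos;
			break;
		}
	}
	return found;
}
```

A caller tests the returned pointer against NULL where it previously would have tested a separate boolean.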
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
We're having unresolved issues with the glock holder auto-demotion mechanism
introduced in commit dc732906c2. This mechanism was assumed to be essential
for avoiding frequent short reads and writes until commit 296abc0d91
("gfs2: No short reads or writes upon glock contention"). Since then,
when the inode glock is lost, it is simply re-acquired and the operation
is resumed. This means that apart from the performance penalty, we
might as well drop the inode glock before faulting in pages, and
re-acquire it afterwards.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_file_buffered_write, to increase the likelihood that all the
user memory we're trying to write will be resident in memory, carry out
the write in chunks and fault in each chunk of user memory before trying
to write it. Otherwise, some workloads will trigger frequent short
"internal" writes, causing filesystem blocks to be allocated and then
partially deallocated again when writing into holes, which is wasteful
and breaks reservations.
Neither the chunked writes nor any of the short "internal" writes are
user visible.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Align the chunks that reads and writes are carried out in to the page
cache rather than the user buffers. This will be more efficient in
general, especially for allocating writes. Optimizing the case that the
user buffer is gfs2 backed isn't very useful; we only need to make sure
we won't deadlock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pull the return value test of the previous read or write operation out
of should_fault_in_pages(). In a following patch, we'll fault in pages
before the I/O and there will be no return value to check.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Instead of counting the number of bytes read from the filesystem,
functions gfs2_file_direct_read and gfs2_file_read_iter count the number
of bytes written into the user buffer. Conversely, functions
gfs2_file_direct_write and gfs2_file_buffered_write count the number of
bytes read from the user buffer. This is nothing but confusing, so
change the read functions to count how many bytes they have read, and
the write functions to count how many bytes they have written.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.
To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the next block boundary
to arrive at the end of the allocation. It would have to round up, but
the end of the allocation is also available as iomap->offset +
iomap->length, so just use that instead.
In addition, use round_up() for computing the start of the unused range.
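The arithmetic can be sketched in userspace as follows. This is a hedged illustration, not the patched function: the rounding macros are local re-definitions that assume a power-of-two block size, and the unused_range struct and unused_tail() helper are hypothetical names.

```c
#include <stdint.h>

/* Power-of-two rounding helpers, analogous to the kernel's macros. */
#define round_down(x, y) ((x) & ~((uint64_t)(y) - 1))
#define round_up(x, y)   round_down((x) + (y) - 1, (y))

/* Hypothetical description of the unused tail of an allocation. */
struct unused_range {
	uint64_t start;
	uint64_t len;
};

/*
 * The unused range starts at the first block boundary at or after the
 * end of the data actually written and ends where the mapping ends.
 */
static struct unused_range unused_tail(uint64_t pos, uint64_t written,
				       uint64_t iomap_offset,
				       uint64_t iomap_length,
				       uint64_t blocksize)
{
	struct unused_range r = { 0, 0 };
	uint64_t hstart = round_up(pos + written, blocksize);
	uint64_t hend = iomap_offset + iomap_length;

	if (hstart < hend) {
		r.start = hstart;
		r.len = hend - hstart;
	}
	return r;
}
```

With 4096-byte blocks, a 16384-byte mapping at offset 0, and 5000 bytes written, the unused range is bytes 8192 to 16383. Rounding the start down instead would give 4096, which would include the partially written second block.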
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
All but two of the callers already have a folio; pass a folio into
try_to_free_buffers(). This removes the last user of cancel_dirty_page()
so remove that wrapper function too.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Use a folio throughout gfs2_release_folio().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Change all the filesystems which used iomap_releasepage to use the
new function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
By making filler_t the same as read_folio, we can use the same function
for both in gfs2. We can push the use of folios down one more level
in jffs2 and nfs. We also increase type safety for future users of the
various read_cache_page() family of functions by forcing the parameter
to be a pointer to struct file (or NULL).
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
mpage_readpage still works in terms of pages, and has not been audited
for correctness with large folios, so include an assertion that the
filesystem is not passing it large folios. Convert all the filesystems
to call mpage_read_folio() instead of mpage_readpage().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Commit 00bfe02f47 ("gfs2: Fix mmap + page fault deadlocks for buffered
I/O") changed gfs2_file_read_iter() and gfs2_file_buffered_write() to
allow dropping the inode glock while faulting in user buffers. When the
lock was dropped, a short result was returned to indicate that the
operation was interrupted.
As pointed out by Linus (see the link below), this behavior is broken
and the operations should always re-acquire the inode glock and resume
the operation instead.
Link: https://lore.kernel.org/lkml/CAHk-=whaz-g_nOOoo8RRiWNjnv2R+h6_xk2F1J4TuSRxk1MtLw@mail.gmail.com/
Fixes: 00bfe02f47 ("gfs2: Fix mmap + page fault deadlocks for buffered I/O")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Only re-check for direct I/O writes past the end of the file after
re-acquiring the inode glock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Abstract away implementation details from file systems by providing a
block_device based helper to retrieve the discard granularity.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.
The only place that needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'gfs2-v5.17-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
- To avoid deadlocks, actively cancel dlm locking requests when we give
up on them.
Further dlm operations on the same lock will return -EBUSY until the
cancel has been completed, so in that case, wait and repeat. (This is
rare.)
- Lock inversion fixes in gfs2_inode_lookup() and gfs2_create_inode().
- Some more fallout from the gfs2 mmap + page fault deadlock fixes
(merged in commit c03098d4b9: "Merge tag 'gfs2-v5.15-rc5-mmap-fault'").
- Various other minor bug fixes and cleanups.
* tag 'gfs2-v5.17-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Make sure FITRIM minlen is rounded up to fs block size
gfs2: Make sure not to return short direct writes
gfs2: Remove dead code in gfs2_file_read_iter
gfs2: Fix gfs2_file_buffered_write endless loop workaround
gfs2: Minor retry logic cleanup
gfs2: Disable page faults during lockless buffered reads
gfs2: Fix should_fault_in_pages() logic
gfs2: Remove return value for gfs2_indirect_init
gfs2: Initialize gh_error in gfs2_glock_nq
gfs2: Make use of list_is_first
gfs2: Switch lock order of inode and iopen glock
gfs2: cancel timed-out glock requests
gfs2: Expect -EBUSY after canceling dlm locking requests
gfs2: gfs2_setattr_size error path fix
gfs2: assign rgrp glock before compute_bitstructs
Per fstrim(8) we must round up the minlen argument to the fs block size.
The current calculation doesn't take into account devices that have a
discard granularity and requested minlen less than 1 fs block, so the
value can get shifted away to zero in the translation to fs blocks.
The zero minlen passed to gfs2_rgrp_send_discards() then allows
sb_issue_discard() to be called with nr_sects == 0 which returns -EINVAL
and results in gfs2_rgrp_send_discards() returning -EIO.
Make sure minlen is never < 1 fs block by taking the max of the
requested minlen and the fs block size before comparing to the device's
discard granularity and shifting to fs blocks.
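The calculation described above can be sketched in userspace as follows. This is a minimal illustration with hypothetical helper names, assuming all sizes are in bytes and the fs block size is 1 << bs_shift.

```c
#include <stdint.h>

static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Convert a FITRIM minlen in bytes to filesystem blocks.
 * granularity is the device's discard granularity in bytes.
 */
static uint64_t fitrim_minlen_blocks(uint64_t minlen, uint64_t granularity,
				     unsigned int bs_shift)
{
	uint64_t bsize = (uint64_t)1 << bs_shift;

	minlen = max_u64(minlen, bsize);       /* never less than one block */
	minlen = max_u64(minlen, granularity); /* honor the device */
	return minlen >> bs_shift;
}
```

For a requested minlen of 512, a granularity of 512, and 4096-byte blocks (bs_shift of 12), this yields 1 block. Without the first max, 512 >> 12 would be shifted away to 0.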
Fixes: 076f0faa76 ("GFS2: Fix FITRIM argument handling")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block
Pull NVMe write streams removal from Jens Axboe:
"This removes the write streams support in NVMe. No vendor ever really
shipped working support for this, and they are not interested in
supporting it.
With the NVMe support gone, we have nothing in the tree that supports
this. Remove passing around of the hints.
The only discussion point in this patchset imho is the fact that the
file specific write hint setting/getting fcntl helpers will now return
-1/EINVAL like they did before we supported write hints. No known
applications use these functions, I only know of one prototype that I
helped do for RocksDB, and that's not used. That said, with a change
like this, it's always a bit controversial. Alternatively, we could
just make them return 0 and pretend it worked. It's placement based
hints after all"
* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
fs: remove fs.f_write_hint
fs: remove kiocb.ki_hint
block: remove the per-bio/request write hint
nvme: remove support or stream based temperature hint
When direct writes fail with -ENOTBLK because we're writing into a
hole (gfs2_iomap_begin()) or because of a page invalidation failure
(iomap_dio_rw()), we're falling back to buffered writes. In that case,
when we lose the inode glock in gfs2_file_buffered_write(), we want to
re-acquire it instead of returning a short write.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function iomap_dio_rw() only returns -ENOTBLK for write requests and
gfs2_file_direct_read() no longer returns -ENOTBLK since commit
1d45bb7f9d ("gfs2: Use iomap for stuffed direct I/O reads"), so there
is no need to check for -ENOTBLK in gfs2_file_read_iter() anymore.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since commit 554c577cee, gfs2_file_buffered_write() can accidentally
return a truncated iov_iter, which might confuse callers. Fix that.
Fixes: 554c577cee ("gfs2: Prevent endless loops in gfs2_file_buffered_write")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
During lockless buffered reads, filemap_read() holds page cache page
references while trying to copy data to the user-space buffer. The
calling process isn't holding the inode glock, but the page references
it holds prevent those pages from being removed from the page cache, and
that prevents the underlying inode glock from being moved to another
node. Thus, we can end up in the same kinds of distributed deadlock
situations as with normal (non-lockless) buffered reads.
Fix that by disabling page faults during lockless reads as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fix the fault-in window size logic:
* Use a maximum window size of 1 MiB instead of BIO_MAX_VECS * PAGE_SIZE.
The previous window size was always one page because the pages variable
was accidentally being defined and then redefined in
should_fault_in_pages().
* The nr_dirtied heuristic for guessing when there might be memory
pressure often results in very small window sizes. Don't let
nr_dirtied drop below 8 pages (as btrfs does).
* Compute the window size in units of bytes, not pages.
* Account for page overlap (unaligned iterators).
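Taken together, the points above suggest a computation along the following lines. This is a rough userspace sketch under loud assumptions: 4 KiB pages, a hypothetical fault_in_window() helper, nr_dirtied passed in as a plain number, and "page overlap" interpreted as subtracting the in-page offset of the iterator's start. It is not the kernel function.

```c
#include <stddef.h>

#define PAGE_SHIFT 12                        /* assumed 4 KiB pages */
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)
#define SZ_1M      ((size_t)1 << 20)

/*
 * nr_dirtied is the heuristic page budget; start is the (possibly
 * unaligned) start of the iterator. Returns the window size in bytes.
 */
static size_t fault_in_window(long nr_dirtied, unsigned long long start)
{
	size_t offs = (size_t)(start & (PAGE_SIZE - 1));
	size_t size;

	if (nr_dirtied < 8)          /* floor of 8 pages */
		nr_dirtied = 8;
	size = (size_t)nr_dirtied << PAGE_SHIFT;
	if (size > SZ_1M)            /* cap at 1 MiB */
		size = SZ_1M;
	return size - offs;          /* shrink by the in-page offset */
}
```

Working in bytes throughout avoids the page/byte unit confusion mentioned in the third point.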
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().
Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert all callers; mostly this is just changing the aops to point
at it, but a few implementations need a little more work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Remove special-casing of a NULL invalidatepage, since there is no
more block_invalidatepage.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Use iomap_invalidate_folio() in all the iomap-based filesystems
and rename the iomap_invalidatepage tracepoint.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The return value from function gfs2_indirect_init is never used, so
remove it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The gh_error field of a glock holder is initialized to zero in
gfs2_holder_init(). When a locking operation fails, gh_error is set to
an error code; when it succeeds, the gh_error value is left unchanged.
The field isn't initialized in gfs2_holder_reinit(), which is a problem.
Instead of fixing that directly, initialize gh_error in gfs2_glock_nq().
That also obsoletes the assignment in do_flock().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
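The gh_error change described above can be illustrated with a small
userspace model. This is only a sketch: struct holder, holder_reinit(),
holder_enqueue() and the 'fail' knob are hypothetical stand-ins, not the
gfs2 code. It shows why clearing the error at enqueue time covers the
re-initialization path as well.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a glock holder. */
struct holder {
	int gh_error;	/* result of the most recent locking attempt */
};

/* Models a re-initialization that leaves gh_error untouched. */
void holder_reinit(struct holder *gh)
{
	(void)gh;	/* other fields would be reset here; gh_error is not */
}

/*
 * Models enqueueing a lock request. Clearing gh_error on entry means a
 * stale error from an earlier attempt cannot leak into this one.
 * 'fail' is an illustrative knob, not part of any real API.
 */
int holder_enqueue(struct holder *gh, int fail)
{
	assert(gh != NULL);
	gh->gh_error = 0;
	if (fail)
		gh->gh_error = -EAGAIN;
	return gh->gh_error;
}
```

With the reset done at enqueue time, callers no longer need to clear the
field themselves before reusing a holder.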
This patch tries to fix the continual ABBA deadlocks we keep having
between the iopen and inode glocks. This switches the lock order in
gfs2_inode_lookup and gfs2_create_inode so the iopen glock is always
locked first.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The gfs2 evict code tries to upgrade the iopen glock from SH to EX. If
the attempt to upgrade times out, gfs2 needs to tell dlm to cancel the
lock request or it can deadlock. We also need to wake up the process
waiting for the lock when dlm sends its AST back to gfs2.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Due to the asynchronous nature of the dlm api, when we request a pending
locking request to be canceled with dlm_unlock(DLM_LKF_CANCEL), the
locking request will either complete before it could be canceled, or the
cancellation will succeed. In either case, gdlm_ast will be called once
and the status will indicate the outcome of the locking request, with
-DLM_ECANCEL indicating a canceled request.
Inside dlm, when a locking request completes before its cancel request
could be processed, gdlm_ast will be called, but the lock will still be
considered busy until a DLM_MSG_CANCEL_REPLY message completes the
cancel request. During that time, successive dlm_lock() or dlm_unlock()
requests for that lock will return -EBUSY. In other words, waiting for
the gdlm_ast call before issuing the next locking request is not enough.
There is no way of waiting for a cancel request to actually complete,
either.
We rarely cancel locking requests, but when we do, we don't know when
the next locking request for that lock will occur. This means that any
dlm_lock() or dlm_unlock() call can potentially return -EBUSY. When
that happens, this patch simply repeats the request after a short pause.
This workaround could be improved upon by tracking for which dlm locks
cancel requests have been issued, but that isn't strictly necessary and
it would complicate the code. We haven't seen -EBUSY errors from dlm
without cancel requests.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
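The -EBUSY workaround described above amounts to a retry loop around the
request. The following userspace sketch shows the general shape under
stated assumptions: submit_with_busy_retry(), lock_request_fn and
busy_n_times() are hypothetical names, and the 1 ms pause is an arbitrary
illustrative value rather than what the patch uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical request callback: returns 0 or a negative errno. */
typedef int (*lock_request_fn)(void *ctx);

/*
 * Repeat a request for as long as it reports -EBUSY, pausing briefly
 * between attempts. Any other result is passed back to the caller.
 */
int submit_with_busy_retry(lock_request_fn request, void *ctx)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
	int ret;

	assert(request != NULL);
	for (;;) {
		ret = request(ctx);
		if (ret != -EBUSY)
			return ret;
		nanosleep(&pause, NULL);
	}
}

/* Test double: reports -EBUSY until *remaining reaches zero. */
int busy_n_times(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EBUSY;
	}
	return 0;
}
```

The loop is unbounded on purpose in this sketch; it relies on the busy
state being transient, as the description above argues it is.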
When gfs2_setattr_size() fails, it calls gfs2_rs_delete(ip, NULL) to get
rid of any reservations the inode may have. Instead, it should pass in
the inode's write count as the second parameter to allow
gfs2_rs_delete() to figure out if the inode has any writers left.
In a next step, there are two instances of gfs2_rs_delete(ip, NULL) left
where we know that there can be no other users of the inode. Replace
those with gfs2_rs_deltree(&ip->i_res) to avoid the unnecessary write
count check.
With that, gfs2_rs_delete() is only called with the inode's actual write
count, so get rid of the second parameter.
Fixes: a097dc7e24 ("GFS2: Make rgrp reservations part of the gfs2_inode structure")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function read_rindex_entry called compute_bitstructs
before it allocated a glock for the rgrp. But if compute_bitstructs found
a problem with the rgrp, it called gfs2_consist_rgrpd, and that called
gfs2_dump_glock for rgd->rd_gl which had not yet been assigned.
  read_rindex_entry
    compute_bitstructs
      gfs2_consist_rgrpd
        gfs2_dump_glock <--------- rgd->rd_gl was not set.
This patch changes read_rindex_entry so it assigns an rgrp glock before
calling compute_bitstructs so gfs2_dump_glock does not reference an
unassigned pointer. If an error is discovered, the glock must also be
put, so a new goto and label were added.
Reported-by: syzbot+c6fd14145e2f62ca0784@syzkaller.appspotmail.com
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
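The reordering described above (assign the lock object first, validate
second, release on failure through a dedicated label) can be sketched in
userspace as follows. All names here (struct group, struct lock_obj,
validate_layout(), read_entry(), live_locks) are hypothetical and only
model the control flow, not the gfs2 data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a resource group and its lock object. */
struct lock_obj { int refs; };
struct group {
	struct lock_obj *gl;	/* plays the role of rd_gl */
	int bad_layout;		/* non-zero makes validation fail */
};

int live_locks;	/* counts lock objects currently allocated */

/* Diagnostic path: looks at grp->gl, so it must already be set. */
int validate_layout(const struct group *grp)
{
	assert(grp->gl != NULL);
	return grp->bad_layout ? -EIO : 0;
}

int read_entry(struct group *grp)
{
	int error;

	/* Assign the lock object before anything can report on it. */
	grp->gl = malloc(sizeof(*grp->gl));
	if (!grp->gl)
		return -ENOMEM;
	grp->gl->refs = 1;
	live_locks++;

	error = validate_layout(grp);
	if (error)
		goto fail_put;
	return 0;

fail_put:
	/* Validation failed: drop the lock object taken above. */
	free(grp->gl);
	grp->gl = NULL;
	live_locks--;
	return error;
}
```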
When a file is opened for writing, the vfs code (do_dentry_open)
calls get_write_access for the inode, thus incrementing the inode's write
count. That writer normally then creates a multi-block reservation for
the inode (i_res) that can be re-used by other writers, which speeds up
writes for applications that stupidly loop on open/write/close.
When the writes are all done, the multi-block reservation should be
deleted when the file is closed by the last "writer."
Commit 0ec9b9ea4f broke that concept when it moved the call to
gfs2_rs_delete before the check for FMODE_WRITE. Non-writers have no
business removing the multi-block reservations of writers. In fact, if
someone opens and closes the file for RO while a writer has a
multi-block reservation, the RO closer will delete the reservation
midway through the write, and this results in:
kernel BUG at fs/gfs2/rgrp.c:677! (or thereabouts) which is:
BUG_ON(rs->rs_requested); from function gfs2_rs_deltree.
This patch moves the check back inside the check for FMODE_WRITE.
Fixes: 0ec9b9ea4f ("gfs2: Check for active reservation in gfs2_release")
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
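The rule restored by the fix above is that only a closing writer may
drop the reservation. A deliberately simplified userspace model follows;
struct inode_model, open_file() and release_file() are hypothetical, and
the write-count handling is condensed compared to what the VFS and gfs2
actually do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified inode: a writer count and a reservation flag. */
struct inode_model {
	int write_count;	/* number of open-for-write users */
	bool has_reservation;	/* stands in for the i_res reservation */
};

void open_file(struct inode_model *ip, bool for_write)
{
	if (for_write) {
		ip->write_count++;
		ip->has_reservation = true;	/* writers set up a reservation */
	}
}

/*
 * Only a closing writer may drop the reservation, and only when it is
 * the last writer. Read-only closers leave it untouched.
 */
void release_file(struct inode_model *ip, bool for_write)
{
	if (!for_write)
		return;
	assert(ip->write_count > 0);
	if (--ip->write_count == 0)
		ip->has_reservation = false;
}
```

In this model a read-only open and close in the middle of a write leaves
the reservation in place, which is the behaviour the fix is after.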
It turns out that the might_sleep() call that commit 660a6126f8 adds
is triggering occasional data corruption in testing. We're not sure
about the root cause yet, but since this commit was added as a debugging
aid only, revert it for now.
This reverts commit 660a6126f8.
Fixes: 660a6126f8 ("gfs2: check context in gfs2_glock_put")
Cc: stable@vger.kernel.org # v5.16+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no good reason to keep genhd.h separate from the main blkdev.h
header that includes it. So fold the contents of genhd.h into blkdev.h
and remove genhd.h entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220124093913.742411-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here is the set of changes for the driver core for 5.17-rc1.
Lots of little things here, including:
- kobj_type cleanups
- auxiliary_bus documentation updates
- auxiliary_device conversions for some drivers (relevant
subsystems all have provided acks for these)
- kernfs lock contention reduction for some workloads
- other tiny cleanups and changes.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'driver-core-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of changes for the driver core for 5.17-rc1.
Lots of little things here, including:
- kobj_type cleanups
- auxiliary_bus documentation updates
- auxiliary_device conversions for some drivers (relevant subsystems
all have provided acks for these)
- kernfs lock contention reduction for some workloads
- other tiny cleanups and changes.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (43 commits)
kobject documentation: remove default_attrs information
drivers/firmware: Add missing platform_device_put() in sysfb_create_simplefb
debugfs: lockdown: Allow reading debugfs files that are not world readable
driver core: Make bus notifiers in right order in really_probe()
driver core: Move driver_sysfs_remove() after driver_sysfs_add()
firmware: edd: remove empty default_attrs array
firmware: dmi-sysfs: use default_groups in kobj_type
qemu_fw_cfg: use default_groups in kobj_type
firmware: memmap: use default_groups in kobj_type
sh: sq: use default_groups in kobj_type
headers/uninline: Uninline single-use function: kobject_has_children()
devtmpfs: mount with noexec and nosuid
driver core: Simplify async probe test code by using ktime_ms_delta()
nilfs2: use default_groups in kobj_type
kobject: remove kset from struct kset_uevent_ops callbacks
driver core: make kobj_type constant.
driver core: platform: document registration-failure requirement
vdpa/mlx5: Use auxiliary_device driver data helpers
net/mlx5e: Use auxiliary_device driver data helpers
soundwire: intel: Use auxiliary_device driver data helpers
...
Before this patch, glock dumps would not dump the gl_object for iopen
glocks. This information can help us debug problems related to eviction:
when an iopen glock is blocked we can see the status of its underlying
inode and its flags, etc.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
There is no need to pass the pointer to the kset in the struct
kset_uevent_ops callbacks as no one uses it, so just remove that pointer
entirely.
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Wedson Almeida Filho <wedsonaf@google.com>
Link: https://lore.kernel.org/r/20211227163924.3970661-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The description of gfs2_instantiate accidentally lists a glock argument,
but the function takes a glock holder.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function rgrp_go_inval calls gfs2_rgrp_brelse to invalidate the
in-core rgrp structures. After the call it set GLF_INSTANTIATE_NEEDED,
which is redundant, since gfs2_rgrp_brelse also sets it.
This patch simply removes the redundant set_bit.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The function name in the kernel-doc comment wasn't updated when the
function was renamed.
Fixes: b016d9a84a ("gfs2: Save ip from gfs2_glock_nq_init")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When gfs2_lookup_by_inum() calls gfs2_inode_lookup() for an uncached
inode, gfs2_inode_lookup() will place a new tentative inode into the
inode cache before verifying that there is a valid inode at the given
address. This can race with gfs2_create_inode() which doesn't check for
duplicate inodes. gfs2_create_inode() will try to assign the new inode
to the corresponding inode glock, and glock_set_object() will complain
that the glock is still in use by gfs2_inode_lookup's tentative inode.
We noticed this bug after adding commit 486408d690 ("gfs2: Cancel
remote delete work asynchronously") which allowed delete_work_func() to
race with gfs2_create_inode(), but the same race exists for
open-by-handle.
Fix that by switching from insert_inode_hash() to
insert_inode_locked4(), which does check for duplicate inodes. We know
we've just managed to allocate the new inode, so an inode tentatively
created by gfs2_inode_lookup() will eventually go away and
insert_inode_locked4() will always succeed.
In addition, don't flush the inode glock work anymore (this can now only
make things worse) and clean up glock_{set,clear}_object for the inode
glock somewhat.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rework gfs2_inode_lookup() to only set up the new inode's glocks after
verifying that the new inode is valid.
There is no need for flushing the inode glock work queue anymore now,
so remove that as well.
While at it, get rid of the useless wrapper around iget5_locked() and
its unnecessary is_bad_inode() check.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_inode_lookup, once the inode has been looked up, we check if the
inode generation (no_formal_ino) is the one we're looking for. If it
isn't and the inode wasn't in the inode cache, we discard the newly
looked up inode. This is unnecessary, complicates the code, and makes
future changes to gfs2_inode_lookup harder, so change the code to retain
newly looked up inodes instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When we mock up a temporary holder in gfs2_glock_cb to demote weak holders in
response to a remote locking conflict, we don't set the HIF_HOLDER flag. This
causes function may_grant to BUG. Fix by setting the missing HIF_HOLDER flag
in the mock glock holder.
In addition, define the mock glock holder where it is used.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Currently, instead of performing a short write,
iomap_file_buffered_write will fail when part of its iov iterator cannot
be read. In contrast, gfs2_file_buffered_write will loop around if it
can read part of the iov iterator, so we can end up in an endless loop.
This should be fixed in iomap_file_buffered_write (and also
generic_perform_write), but this comes a bit late in the 5.16
development cycle, so work around it in the filesystem by
trimming the iov iterator to the known-good size for now.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function demote_incompat_holders iterates over the list of glock holders
with list_for_each_entry, and it then sometimes removes the current
holder from the list. This will get the loop stuck; we must use
list_for_each_entry_safe instead.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
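The iteration problem described above can be reproduced with a minimal
userspace list. This sketch is loosely modelled on the kernel's circular
lists, but struct node, the 'weak' marker and the helper names are
hypothetical. The point is that the successor is saved before the
current entry is unlinked, which is what the _safe iterators do.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list. */
struct node {
	struct node *prev, *next;
	int weak;	/* hypothetical marker for entries to remove */
};

void list_init(struct node *head)
{
	head->prev = head->next = head;
}

void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink the node and make it point at itself. */
void list_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Remove every 'weak' entry; 'tmp' holds the successor across removal. */
int remove_weak(struct node *head)
{
	struct node *cur, *tmp;
	int removed = 0;

	assert(head != NULL);
	for (cur = head->next, tmp = cur->next; cur != head;
	     cur = tmp, tmp = cur->next) {
		if (cur->weak) {
			list_del_init(cur);
			removed++;
		}
	}
	return removed;
}

int list_len(const struct node *head)
{
	const struct node *cur;
	int n = 0;

	for (cur = head->next; cur != head; cur = cur->next)
		n++;
	return n;
}
```

Without the saved successor, advancing through cur->next after
list_del_init() would land on the removed entry again, since it now
points at itself, and the loop would never reach the list head.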
Fix the length of holes reported at the end of a file: the length is
relative to the beginning of the extent, not the seek position which is
rounded down to the filesystem block size.
This bug went unnoticed for some time, but is now caught by the
following assertion in iomap_iter_done():
WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
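A worked example of the arithmetic behind the hole-length fix above,
under the assumption that the extent starts at the seek position rounded
down to the block size and the hole runs to the end of the file. The
names struct extent, hole_at_eof() and covers_pos() are hypothetical;
this is not the gfs2 code.

```c
#include <assert.h>
#include <stdint.h>

struct extent { uint64_t offset, length; };

/*
 * Describe the hole that runs from the block containing 'pos' to the
 * end of the file at 'size'. Requires pos < size and a power-of-two
 * block size.
 */
struct extent hole_at_eof(uint64_t pos, uint64_t size, uint64_t bsize)
{
	struct extent e;

	assert(pos < size);
	assert(bsize != 0 && (bsize & (bsize - 1)) == 0);
	e.offset = pos & ~(bsize - 1);	/* extent starts on a block boundary */
	e.length = size - e.offset;	/* measured from the extent start */
	return e;
}

/* Negation of the WARN_ON_ONCE condition: the extent reaches past 'pos'. */
int covers_pos(struct extent e, uint64_t pos)
{
	return e.offset + e.length > pos;
}
```

For pos = 5000, size = 5500 and a 4096-byte block, the extent starts at
4096 and its length is 1404, so it ends at 5500 and covers pos. Taking
the length relative to pos instead gives 500, so the extent would end at
4596, short of pos, which is the condition the assertion catches.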
Before this patch, evict would clear the iopen glock's gl_object after
releasing the inode glock. In the meantime, another process could reuse
the same block and thus glocks for a new inode. It would lock the inode
glock (exclusively), and then the iopen glock (shared). The shared
locking mode doesn't provide any ordering against the evict, so by the
time the iopen glock is reused, evict may not have gotten to setting
gl_object to NULL.
Fix that by releasing the iopen glock before the inode glock in
gfs2_evict_inode.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Replace test_bit() + set_bit() with test_and_set_bit() where we need an atomic
operation. Use clear_and_wake_up_bit() instead of open coding it.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
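Why a combined test-and-set matters can be shown with C11 atomics in
userspace. This is a sketch of the idea only: test_and_set_flag() and
clear_flag() are hypothetical helpers, not the kernel bitops, and the
wake-up half of clear_and_wake_up_bit() is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Atomically set bit 'nr' in '*flags' and report whether it was already
 * set. A separate test followed by a set would let two threads both see
 * the bit clear and both proceed.
 */
bool test_and_set_flag(atomic_ulong *flags, unsigned int nr)
{
	unsigned long mask;

	assert(nr < sizeof(unsigned long) * 8);
	mask = 1UL << nr;
	return atomic_fetch_or(flags, mask) & mask;
}

/* Atomically clear bit 'nr' in '*flags'. */
void clear_flag(atomic_ulong *flags, unsigned int nr)
{
	assert(nr < sizeof(unsigned long) * 8);
	atomic_fetch_and(flags, ~(1UL << nr));
}
```

Only the caller that sees a false return value has actually claimed the
bit; every other caller observes it as already set.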
* Fix a locking order inversion between the inode and iopen glocks in
gfs2_inode_lookup.
* Implement proper queuing of glock holders for glocks that require
instantiation (like reading an inode or bitmap blocks from disk).
Before, multiple glock holders could race with each other and
half-initialized objects could be exposed; the GL_SKIP flag further
exacerbated this problem.
* Fix a rare deadlock between inode lookup / creation and remote delete
work.
* Fix a rare scheduling-while-atomic bug in dlm during glock hash table
walks.
* Various other minor fixes and cleanups.
Merge tag 'gfs2-v5.15-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix a locking order inversion between the inode and iopen glocks in
gfs2_inode_lookup.
- Implement proper queuing of glock holders for glocks that require
instantiation (like reading an inode or bitmap blocks from disk).
Before, multiple glock holders could race with each other and
half-initialized objects could be exposed; the GL_SKIP flag further
exacerbated this problem.
- Fix a rare deadlock between inode lookup / creation and remote delete
work.
- Fix a rare scheduling-while-atomic bug in dlm during glock hash table
walks.
- Various other minor fixes and cleanups.
* tag 'gfs2-v5.15-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (21 commits)
gfs2: Fix unused value warning in do_gfs2_set_flags()
gfs2: check context in gfs2_glock_put
gfs2: Fix glock_hash_walk bugs
gfs2: Cancel remote delete work asynchronously
gfs2: set glock object after nq
gfs2: remove RDF_UPTODATE flag
gfs2: Eliminate GIF_INVALID flag
gfs2: fix GL_SKIP node_scope problems
gfs2: split glock instantiation off from do_promote
gfs2: further simplify do_promote
gfs2: re-factor function do_promote
gfs2: Remove 'first' trace_gfs2_promote argument
gfs2: change go_lock to go_instantiate
gfs2: dump glocks from gfs2_consist_OBJ_i
gfs2: dequeue iopen holder in gfs2_inode_lookup error
gfs2: Save ip from gfs2_glock_nq_init
gfs2: Allow append and immutable bits to coexist
gfs2: Switch some BUG_ON to GLOCK_BUG_ON for debug
gfs2: move GL_SKIP check from glops to do_promote
gfs2: Add GL_SKIP holder flag to dump_holder
...
Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock. In the most basic scenario, that buffer will not be
resident and it will be mapped to the same file. Accessing the buffer
will trigger a page fault, and gfs2 will deadlock trying to take the
same inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled.
Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
"Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock.
In the most basic deadlock scenario, that buffer will not be resident
and it will be mapped to the same file. Accessing the buffer will
trigger a page fault, and gfs2 will deadlock trying to take the same
inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled"
* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix mmap + page fault deadlocks for direct I/O
iov_iter: Introduce nofault flag to disable page faults
gup: Introduce FOLL_NOFAULT flag to disable page faults
iomap: Add done_before argument to iomap_dio_rw
iomap: Support partial direct I/O on user copy failures
iomap: Fix iomap_dio_rw return value for user copies
gfs2: Fix mmap + page fault deadlocks for buffered I/O
gfs2: Eliminate ip->i_gh
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Introduce flag for glock holder auto-demotion
gfs2: Clean up function may_grant
gfs2: Add wrapper for iomap_file_buffered_write
iov_iter: Introduce fault_in_iov_iter_writeable
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
powerpc/kvm: Fix kvm_use_magic_page
iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- mq-deadline accounting improvements (Bart)
- blk-wbt timer fix (Andrea)
- Untangle the block layer includes (Christoph)
- Rework the poll support to be bio based, which will enable adding
support for polling for bio based drivers (Christoph)
- Block layer core support for multi-actuator drives (Damien)
- blk-crypto improvements (Eric)
- Batched tag allocation support (me)
- Request completion batching support (me)
- Plugging improvements (me)
- Shared tag set improvements (John)
- Concurrent queue quiesce support (Ming)
- Cache bdev in ->private_data for block devices (Pavel)
- bdev dio improvements (Pavel)
- Block device invalidation and block size improvements (Xie)
- Various cleanups, fixes, and improvements (Christoph, Jackie,
Masahira, Tejun, Yu, Pavel, Zheng, me)
* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
blk-mq-debugfs: Show active requests per queue for shared tags
block: improve readability of blk_mq_end_request_batch()
virtio-blk: Use blk_validate_block_size() to validate block size
loop: Use blk_validate_block_size() to validate block size
nbd: Use blk_validate_block_size() to validate block size
block: Add a helper to validate the block size
block: re-flow blk_mq_rq_ctx_init()
block: prefetch request to be initialized
block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
block: add rq_flags to struct blk_mq_alloc_data
block: add async version of bio_set_polled
block: kill DIO_MULTI_BIO
block: kill unused polling bits in __blkdev_direct_IO()
block: avoid extra iter advance with async iocb
block: Add independent access ranges support
blk-mq: don't issue request directly in case that current is to be blocked
sbitmap: silence data race warning
blk-cgroup: synchronize blkg creation against policy deactivation
block: refactor bio_iov_bvec_set()
block: add single bio async direct IO helper
...
Coverity complains of an unused value:
CID 119623 (#1 of 1): Unused value (UNUSED_VALUE)
assigned_value: Assigning value -1 to error here, but that stored value is
overwritten before it can be used.
237 error = -EPERM;
Fix it by removing the assignment.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a might_sleep call into gfs2_glock_put which can sleep in DLM when
the last reference is released. This will show problems earlier, and
not only when the last reference is put.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
So far, glock_hash_walk took a reference on each glock it iterated over, and it
was the examiner's responsibility to drop those references. Dropping the final
reference to a glock can sleep and the examiners are called in a RCU critical
section with spin locks held, so examiners that didn't need the extra reference
had to drop it asynchronously via gfs2_glock_queue_put or similar. This wasn't
done correctly in thaw_glock which did call gfs2_glock_put, and not at all in
dump_glock_func.
Change glock_hash_walk to not take glock references at all. That way, the
examiners that don't need them won't have to bother with slow asynchronous
puts, and the examiners that do need references can take them themselves.
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_inode_lookup and gfs2_create_inode, we're calling
gfs2_cancel_delete_work which currently cancels any remote delete work
(delete_work_func) synchronously. This means that if the work is
currently running, it will wait for it to finish. We're doing this to
prevent a previous instance of an inode from having any influence on the
next instance.
However, delete_work_func uses gfs2_inode_lookup internally, and we can
end up in a deadlock when delete_work_func gets interrupted at the wrong
time. For example,
(1) An inode's iopen glock has delete work queued, but the inode
itself has been evicted from the inode cache.
(2) The delete work is preempted before reaching gfs2_inode_lookup.
(3) Another process recreates the inode (gfs2_create_inode). It tries
to cancel any outstanding delete work, which blocks waiting for
the ongoing delete work to finish.
(4) The delete work calls gfs2_inode_lookup, which blocks waiting for
gfs2_create_inode to instantiate and unlock the new inode =>
deadlock.
It turns out that when the delete work notices that its inode has been
re-instantiated, it will do nothing. This means that it's safe to
cancel the delete work asynchronously. This prevents the kind of
deadlock described above.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function gfs2_create_inode called glock_set_object to
set the gl_object for inode and iopen glocks before the glock was locked.
That's wrong because other competing processes like evict may be
blocked waiting for the glock and still have gl_object set before the
actual eviction can take place.
This patch moves the call to glock_set_object to after the glock is
acquired in function gfs2_create_inode, so it waits for possibly
competing evicts to finish their processing first.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The new GLF_INSTANTIATE_NEEDED flag obsoletes the old rgrp flag
GFS2_RDF_UPTODATE, so this patch replaces it like we did with inodes.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
With the addition of the new GLF_INSTANTIATE_NEEDED flag, the
GIF_INVALID flag is now redundant. This patch removes it.
Since inode_instantiate is only called when instantiation is needed,
the check in inode_instantiate is removed too.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, when a glock was locked, the very first holder on the
queue would unlock the lockref and call the go_instantiate glops function
(if one existed), unless GL_SKIP was specified. When we introduced the new
node-scope concept, we allowed multiple holders to lock glocks in EX mode
and share the lock.
But node-scope introduced a new problem: if the first holder has GL_SKIP
and the next one does NOT, since it is not the first holder on the queue,
the go_instantiate op was not called. Eventually the GL_SKIP holder may
call the instantiate sub-function (e.g. gfs2_rgrp_bh_get) but there was
still a window of time in which another non-GL_SKIP holder assumes the
instantiate function had been called by the first holder. In the case of
rgrp glocks, this led to a NULL pointer dereference on the buffer_heads.
This patch tries to fix the problem by introducing two new glock flags:
GLF_INSTANTIATE_NEEDED, which keeps track of when the instantiate function
needs to be called to "fill in" or "read in" the object before it is
referenced.
GLF_INSTANTIATE_IN_PROG, which indicates that a process is currently
reading in the object. Whenever a function needs to
reference the object, it checks the GLF_INSTANTIATE_NEEDED flag, and if
set, it sets GLF_INSTANTIATE_IN_PROG and calls the glops "go_instantiate"
function.
As before, the gl_lockref spin_lock is unlocked during the IO operation,
which may take a relatively long amount of time to complete. While
unlocked, if another process determines go_instantiate is still needed,
it sees GLF_INSTANTIATE_IN_PROG is set, and waits for the go_instantiate
glop operation to be completed. Once GLF_INSTANTIATE_IN_PROG is cleared,
it needs to check GLF_INSTANTIATE_NEEDED again because the other process's
go_instantiate operation may not have been successful.
Functions that previously called the instantiate sub-functions now call
directly into gfs2_instantiate so the new bits are managed properly.
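The flag handshake described above can be sketched in user space roughly
as follows. This is only a model under stated assumptions: struct
model_glock, model_instantiate and the model_read_* helpers are
hypothetical names, a pthread mutex and condition variable stand in for
the gl_lockref spin lock and the kernel's wait mechanism, and none of it
is the actual gfs2_instantiate code.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel's struct gfs2_glock. */
struct model_glock {
	pthread_mutex_t lock;      /* stands in for the gl_lockref spin lock */
	pthread_cond_t done;       /* signalled when IN_PROG is cleared */
	bool instantiate_needed;   /* models GLF_INSTANTIATE_NEEDED */
	bool instantiate_in_prog;  /* models GLF_INSTANTIATE_IN_PROG */
	int reads;                 /* how often go_instantiate ran */
	int (*go_instantiate)(struct model_glock *gl);
};

/* Two stand-in instantiate operations: one succeeds, one fails. */
int model_read_ok(struct model_glock *gl)   { gl->reads++; return 0; }
int model_read_fail(struct model_glock *gl) { gl->reads++; return -1; }

void model_glock_init(struct model_glock *gl,
		      int (*op)(struct model_glock *gl))
{
	pthread_mutex_init(&gl->lock, NULL);
	pthread_cond_init(&gl->done, NULL);
	gl->instantiate_needed = true;
	gl->instantiate_in_prog = false;
	gl->reads = 0;
	gl->go_instantiate = op;
}

int model_instantiate(struct model_glock *gl)
{
	int ret = 0;

	pthread_mutex_lock(&gl->lock);
	while (gl->instantiate_needed) {
		if (gl->instantiate_in_prog) {
			/* Someone else is reading the object in: wait,
			 * then recheck NEEDED, since they may have failed. */
			pthread_cond_wait(&gl->done, &gl->lock);
			continue;
		}
		gl->instantiate_in_prog = true;
		pthread_mutex_unlock(&gl->lock); /* unlocked during "I/O" */
		ret = gl->go_instantiate(gl);
		pthread_mutex_lock(&gl->lock);
		if (ret == 0)
			gl->instantiate_needed = false;
		gl->instantiate_in_prog = false;
		pthread_cond_broadcast(&gl->done);
		break;
	}
	pthread_mutex_unlock(&gl->lock);
	return ret;
}
```

In this model a failed read leaves the NEEDED flag set, so the next
caller (or a waiter that wakes up) retries the instantiation itself
instead of assuming the object is valid.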
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function do_promote had a section of code that did
the actual instantiation. This patch splits that off into its own
function, gfs2_instantiate, which prepares us for the next patch that
will use that function.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch further simplifies function do_promote by eliminating some
redundant code in favor of using a lock_released flag. This is just
prep work for a future patch.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch simply re-factors function do_promote to reduce the indents.
The logic should be unchanged. This makes future patches more readable.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Remove the 'first' argument of trace_gfs2_promote: with GL_SKIP, the
'first' holder isn't the one that instantiates the glock
(go_instantiate), which is what the 'first' flag was apparently supposed
to indicate.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, the go_lock glock operations (glops) did not do
any actual locking. They were used to instantiate objects, like reading
in dinodes and rgrps from the media.
This patch renames the functions to go_instantiate for clarity.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, failed consistency checks printed out the object
that failed, but not the object's glock. This patch makes it also
print out the object glock so we can see the glock's holders and flags
to aid with debugging.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, if function gfs2_inode_lookup encountered an error
after it had locked the iopen glock, it never unlocked it, relying on
the evict code to do the cleanup. The evict code then took the
inode glock while holding the iopen glock, which violates the locking
order. For example,
(1) node A does a gfs2_inode_lookup that fails, leaving the iopen glock
locked.
(2) node B calls delete_work_func -> gfs2_lookup_by_inum ->
gfs2_inode_lookup. It locks the inode glock and blocks trying to
lock the iopen glock, which is held by node A.
(3) node A eventually calls gfs2_evict_inode -> evict_should_delete.
It blocks trying to lock the inode glock, which is now held by
node B.
This patch introduces error handling to function gfs2_inode_lookup
so it properly dequeues held iopen glocks on errors.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, when a glock was locked by function gfs2_glock_nq_init,
it initialized the holder gh_ip (return address) as gfs2_glock_nq_init.
That made it extremely difficult to track down problems because many
functions call gfs2_glock_nq_init. This patch changes the function so
that it saves gh_ip from the caller of gfs2_glock_nq_init, which makes
it easy to backtrack which holder took the lock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function do_gfs2_set_flags checked if the append
and immutable flags were being set while already set. If so, error -EPERM
was given. There's no reason why these two flags should be mutually
exclusive, and if you set them separately, you will, in essence, set
one while it is already set. For example:
chattr +a /mnt/gfs2/file1
chattr +i /mnt/gfs2/file1
The first command sets the append-only flag. Since they are additive,
the second command sets the immutable flag AND append-only flag,
since they both coexist in i_diskflags. So the second command should
not return an error. This bug caused xfstests generic/545 to fail.
This patch simply removes the invalid checks.
I also eliminated an unused parm from do_gfs2_set_flags.
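To illustrate why the removed checks misfire, here is a small
stand-alone model. It is a simplification under stated assumptions:
model_set_flags, the MODEL_FLAG_* values and the old_checks switch are
hypothetical, and the real do_gfs2_set_flags performs other checks that
are omitted here. The model assumes, as with chattr, that the caller
reads the current flags and writes back the complete new set.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this model. */
#define MODEL_FLAG_IMMUTABLE 0x1u
#define MODEL_FLAG_APPEND    0x2u

/*
 * Replace *diskflags with new_flags. With old_checks set, mimic the
 * pre-patch behaviour: refuse if a flag that is already set is also
 * present in the new set.
 */
int model_set_flags(uint32_t *diskflags, uint32_t new_flags, int old_checks)
{
	if (new_flags == *diskflags)
		return 0;	/* nothing to change */

	if (old_checks) {
		if ((*diskflags & MODEL_FLAG_IMMUTABLE) &&
		    (new_flags & MODEL_FLAG_IMMUTABLE))
			return -EPERM;
		if ((*diskflags & MODEL_FLAG_APPEND) &&
		    (new_flags & MODEL_FLAG_APPEND))
			return -EPERM;
	}

	*diskflags = new_flags;
	return 0;
}
```

With append-only already set, requesting append-only plus immutable
fails with -EPERM under the old checks and succeeds without them, which
corresponds to the two chattr commands shown above.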
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In rgrp.c, there are several places where it does BUG_ON. This tells us
the call stack but nothing more, which is not very helpful.
This patch switches them to GLOCK_BUG_ON which also prints the glock,
its holders, and many of the rgrp values, which will help us debug
problems in the future.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, each individual "go_lock" glock operation (glop)
checked the GL_SKIP flag, and if set, would skip further processing.
This patch changes the logic so the go_lock caller, function do_promote,
checks the GL_SKIP flag before calling the go_lock op in the first place.
This avoids having to unnecessarily unlock gl_lockref.lock only to
re-lock it again.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Somehow, the GL_SKIP flag was missed when dumping glock holders.
This patch adds it to function hflags2str. I added it at the end because
I wanted Holder and Skip flags together to read "Hs" rather than "sH"
to avoid confusion with "Shared" ("SH") holder state.
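The ordering concern can be shown with a reduced model of the flag
dump. This is only a sketch: model_hflags2str and the MODEL_* bit
values are hypothetical, and the real hflags2str handles many more
flags than the two shown here.

```c
/* Hypothetical bit values, chosen only for this model. */
#define MODEL_HIF_HOLDER 0x1u
#define MODEL_GL_SKIP    0x2u

/*
 * Write one character per set flag into buf (which must hold at least
 * three bytes) and return buf. The Skip flag is emitted last so that
 * Holder plus Skip reads "Hs" rather than "sH".
 */
const char *model_hflags2str(char *buf, unsigned int flags)
{
	char *p = buf;

	if (flags & MODEL_HIF_HOLDER)
		*p++ = 'H';
	if (flags & MODEL_GL_SKIP)
		*p++ = 's';
	*p = '\0';
	return buf;
}
```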
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_rgrp_go_lock checked if GL_SKIP and
ar_rgrplvb were both true. However, GL_SKIP is only set for rgrps if
ar_rgrplvb is true (see gfs2_inplace_reserve). This patch simply removes
the redundant check.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Also disable page faults during direct I/O requests and implement a
similar kind of retry logic as in the buffered I/O case.
The retry logic in the direct I/O case differs from the buffered I/O
case in the following way: direct I/O doesn't provide the kinds of
consistency guarantees between concurrent reads and writes that buffered
I/O provides, so once we lose the inode glock while faulting in user
pages, we always resume the operation. We never need to return a
partial read or write.
This locking problem was originally reported by Jan Kara. Linus came up
with the idea of disabling page faults. Many thanks to Al Viro and
Matthew Wilcox for their feedback.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred. When the request succeeds, we
report that done_before additional bytes were transferred. This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.
We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted. The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already transferred.
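The calling pattern this enables might look roughly like the
stand-alone model below. Everything in it is an assumption made for
illustration: struct model_req, model_dio_rw, model_fault_in and
model_direct_io are hypothetical, the "resident" counter stands in for
user pages that can be accessed without a page fault, and the real
iomap_dio_rw has a different signature.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical request state, loosely playing the role of an iov_iter. */
struct model_req {
	size_t pos;       /* bytes transferred so far */
	size_t len;       /* total request length */
	size_t resident;  /* leading bytes of the buffer that are faulted in */
};

/*
 * Transfer as much as possible without "faulting". On success, report
 * the bytes transferred now plus done_before; return -EFAULT if no
 * progress could be made.
 */
ssize_t model_dio_rw(struct model_req *req, size_t done_before)
{
	size_t avail = req->resident > req->pos ? req->resident - req->pos : 0;
	size_t want = req->len - req->pos;
	size_t n = avail < want ? avail : want;

	if (n == 0 && want > 0)
		return -EFAULT;
	req->pos += n;
	return (ssize_t)(done_before + n);
}

/* Pretend to fault in up to chunk more bytes of the user buffer. */
void model_fault_in(struct model_req *req, size_t chunk)
{
	req->resident += chunk;
	if (req->resident > req->len)
		req->resident = req->len;
}

/* Caller: retry after each fault, passing in what was already done. */
ssize_t model_direct_io(struct model_req *req, size_t chunk)
{
	size_t done = 0;
	ssize_t ret;

	for (;;) {
		size_t before = req->resident;

		ret = model_dio_rw(req, done);
		if (ret > 0)
			done = (size_t)ret;
		if (req->pos == req->len)
			break;
		model_fault_in(req, chunk);
		if (req->resident == before)
			break;	/* faulting in made no progress */
	}
	return done ? (ssize_t)done : ret;
}
```

Because each call reports its own bytes plus done_before, the value
returned by the final call is already the total for the whole request.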
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inode glock. There is a possibility
that the memory is mapped to the same file, in which case we'd recurse
on the same glock.
We could detect and work around this simple case of recursive locking,
but more complex scenarios exist that involve multiple glocks,
processes, and cluster nodes, and working around all of those cases
isn't practical or even possible.
Avoid these kinds of problems by disabling page faults while holding the
inode glock. If a page fault would occur, we either end up with a
partial read or write or with -EFAULT if nothing could be read or
written. In either case, we know that we're not done with the
operation, so we indicate that we're willing to give up the inode glock
and then we fault in the missing pages. If that made us lose the inode
glock, we return a partial read or write. Otherwise, we resume the
operation.
This locking problem was originally reported by Jan Kara. Linus came up
with the idea of disabling page faults. Many thanks to Al Viro and
Matthew Wilcox for their feedback.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Now that gfs2_file_buffered_write is the only remaining user of
ip->i_gh, we can move the glock holder to the stack (or rather, use the
one we already have on the stack); there is no need for keeping the
holder in the inode anymore.
This is slightly complicated by the fact that we're using ip->i_gh for
the statfs inode in gfs2_file_buffered_write as well. Writing to the
statfs inode isn't very common, so allocate the statfs holder
dynamically when needed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
So far, for buffered writes, we were taking the inode glock in
gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
not holding the inode glock while iomap_write_actor faults in user
pages. It turns out that iomap_write_actor is called inside iomap_begin
... iomap_end, so the user pages were still faulted in while holding the
inode glock and the locking code in iomap_begin / iomap_end was
completely pointless.
Move the locking into gfs2_file_buffered_write instead. We'll take care
of the potential deadlocks due to faulting in user pages while holding a
glock in a subsequent patch.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
will allow glocks to be demoted automatically on locking conflicts.
When a locking request comes in that isn't compatible with the locking
state of an active holder and that holder has the HIF_MAY_DEMOTE flag
set, the holder will be demoted before the incoming locking request is
granted.
Note that this mechanism demotes active holders (with the HIF_HOLDER
flag set), while before we were only demoting glocks without any active
holders. This allows processes to keep hold of locks that may form a
cyclic locking dependency; the core glock logic will then break those
dependencies in case a conflicting locking request occurs. We'll use
this to avoid giving up the inode glock proactively before faulting in
pages.
Processes that allow a glock holder to be taken away indicate this by
calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag.
Later, they call gfs2_holder_disallow_demote() to clear the flag again,
and then they check if their holder is still queued: if it is, they are
still holding the glock; if it isn't, they can re-acquire the glock (or
abort).
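The allow/disallow pattern can be modelled in a few lines. This is a
single-threaded sketch with hypothetical names (struct model_holder and
the model_* functions); it ignores the glock state machine and only
shows how a holder finds out whether it was demoted while the flag was
set.

```c
#include <stdbool.h>

/* Hypothetical holder state; not the kernel's struct gfs2_holder. */
struct model_holder {
	bool queued;      /* holder is still queued on the glock */
	bool may_demote;  /* models HIF_MAY_DEMOTE */
};

void model_allow_demote(struct model_holder *gh)    { gh->may_demote = true; }
void model_disallow_demote(struct model_holder *gh) { gh->may_demote = false; }

/*
 * A conflicting locking request arrives. It is granted only if the
 * active holder may be demoted, in which case the holder is dequeued.
 * Returns true if the request was granted.
 */
bool model_conflicting_request(struct model_holder *gh)
{
	if (gh->queued && !gh->may_demote)
		return false;	/* requester has to wait */
	gh->queued = false;	/* holder was demoted */
	return true;
}

/*
 * Caller side: allow demotion around an operation that may block (for
 * example faulting in pages), then check whether the glock is still
 * held. Returns true if the holder survived.
 */
bool model_fault_in_with_glock(struct model_holder *gh, bool conflict)
{
	model_allow_demote(gh);
	if (conflict)
		model_conflicting_request(gh);
	model_disallow_demote(gh);
	return gh->queued;
}
```

When model_fault_in_with_glock returns false, the caller in this model
would re-acquire the glock or abort, as described above.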
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pass the first current glock holder into function may_grant and
deobfuscate the logic there.
While at it, switch from BUG_ON to GLOCK_BUG_ON in may_grant. To make
that build cleanly, de-constify the may_grant arguments.
We're now using function find_first_holder in do_promote, so move the
function's definition above do_promote.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a wrapper around iomap_file_buffered_write. We'll add code for when
the operation needs to be retried here later.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>