2018-06-06 10:42:14 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2005-11-02 11:58:39 +08:00
|
|
|
* Copyright (c) 2000,2005 Silicon Graphics, Inc.
|
|
|
|
* All Rights Reserved.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
#include "xfs.h"
|
2005-11-02 11:38:42 +08:00
|
|
|
#include "xfs_fs.h"
|
2013-10-23 07:36:05 +08:00
|
|
|
#include "xfs_shared.h"
|
2013-10-23 07:51:50 +08:00
|
|
|
#include "xfs_format.h"
|
2013-10-23 07:50:10 +08:00
|
|
|
#include "xfs_log_format.h"
|
xfs: pin inode backing buffer to the inode log item
When we dirty an inode, we are going to have to write it to disk at
some point in the near future. This requires the inode cluster
backing buffer to be present in memory. Unfortunately, under severe
memory pressure we can reclaim the inode backing buffer while the
inode is dirty in memory, resulting in stalling the AIL pushing
because it has to do a read-modify-write cycle on the cluster
buffer.
When we have no memory available, the read of the cluster buffer
blocks the AIL pushing process, and this causes all sorts of issues
for memory reclaim as it requires inode writeback to make forwards
progress. Allocating a cluster buffer causes more memory pressure,
and results in more cluster buffers to be reclaimed, resulting in
more RMW cycles to be done in the AIL context and everything then
backs up on AIL progress. Only the synchronous inode cluster
writeback in the inode reclaim code provides some level of
forwards progress guarantees that prevent OOM-killer rampages in
this situation.
Fix this by pinning the inode backing buffer to the inode log item
when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
This may mean the first modification of an inode that has been held
in cache for a long time may block on a cluster buffer read, but
we can do that in transaction context and block safely until the
buffer has been allocated and read.
Once we have the cluster buffer, the inode log item takes a
reference to it, pinning it in memory, and attaches it to the log
item for future reference. This means we can always grab the cluster
buffer from the inode log item when we need it.
When the inode is finally cleaned and removed from the AIL, we can
drop the reference the inode log item holds on the cluster buffer.
Once all inodes on the cluster buffer are clean, the cluster buffer
will be unpinned and it will be available for memory reclaim to
reclaim again.
This avoids the issues with needing to do RMW cycles in the AIL
pushing context, and hence allows complete non-blocking inode
flushing to be performed by the AIL pushing context.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-06-30 05:49:15 +08:00
|
|
|
#include "xfs_trans_resv.h"
|
|
|
|
#include "xfs_mount.h"
|
2005-04-17 06:20:36 +08:00
|
|
|
#include "xfs_inode.h"
|
2013-10-23 07:50:10 +08:00
|
|
|
#include "xfs_trans.h"
|
2005-11-02 11:38:42 +08:00
|
|
|
#include "xfs_trans_priv.h"
|
|
|
|
#include "xfs_inode_item.h"
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-12-11 19:35:19 +08:00
|
|
|
#include <linux/iversion.h>
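The buffer pinning scheme described in the "xfs: pin inode backing buffer to the inode log item" commit message can be sketched in userspace as a simple reference count. This is a minimal model, not the kernel implementation: `toy_buf`, `toy_inode_item` and the three helper functions are hypothetical names, and locking, the AIL and real I/O are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode cluster buffer. */
struct toy_buf {
	int hold_count;		/* references keeping the buffer in memory */
};

/* Hypothetical stand-in for an inode log item. */
struct toy_inode_item {
	struct toy_buf *buf;	/* pinned backing buffer, NULL while clean */
};

/* First dirtying takes a reference; relogging reuses the attached buffer. */
static void toy_log_inode(struct toy_inode_item *item, struct toy_buf *bp)
{
	if (item->buf)
		return;
	bp->hold_count++;
	item->buf = bp;
}

/* Cleaning the inode drops the reference the log item held. */
static void toy_inode_clean(struct toy_inode_item *item)
{
	if (!item->buf)
		return;
	assert(item->buf->hold_count > 0);
	item->buf->hold_count--;
	item->buf = NULL;
}

/* Reclaim may only free a buffer that no dirty inode still pins. */
static bool toy_buf_reclaimable(const struct toy_buf *bp)
{
	return bp->hold_count == 0;
}
```

In this model a buffer shared by two dirty inodes only becomes reclaimable after both have been cleaned, which mirrors the "once all inodes on the cluster buffer are clean" condition in the commit message.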
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2010-06-24 09:36:58 +08:00
|
|
|
* Add a locked inode to the transaction.
|
|
|
|
*
|
|
|
|
* The inode must be locked, and it cannot be associated with any transaction.
|
2011-09-19 23:00:54 +08:00
|
|
|
* If lock_flags is non-zero the inode will be unlocked on transaction commit.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
xfs_trans_ijoin(
|
2010-06-24 09:36:58 +08:00
|
|
|
struct xfs_trans *tp,
|
2011-09-19 23:00:54 +08:00
|
|
|
struct xfs_inode *ip,
|
|
|
|
uint lock_flags)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2020-05-01 03:52:19 +08:00
|
|
|
struct xfs_inode_log_item *iip;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2008-04-22 15:34:00 +08:00
|
|
|
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
|
2005-04-17 06:20:36 +08:00
|
|
|
if (ip->i_itemp == NULL)
|
|
|
|
xfs_inode_item_init(ip, ip->i_mount);
|
|
|
|
iip = ip->i_itemp;
|
2011-09-19 23:00:54 +08:00
|
|
|
|
2010-06-24 09:36:58 +08:00
|
|
|
ASSERT(iip->ili_lock_flags == 0);
|
2011-09-19 23:00:54 +08:00
|
|
|
iip->ili_lock_flags = lock_flags;
|
xfs: Don't allow logging of XFS_ISTALE inodes
In tracking down a problem in this patchset, I discovered we are
reclaiming dirty stale inodes. This wasn't discovered until inodes
were always attached to the cluster buffer and then the rcu callback
that freed inodes was assert failing because the inode still had an
active pointer to the cluster buffer after it had been reclaimed.
Debugging the issue indicated that this was a pre-existing issue
resulting from the way the inodes are handled in xfs_inactive_ifree.
When we free a cluster buffer from xfs_ifree_cluster, all the inodes
in cache are marked XFS_ISTALE. Those that are clean have nothing
else done to them and so eventually get cleaned up by background
reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
XFS_ISTALE.
On journal commit, dirty stale inodes are handled by both buffer
and inode log items: they are either run through xfs_istale_done()
and removed from the AIL (buffer log item commit), or the inode log
item simply unpins them because the buffer log item will clean them.
What happens
to any specific inode is entirely dependent on which log item wins
the commit race, but the result is the same - stale inodes are
clean, not attached to the cluster buffer, and not in the AIL. Hence
inode reclaim can just free these inodes without further care.
However, if the stale inode is relogged, it gets dirtied again and
relogged into the CIL. Most of the time this isn't an issue, because
relogging simply changes the inode's location in the current
checkpoint. Problems arise, however, when the CIL checkpoints
between two transactions in the xfs_inactive_ifree() deferops
processing. This results in the XFS_ISTALE inode being redirtied
and inserted into the CIL without any of the other stale cluster
buffer infrastructure being in place.
Hence on journal commit, it simply gets unpinned, so it remains
dirty in memory. Everything in inode writeback avoids XFS_ISTALE
inodes so it can't be written back, and it is not tracked in the AIL
so there's not even a trigger to attempt to clean the inode. Hence
the inode just sits dirty in memory until inode reclaim comes along,
sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
of a dirty inode caused use after free, list corruptions and other
nasty issues later in this patchset.
Hence this patch addresses a violation of the "never log XFS_ISTALE
inodes" rule caused by the deferops processing rolling a transaction
and relogging a stale inode in xfs_inactive_ifree(). It also adds a
bunch of asserts to catch this problem in debug kernels so that
we don't reintroduce this problem in future.
Reproducer for this issue was generic/558 on a v4 filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-06-30 05:48:45 +08:00
|
|
|
ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get a log_item_desc to point at the new item.
|
|
|
|
*/
|
2010-06-23 16:11:15 +08:00
|
|
|
xfs_trans_add_item(tp, &iip->ili_item);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
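The XFS_ISTALE assertion in xfs_trans_ijoin() above enforces the rule that a stale inode is never logged again. A minimal userspace sketch of that invariant follows. The flag values and the `toy_` names are assumptions made for illustration, and the sketch returns a status where the kernel uses a debug-only ASSERT.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ISTALE	(1u << 0)	/* hypothetical: backing cluster was freed */
#define TOY_IDIRTY	(1u << 1)	/* hypothetical: inode has logged changes */

struct toy_inode {
	unsigned int flags;
};

/* Refuse to dirty a stale inode; otherwise mark it dirty. */
static bool toy_log_inode(struct toy_inode *ip)
{
	if (ip->flags & TOY_ISTALE)
		return false;
	ip->flags |= TOY_IDIRTY;
	return true;
}

/* A stale inode can be freed without further care only if it is clean. */
static bool toy_stale_reclaim_safe(const struct toy_inode *ip)
{
	return (ip->flags & TOY_ISTALE) && !(ip->flags & TOY_IDIRTY);
}
```

Because the guard never lets a stale inode become dirty, reclaim in this model can always free a stale inode directly, which is the assumption the commit message says was being violated.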
|
|
|
|
|
2010-09-28 10:27:25 +08:00
|
|
|
/*
|
|
|
|
* Transactional inode timestamp update. Requires the inode to be locked and
|
|
|
|
* joined to the transaction supplied. Relies on the transaction subsystem to
|
|
|
|
* track dirty state and update/writeback the inode accordingly.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
xfs_trans_ichgtime(
|
|
|
|
struct xfs_trans *tp,
|
|
|
|
struct xfs_inode *ip,
|
|
|
|
int flags)
|
|
|
|
{
|
|
|
|
struct inode *inode = VFS_I(ip);
|
2019-11-13 00:20:42 +08:00
|
|
|
struct timespec64 tv;
|
2010-09-28 10:27:25 +08:00
|
|
|
|
|
|
|
ASSERT(tp);
|
|
|
|
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
|
|
|
|
|
2016-09-14 22:48:06 +08:00
|
|
|
tv = current_time(inode);
|
2010-09-28 10:27:25 +08:00
|
|
|
|
2016-02-09 13:54:58 +08:00
|
|
|
if (flags & XFS_ICHGTIME_MOD)
|
2010-09-28 10:27:25 +08:00
|
|
|
inode->i_mtime = tv;
|
2016-02-09 13:54:58 +08:00
|
|
|
if (flags & XFS_ICHGTIME_CHG)
|
2010-09-28 10:27:25 +08:00
|
|
|
inode->i_ctime = tv;
|
2019-11-13 00:20:42 +08:00
|
|
|
if (flags & XFS_ICHGTIME_CREATE)
|
2021-03-30 02:11:45 +08:00
|
|
|
ip->i_crtime = tv;
|
2010-09-28 10:27:25 +08:00
|
|
|
}
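The flag-driven update in xfs_trans_ichgtime() can be illustrated outside the kernel. The sketch below is a simplified model under stated assumptions: the `TOY_ICHGTIME_*` values, `toy_time` and `toy_inode` are hypothetical, and the current time is passed in as a parameter instead of being read with current_time(), so the behaviour is deterministic.

```c
#include <assert.h>

#define TOY_ICHGTIME_MOD	0x1	/* hypothetical: update modification time */
#define TOY_ICHGTIME_CHG	0x2	/* hypothetical: update change time */
#define TOY_ICHGTIME_CREATE	0x4	/* hypothetical: update creation time */

struct toy_time {
	long long sec;
	long nsec;
};

struct toy_inode {
	struct toy_time mtime;
	struct toy_time ctime;
	struct toy_time crtime;
};

/* Apply one sampled timestamp to every field selected by flags. */
static void toy_ichgtime(struct toy_inode *ip, int flags, struct toy_time now)
{
	if (flags & TOY_ICHGTIME_MOD)
		ip->mtime = now;
	if (flags & TOY_ICHGTIME_CHG)
		ip->ctime = now;
	if (flags & TOY_ICHGTIME_CREATE)
		ip->crtime = now;
}
```

Sampling the time once and reusing it means all fields updated in one call carry the same value, as in the function above where `tv` is read a single time.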
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2020-06-30 05:49:15 +08:00
|
|
|
* This is called to mark the fields indicated in fieldmask as needing to be
|
|
|
|
* logged when the transaction is committed. The inode must already be
|
|
|
|
* associated with the given transaction.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2020-06-30 05:49:15 +08:00
|
|
|
* The values for fieldmask are defined in xfs_inode_item.h. We always log all
|
|
|
|
* of the core inode if any of it has changed, and we always log all of the
|
|
|
|
* inline data/extents/b-tree root if any of them has changed.
|
|
|
|
*
|
|
|
|
* Grab and pin the cluster buffer associated with this inode to avoid RMW
|
|
|
|
* cycles at inode writeback time. Avoid the need to add error handling to every
|
|
|
|
* xfs_trans_log_inode() call by shutting down on read error. This will cause
|
|
|
|
* transactions to fail and everything to error out, just like if we return a
|
|
|
|
* read error in a dirty transaction and cancel it.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
xfs_trans_log_inode(
|
2020-06-30 05:48:46 +08:00
|
|
|
struct xfs_trans *tp,
|
|
|
|
struct xfs_inode *ip,
|
|
|
|
uint flags)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2020-06-30 05:48:46 +08:00
|
|
|
struct xfs_inode_log_item *iip = ip->i_itemp;
|
|
|
|
struct inode *inode = VFS_I(ip);
|
|
|
|
uint iversion_flags = 0;
|
2018-03-07 09:04:00 +08:00
|
|
|
|
2020-06-30 05:48:46 +08:00
|
|
|
ASSERT(iip);
|
2008-04-22 15:34:00 +08:00
|
|
|
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
|
2020-06-30 05:48:45 +08:00
|
|
|
ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-06-30 05:48:46 +08:00
|
|
|
tp->t_flags |= XFS_TRANS_DIRTY;
|
|
|
|
|
2018-03-07 09:04:00 +08:00
|
|
|
/*
|
|
|
|
* Don't bother with i_lock for the I_DIRTY_TIME check here, as races
|
|
|
|
* don't matter - we either will need an extra transaction in 24 hours
|
|
|
|
* to log the timestamps, or will clear already cleared fields in the
|
|
|
|
* worst case.
|
|
|
|
*/
|
2020-05-29 22:24:43 +08:00
|
|
|
if (inode->i_state & I_DIRTY_TIME) {
|
2018-03-07 09:04:00 +08:00
|
|
|
spin_lock(&inode->i_lock);
|
2020-05-29 22:24:43 +08:00
|
|
|
inode->i_state &= ~I_DIRTY_TIME;
|
2018-03-07 09:04:00 +08:00
|
|
|
spin_unlock(&inode->i_lock);
|
|
|
|
}
|
|
|
|
|
2013-06-27 14:04:59 +08:00
|
|
|
/*
|
|
|
|
* First time we log the inode in a transaction, bump the inode change
|
2017-12-11 19:35:23 +08:00
|
|
|
* counter if it is configured for this to occur. While we have the
|
|
|
|
* inode locked exclusively for metadata modification, we can usually
|
|
|
|
* avoid setting XFS_ILOG_CORE if no one has queried the value since
|
|
|
|
* the last time it was incremented. If we have XFS_ILOG_CORE already
|
|
|
|
* set however, then go ahead and bump the i_version counter
|
|
|
|
* unconditionally.
|
2013-06-27 14:04:59 +08:00
|
|
|
*/
|
2020-06-30 05:48:46 +08:00
|
|
|
if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
|
|
|
|
if (IS_I_VERSION(inode) &&
|
|
|
|
inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
|
|
|
|
iversion_flags = XFS_ILOG_CORE;
|
2013-06-27 14:04:59 +08:00
|
|
|
}
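The "bump the change counter only on the first log in a transaction" logic just above can be modelled in a few lines. This is a hedged sketch, not the kernel code: `toy_maybe_inc_version()` only imitates the behaviour the comment describes (increment when forced or when the value has been queried since the last bump), a plain boolean replaces the atomic test_and_set_bit() on XFS_LI_DIRTY, the IS_I_VERSION() check is omitted, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ILOG_CORE	0x1u	/* hypothetical: log the inode core */

struct toy_inode {
	uint64_t version;	/* change counter */
	bool queried;		/* someone read version since the last bump */
};

struct toy_item {
	bool dirty;		/* already logged in this transaction */
};

/* Increment only when forced or when the old value was observed. */
static bool toy_maybe_inc_version(struct toy_inode *ip, bool force)
{
	if (!force && !ip->queried)
		return false;
	ip->version++;
	ip->queried = false;
	return true;
}

/* Returns extra log flags needed because the counter changed. */
static unsigned int toy_first_log(struct toy_item *item, struct toy_inode *ip,
				  unsigned int flags)
{
	unsigned int extra = 0;

	if (!item->dirty) {
		item->dirty = true;
		if (toy_maybe_inc_version(ip, (flags & TOY_ILOG_CORE) != 0))
			extra = TOY_ILOG_CORE;
	}
	return extra;
}
```

The point of the design is that an unobserved counter does not need to change, so a timestamp-only update can avoid logging the whole inode core.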
|
|
|
|
|
2020-08-18 00:59:07 +08:00
|
|
|
/*
|
|
|
|
* If we're updating the inode core or the timestamps and it's possible
|
|
|
|
* to upgrade this inode to bigtime format, do so now.
|
|
|
|
*/
|
|
|
|
if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
|
|
|
|
xfs_sb_version_hasbigtime(&ip->i_mount->m_sb) &&
|
|
|
|
!xfs_inode_has_bigtime(ip)) {
|
2021-03-30 02:11:45 +08:00
|
|
|
ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
|
2020-08-18 00:59:07 +08:00
|
|
|
flags |= XFS_ILOG_CORE;
|
|
|
|
}
|
|
|
|
|
xfs: validate extsz hints against rt extent size when rtinherit is set
The RTINHERIT bit can be set on a directory so that newly created
regular files will have the REALTIME bit set to store their data on the
realtime volume. If an extent size hint (and EXTSZINHERIT) are set on
the directory, the hint will also be copied into the new file.
As pointed out in previous patches, for realtime files we require the
extent size hint be an integer multiple of the realtime extent, but we
don't perform the same validation on a directory with both RTINHERIT and
EXTSZINHERIT set, even though the only use-case of that combination is
to propagate extent size hints into new realtime files. This leads to
inode corruption errors when the bad values are propagated.
Because there may be existing filesystems with such a configuration, we
cannot simply amend the inode verifier to trip on these directories and
call it a day because that will cause previously "working" filesystems
to start throwing errors abruptly. Note that it's valid to have
directories with rtinherit set even if there is no realtime volume, in
which case the problem does not manifest because rtinherit is ignored if
there's no realtime device; and it's possible that someone set the flag,
crashed, repaired the filesystem (which clears the hint on the realtime
file) and continued.
Therefore, mitigate this issue in several ways: First, if we try to
write out an inode with both rtinherit/extszinherit set and an unaligned
extent size hint, turn off the hint to correct the error. Second, if
someone tries to misconfigure a directory via the fssetxattr ioctl, fail
the ioctl. Third, reverify both extent size hint values when we
propagate heritable inode attributes from parent to child, to prevent
misconfigurations from spreading.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-05-13 03:51:26 +08:00
|
|
|
/*
|
2021-07-13 03:58:50 +08:00
|
|
|
* Inode verifiers do not check that the extent size hint is an integer
|
|
|
|
* multiple of the rt extent size on a directory with both rtinherit
|
|
|
|
* and extszinherit flags set. If we're logging a directory that is
|
|
|
|
* misconfigured in this way, clear the hint.
|
2021-05-13 03:51:26 +08:00
|
|
|
*/
|
|
|
|
if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
|
|
|
|
(ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
|
|
|
|
(ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
|
|
|
|
ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
|
|
|
|
XFS_DIFLAG_EXTSZINHERIT);
|
|
|
|
ip->i_extsize = 0;
|
|
|
|
flags |= XFS_ILOG_CORE;
|
|
|
|
}
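The hint-clearing check above reduces to a modulo test on the extent size hint. A minimal standalone sketch follows. The flag bit values and the `toy_` names are assumptions chosen for illustration and do not match the on-disk XFS definitions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DIFLAG_RTINHERIT	(1u << 0)	/* hypothetical bit values */
#define TOY_DIFLAG_EXTSIZE	(1u << 1)
#define TOY_DIFLAG_EXTSZINHERIT	(1u << 2)
#define TOY_ILOG_CORE		(1u << 0)

struct toy_inode {
	unsigned int diflags;
	uint32_t extsize;	/* extent size hint, in blocks */
};

/*
 * Clear an inheritable hint that is not a multiple of the realtime
 * extent size. Returns the log flags, with the core flag added when
 * the inode was changed.
 */
static unsigned int toy_fix_extsize_hint(struct toy_inode *ip,
					 uint32_t rextsize,
					 unsigned int flags)
{
	assert(rextsize > 0);

	if ((ip->diflags & TOY_DIFLAG_RTINHERIT) &&
	    (ip->diflags & TOY_DIFLAG_EXTSZINHERIT) &&
	    (ip->extsize % rextsize) > 0) {
		ip->diflags &= ~(TOY_DIFLAG_EXTSIZE | TOY_DIFLAG_EXTSZINHERIT);
		ip->extsize = 0;
		flags |= TOY_ILOG_CORE;
	}
	return flags;
}
```

For example, a hint of 5 blocks with a realtime extent of 4 blocks is cleared, while a hint of 8 blocks is left alone.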
|
|
|
|
|
2020-06-30 05:48:46 +08:00
|
|
|
/*
|
|
|
|
* Record the specific change for fdatasync optimisation. This allows
|
|
|
|
* fdatasync to skip log forces for inodes that are only timestamp
|
|
|
|
* dirty.
|
|
|
|
*/
|
|
|
|
spin_lock(&iip->ili_lock);
|
|
|
|
iip->ili_fsync_fields |= flags;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-06-30 05:49:15 +08:00
|
|
|
if (!iip->ili_item.li_buf) {
|
|
|
|
struct xfs_buf *bp;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We hold the ILOCK here, so this inode is not going to be
|
|
|
|
* flushed while we are here. Further, because there is no
|
|
|
|
* buffer attached to the item, we know that there is no IO in
|
|
|
|
* progress, so nothing will clear the ili_fields while we read
|
|
|
|
* in the buffer. Hence we can safely drop the spin lock and
|
|
|
|
* read the buffer knowing that the state will not change from
|
|
|
|
* here.
|
|
|
|
*/
|
|
|
|
spin_unlock(&iip->ili_lock);
|
2021-03-30 02:11:37 +08:00
|
|
|
error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
|
2020-06-30 05:49:15 +08:00
|
|
|
if (error) {
|
|
|
|
xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need an explicit buffer reference for the log item but
|
|
|
|
* don't want the buffer to remain attached to the transaction.
|
2020-06-30 05:49:18 +08:00
|
|
|
* Hold the buffer but release the transaction reference once
|
|
|
|
* we've attached the inode log item to the buffer log item
|
|
|
|
* list.
|
2020-06-30 05:49:15 +08:00
|
|
|
*/
|
|
|
|
xfs_buf_hold(bp);
|
|
|
|
spin_lock(&iip->ili_lock);
|
|
|
|
iip->ili_item.li_buf = bp;
|
2020-06-30 05:49:18 +08:00
|
|
|
bp->b_flags |= _XBF_INODES;
|
|
|
|
list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
|
|
|
|
xfs_trans_brelse(tp, bp);
|
2020-06-30 05:49:15 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2020-06-30 05:48:46 +08:00
|
|
|
* Always OR in the bits from the ili_last_fields field. This is to
|
2020-09-02 01:55:29 +08:00
|
|
|
* coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
|
|
|
|
* in the eventual clearing of the ili_fields bits. See the big comment
|
|
|
|
* in xfs_iflush() for an explanation of this coordination mechanism.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2020-06-30 05:48:46 +08:00
|
|
|
iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
|
|
|
|
spin_unlock(&iip->ili_lock);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2017-08-29 01:21:03 +08:00
|
|
|
|
|
|
|
int
|
|
|
|
xfs_trans_roll_inode(
|
|
|
|
struct xfs_trans **tpp,
|
|
|
|
struct xfs_inode *ip)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
|
|
|
|
error = xfs_trans_roll(tpp);
|
|
|
|
if (!error)
|
|
|
|
xfs_trans_ijoin(*tpp, ip, 0);
|
|
|
|
return error;
|
|
|
|
}
|
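The reference handoff described in the comment inside xfs_trans_log_inode() (hold the buffer for the log item, then release the transaction's reference) can be sketched in userspace as a plain reference count. This is a minimal illustration only, not the kernel implementation: every `toy_*` name is hypothetical, locking is omitted, and the real xfs_buf lifecycle is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer; NOT struct xfs_buf. */
struct toy_buf {
	int refcount;		/* holders keeping the buffer in memory */
};

/* Hypothetical stand-in for the inode log item; NOT struct xfs_inode_log_item. */
struct toy_log_item {
	struct toy_buf *buf;	/* loosely models ili_item.li_buf */
};

/* Take one more reference on the buffer. */
static void toy_buf_hold(struct toy_buf *bp)
{
	bp->refcount++;
}

/* Drop one reference; at zero the buffer would become reclaimable. */
static void toy_buf_rele(struct toy_buf *bp)
{
	assert(bp->refcount > 0);
	bp->refcount--;
}

/*
 * Handoff: the caller (the "transaction") already holds one reference.
 * Take a second one on behalf of the log item, record the buffer in the
 * log item, then drop the transaction's reference. The net count is
 * unchanged, but the remaining reference now belongs to the log item.
 */
static void toy_attach(struct toy_log_item *lip, struct toy_buf *bp)
{
	toy_buf_hold(bp);	/* explicit reference for the log item */
	lip->buf = bp;
	toy_buf_rele(bp);	/* transaction no longer needs the buffer */
}

/* When the item is clean, drop the log item's reference. */
static void toy_detach(struct toy_log_item *lip)
{
	struct toy_buf *bp = lip->buf;

	lip->buf = NULL;
	toy_buf_rele(bp);
}
```

In this model the hold is taken before the release so that the count never touches zero during the handoff; a buffer whose only reference had just been dropped could otherwise be reclaimed before the log item was attached to it.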