When we are converting local data to an extent format as a result of
adding an attribute, the type of data contained in the local fork
determines the behaviour that needs to occur.
xfs_bmap_add_attrfork_local() already handles the directory data
case specially by using S_ISDIR() and calling out to
xfs_dir2_sf_to_block(), but with verifiers we now need to handle
each different type of metadata specially and different metadata
formats require different verifiers (and eventually block header
initialisation).
There is only a single place that we add and attribute fork to
the inode, but that is in the attribute code and it knows nothing
about the specific contents of the data fork. It is only the case of
local data that is the issue here, so adding code to hadnle this
case in the attribute specific code is wrong. Hence we are really
stuck trying to detect the data fork contents in
xfs_bmap_add_attrfork_local() and performing the correct callout
there.
Luckily the current cases can be determined by S_IS* macros, and we
can push the work off to data specific callouts, but each of those
callouts does a lot of work in common with
xfs_bmap_local_to_extents(). The only reason that this fails for
symlinks right now is is that xfs_bmap_local_to_extents() assumes
the data fork contains extent data, and so attaches a a bmap extent
data verifier to the buffer and simply copies the data fork
information straight into it.
To fix this, allow us to pass a "formatting" callback into
xfs_bmap_local_to_extents() which is responsible for setting the
buffer type, initialising it and copying the data fork contents over
to the new buffer. This allows callers to specify how they want to
format the new buffer (which is necessary for the upcoming CRC
enabled metadata blocks) and hence make xfs_bmap_local_to_extents()
useful for any type of data fork content.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The trylock log force invoked via xfs_buf_item_push() can attempt
to acquire xa_lock, thus leading to a recursion bug when called
with xa_lock held.
This log force was originally added to xfs_buf_trylock() to address
xfsaild stalls due to pinned and stale buffers. Since the addition
of this behavior, the log item pushing code had been reworked to
detect and track pinned items to inform xfsaild to issue a log
force itself when necessary. As such, the log force on trylock
failure is redundant and safe to remove.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The buffer pinned check and trylock sequence in xfs_buf_item_push()
can race with an active transaction on marking the buffer pinned.
This can result in the buffer becoming pinned and stale after the
initial check and the trylock failure, but before the check in
xfs_buf_trylock() that issues a log force. If the log force is
issued from this context, a spinlock recursion occurs on xa_lock.
Prepare xfs_buf_item_push() to handle the race by detecting a
pinned buffer after the trylock failure so xfsaild issues a log
force from a safe context. This, along with various previous fixes,
renders the log force in xfs_buf_trylock() redundant.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Speculative preallocation based on the current file size works well
for contiguous files, but is sub-optimal for sparse files where the
EOF preallocation can fill holes and result in large amounts of
zeros being written when it is not necessary.
The algorithm is modified to prevent EOF speculative preallocation
from triggering larger allocations on IO patterns of
truncate--to-zero-seek-write-seek-write-.... which results in
non-sparse files for large files. This, unfortunately, is the way cp
now behaves when copying sparse files and so needs to be fixed.
What this code does is that it looks at the existing extent adjacent
to the current EOF and if it determines that it is a hole we disable
speculative preallocation altogether. To avoid the next write from
doing a large prealloc, it takes the size of subsequent
preallocations from the current size of the existing EOF extent.
IOWs, if you leave a hole in the file, it resets preallocation
behaviour to the same as if it was a zero size file.
Example new behaviour:
$ xfs_io -f -c "pwrite 0 31m" \
-c "pwrite 33m 1m" \
-c "pwrite 128m 1m" \
-c "fiemap -v" /mnt/scratch/blah
wrote 32505856/32505856 bytes at offset 0
31 MiB, 7936 ops; 0.0000 sec (1.608 GiB/sec and 421432.7439 ops/sec)
wrote 1048576/1048576 bytes at offset 34603008
1 MiB, 256 ops; 0.0000 sec (1.462 GiB/sec and 383233.5329 ops/sec)
wrote 1048576/1048576 bytes at offset 134217728
1 MiB, 256 ops; 0.0000 sec (1.719 GiB/sec and 450704.2254 ops/sec)
/mnt/scratch/blah:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..65535]: 96..65631 65536 0x0
1: [65536..67583]: hole 2048
2: [67584..69631]: 67680..69727 2048 0x0
3: [69632..262143]: hole 192512
4: [262144..264191]: 262240..264287 2048 0x1
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In xfs_ifunlock() there is a call to wake_up_bit() after clearing
the flush lock on the xfs inode. This is not guaranteed to be safe,
as noted in the comments above wake_up_bit() beginning with:
In order for this to function properly, as it uses
waitqueue_active() internally, some kind of memory
barrier must be done prior to calling this.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Currently, we calculate the attribute set transaction
log space reservation at runtime in two parts:
1) XFS_ATTRSET_LOG_RES() which is calcuated out at mount time.
2) ((ext * (mp)->m_sb.sb_sectsize) + \
(ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \
(128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))))
which is calculated out at runtime since it depend on the given extent length in blocks.
This patch renamed XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate
that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp) which would
be used to calculate out the unit of the log space reservation for one block.
In this way, the total runtime space for the given extent length can be figured out by:
XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Make use of XFS_SB_LOG_RES() at xfs_mount_log_sb().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Make use of XFS_SB_LOG_RES() at xfs_log_sbcount().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Introduce a new transaction space reservation XFS_SB_LOG_RES() for
those transactions that need to modify the superblock on disk.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Convert the calculation for end of quotaoff log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Convert the calculation of quota off transaction log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The disk quota allocation log space reservation is calcuated at runtime,
this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For adjusting quota limits transactions, we calculate out the log space
reservation at runtime, this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For the transaction that write the incore superblock changes of quota flags
to disk, it would reserve the same log space to clear/reset quota flags
transaction, hence we can use XFS_TRANS_SBCHANGE_LOG_RES() for it as well.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The transaction log space for clearing/reseting the quota flags
is calculated out at runtime, this patch can figure it out at
mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Refining the existing reservations with xfs_calc_buf_res() in xfs_trans.c
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add a new helper xfs_calc_buf_res() to calcuate out the transaction space
reservations per item. xfs_buf_log_overhead() is used to figure out the
extra space for struct xfs_buf_log_format that gets written into the log
for every buffer as well as a log opheader, i.e. struct xlog_op_header.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Commit fb59581404 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and truncate_pagecache_range() directly.
But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'
Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:
[ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
The trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:
xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
And that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.
The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.
The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.
Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.
In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported. Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This is to fix up a build problem with a wireless driver due to the
dynamic-debug patches in this branch.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The stack_switch check currently occurs in __xfs_bmapi_allocate,
which means the stack switch only occurs when xfs_bmapi_allocate()
is called in a loop. Pull the check up before the loop in
xfs_bmapi_write() such that the first iteration of the loop has
consistent behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Since we are using C99 we have one builtin defined in include/linux/types.h,
use that instead.
v2: you missed one in fs/xfs/xfs_qm_bhv.c, cleaned up. -bpm
Signed-off-by: Thiago Farina <tfarina@chromium.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
9802182 changed the return value from EWRONGFS (aka EINVAL)
to EFSCORRUPTED which doesn't seem to be handled properly by
the root filesystem probe.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Dave Jones hit this assert when doing a compile on recent git, with
CONFIG_XFS_DEBUG enabled:
XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828
Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup)
contained "2" and not the proper offset, and I found that this value was
changed after the memmoves under "Use a stale leaf for our new entry."
in xfs_dir2_block_addname(), i.e.
memmove(&blp[mid + 1], &blp[mid],
(highstale - mid) * sizeof(*blp));
overwrote it.
What has happened is that the previous call to xfs_dir2_block_compact()
has rearranged things; it changes btp->count as well as the
blp array. So after we make that call, we must recalculate the
proper pointer to the leaf entries by making another call to
xfs_dir2_block_leaf_p().
Dave provided a metadump image which led to a simple reproducer
(create a particular filename in the affected directory) and this
resolves the testcase as well as the bug on his live system.
Thanks also to dchinner for looking at this one with me.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The int casts here make it easy to trigger an assert with a large
soft limit. For example, set a >4TB soft limit on an empty volume
to reproduce a (0 > -x) comparison due to an overflow of
d_blk_softlimit.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Per Dave Chinner suggestion, this patch:
1) Corrects the detection of whether a multi-segment buffer is
still tracking data.
2) Clears all the buffer log formats for a multi-segment buffer.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Not every segment in a multi-segment buffer is dirty in a
transaction and they will not be outputted. The assert in
xfs_buf_item_format_segment() that checks for the at least
one chunk of data in the segment to be used is not necessary
true for multi-segmented buffers.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Rename the bli_format structure to __bli_format to avoid
accidently confusing them with the bli_formats pointer.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Commits starting at 77c1a08 introduced a multiple segment support
to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment
buffer in the transaction because it was looking at the single segment
block number rather than the multi-segment b_maps[0].bm.bn. This
results on a recursive buffer lock that can never be satisfied.
This patch:
1) Changed the remaining b_map accesses to be b_maps[0] accesses.
2) Renames the single segment b_map structure to __b_map to avoid
future confusion.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
This patch replaces usages of obsolete simple_strtoul with kstrtoint in
xfs_args and suffix_strtoul.
Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Dave Jones hit this assert when doing a compile on recent git, with
CONFIG_XFS_DEBUG enabled:
XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828
Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup)
contained "2" and not the proper offset, and I found that this value was
changed after the memmoves under "Use a stale leaf for our new entry."
in xfs_dir2_block_addname(), i.e.
memmove(&blp[mid + 1], &blp[mid],
(highstale - mid) * sizeof(*blp));
overwrote it.
What has happened is that the previous call to xfs_dir2_block_compact()
has rearranged things; it changes btp->count as well as the
blp array. So after we make that call, we must recalculate the
proper pointer to the leaf entries by making another call to
xfs_dir2_block_leaf_p().
Dave provided a metadump image which led to a simple reproducer
(create a particular filename in the affected directory) and this
resolves the testcase as well as the bug on his live system.
Thanks also to dchinner for looking at this one with me.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ben Myers <bpm@sgi.com>
Commit 408cc4e97a
added memset(0, ...) to allocation args structures,
so there is no need to explicitly set any of the fields
to 0 after that.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The int casts here make it easy to trigger an assert with a large
soft limit. For example, set a >4TB soft limit on an empty volume
to reproduce a (0 > -x) comparison due to an overflow of
d_blk_softlimit.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Remove the XFS_TRANS_DEBUG routines. They are no longer appropriate
and have not been used in years
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Per Dave Chinner suggestion, this patch:
1) Corrects the detection of whether a multi-segment buffer is
still tracking data.
2) Clears all the buffer log formats for a multi-segment buffer.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Not every segment in a multi-segment buffer is dirty in a
transaction and they will not be outputted. The assert in
xfs_buf_item_format_segment() that checks for the at least
one chunk of data in the segment to be used is not necessary
true for multi-segmented buffers.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Rename the bli_format structure to __bli_format to avoid
accidently confusing them with the bli_formats pointer.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Commits starting at 77c1a08 introduced a multiple segment support
to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment
buffer in the transaction because it was looking at the single segment
block number rather than the multi-segment b_maps[0].bm.bn. This
results on a recursive buffer lock that can never be satisfied.
This patch:
1) Changed the remaining b_map accesses to be b_maps[0] accesses.
2) Renames the single segment b_map structure to __b_map to avoid
future confusion.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Not a bug as such, just warning noise from the xlog_cksum()
returning a __be32 type when it should be returning a __le32 type.
On Wed, Nov 28, 2012 at 08:30:59AM -0500, Christoph Hellwig wrote:
> But why are we storing the crc field little endian while all other on
> disk formats are big endian? (And yes I realize it might as well have
> been me who did that back in the idea, but I still have no idea why)
Because the CRC always returns the calcuation LE format, even on BE
systems. So rather than always having to byte swap it everywhere and
have all the force casts and anootations for sparse, it seems simpler to
just make it a __le32 everywhere....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we fail to get a dquot lock during reclaim, we jump to an error
handler that unlocks the dquot. This is wrong as we didn't lock the
dquot, and unlocking it means who-ever is holding the lock has had
it silently taken away, and hence it results in a lock imbalance.
Found by inspection while modifying the code for the numa-lru
patchset. This fixes a random hang I've been seeing on xfstest 232
for the past several months.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
The direct IO path can do a nested transaction reservation when
writing past the EOF. The first transaction is the append
transaction for setting the filesize at IO completion, but we can
also need a transaction for allocation of blocks. If the log is low
on space due to reservations and small log, the append transaction
can be granted after wating for space as the only active transaction
in the system. This then attempts a reservation for an allocation,
which there isn't space in the log for, and the reservation sleeps.
The result is that there is nothing left in the system to wake up
all the processes waiting for log space to come free.
The stack trace that shows this deadlock is relatively innocuous:
xlog_grant_head_wait
xlog_grant_head_check
xfs_log_reserve
xfs_trans_reserve
xfs_iomap_write_direct
__xfs_get_blocks
xfs_get_blocks_direct
do_blockdev_direct_IO
__blockdev_direct_IO
xfs_vm_direct_IO
generic_file_direct_write
xfs_file_dio_aio_writ
xfs_file_aio_write
do_sync_write
vfs_write
This was discovered on a filesystem with a log of only 10MB, and a
log stripe unit of 256k whih increased the base reservations by
512k. Hence a allocation transaction requires 1.2MB of log space to
be available instead of only 260k, and so greatly increased the
chance that there wouldn't be enough log space available for the
nested transaction to succeed. The key to reproducing it is this
mkfs command:
mkfs.xfs -f -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b $SCRATCH_DEV
The test case was a 1000 fsstress processes running with random
freeze and unfreezes every few seconds. Thanks to Eryu Guan
(eguan@redhat.com) for writing the test that found this on a system
with a somewhat unique default configuration....
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrew Dahl <adahl@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
XFS_IOC_ZERO_RANGE simply does not work properly for non page cache
aligned ranges. Neither test 242 or 290 exercise this correctly, so
the behaviour is completely busted even though the tests pass.
Fix it to support full byte range granularity as was originally
intended for this ioctl.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Implement CRCs for the log buffers. We re-use a field in
struct xlog_rec_header that was used for a weak checksum of the
log buffer payload in debug builds before.
The new checksumming uses the crc32c checksum we will use elsewhere
in XFS, and also protects the record header and addition cycle data.
Due to this there are some interesting changes in xlog_sync, as we
need to do the cycle wrapping for the split buffer case much earlier,
as we would touch the buffer after generating the checksum otherwise.
The CRC calculation is always enabled, even for non-CRC filesystems,
as adding this CRC does not change the log format. On non-CRC
filesystems, only issue an alert if a CRC mismatch is found and
allow recovery to continue - this will act as an indicator that
log recovery problems are a result of log corruption. On CRC enabled
filesystems, however, log recovery will fail.
Note that existing debug kernels will write a simple checksum value
to the log, so the first time this is run on a filesystem taht was
last used on a debug kernel it will through CRC mismatch warning
errors. These can be ignored.
Initially based on a patch from Dave Chinner, then modified
significantly by Christoph Hellwig. Modified again by Dave Chinner
to get to this version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
- add a mount feature bit for CRC enabled filesystems
- add some helpers for generating and verifying the CRCs
- add a copy_uuid helper
The checksumming helpers are loosely based on similar ones in sctp,
all other bits come from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
To separate the verifiers from iodone functions and associate read
and write verifiers at the same time, introduce a buffer verifier
operations structure to the xfs_buf.
This avoids the need for assigning the write verifier, clearing the
iodone function and re-running ioend processing in the read
verifier, and gets rid of the nasty "b_pre_io" name for the write
verifier function pointer. If we ever need to, it will also be
easier to add further content specific callbacks to a buffer with an
ops structure in place.
We also avoid needing to export verifier functions, instead we
can simply export the ops structures for those that are needed
outside the function they are defined in.
This patch also fixes a directory block readahead verifier issue
it exposed.
This patch also adds ops callbacks to the inode/alloc btree blocks
initialised by growfs. These will need more work before they will
work with CRCs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Metadata buffers that are read from disk have write verifiers
already attached to them, but newly allocated buffers do not. Add
appropriate write verifiers to all new metadata buffers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
These verifiers are essentially the same code as the read verifiers,
but do not require ioend processing. Hence factor the read verifier
functions and add a new write verifier wrapper that is used as the
callback.
This is done as one large patch for all verifiers rather than one
patch per verifier as the change is largely mechanical. This
includes hooking up the write verifier via the read verifier
function.
Hooking up the write verifier for buffers obtained via
xfs_trans_get_buf() will be done in a separate patch as that touches
code in many different places rather than just the verifier
functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add a callback to the buffer write path to enable verification of
the buffer and CRC calculation prior to issuing the write to the
underlying storage.
If the callback function detects some kind of failure or error
condition, it must mark the buffer with an error so that the caller
can take appropriate action. In the case of xfs_buf_ioapply(), a
corrupt metadta buffer willt rigger a shutdown of the filesystem,
because something is clearly wrong and we can't allow corrupt
metadata to be written to disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Some reads are not converted yet because it isn't obvious ahead of
time what the format of the block is going to be. Need to determine
how to tell if the first block in the tree is a node or leaf format
block. That will be done in later patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
And add a verifier callback function while there.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Also factor out the updating of the free block when removing entries
from leaf blocks, and add a verifier callback for reads.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add a dir2 block format read verifier. To fully verify every block
when read, call xfs_dir2_data_check() on them. Change
xfs_dir2_data_check() to do runtime checking, convert ASSERT()
checks to XFS_WANT_CORRUPTED_RETURN(), which will trigger an ASSERT
failure on debug kernels, but on production kernels will dump an
error to dmesg and return EFSCORRUPTED to the caller.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In preparation for verifying dir2 block format buffers, factor
the read operations out of the block operations (lookup, addname,
getdents) and some of the additional logic to make it easier to
understand an dmodify the code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add a dquot buffer verify callback function and pass it into the
buffer read functions. This checks all the dquots in a buffer, but
cannot completely verify the dquot ids are correct. Also, errors
cannot be repaired, so an additional function is added to repair bad
dquots in the buffer if such an error is detected in a context where
repair is allowed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an btree block verify callback function and pass it into the
buffer read functions. Because each different btree block type
requires different verification, add a function to the ops structure
that is called from the generic code.
Also, propagate the verification callback functions through the
readahead functions, and into the external bmap and bulkstat inode
readahead code that uses the generic btree buffer read functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an inode buffer verify callback function and pass it into the
buffer read functions. Inodes are special in that the verbose checks
will be done when reading the inode, but we still need to sanity
check the buffer when that is first read. Always verify the magic
numbers in all inodes in the buffer, rather than jus ton debug
kernels.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an AGFL block verify callback function and pass it into the
buffer read functions.
While this commit adds verification code to the AGFL, it cannot be
used reliably until the CRC format change comes along as mkfs does
not initialise the full AGFL. Hence it can be full of garbage at the
first mount and will fail verification right now. CRC enabled
filesystems won't have this problem, so leave the code that has
already been written ifdef'd out until the proper time.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an AGI block verify callback function and pass it into the
buffer read functions. Remove the now redundant verification code
that is currently in use.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an AGF block verify callback function and pass it into the
buffer read functions. This replaces the existing verification that
is done after the read completes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add a superblock verify callback function and pass it into the
buffer read functions. Remove the now redundant verification code
that is currently in use.
Adding verification shows that secondary superblocks never have
their "sb_inprogress" flag cleared by mkfs.xfs, so when validating
the secondary superblocks during a grow operation we have to avoid
checking this field. Even if we fix mkfs, we will still have to
ignore this field for verification purposes unless a version of mkfs
that does not have this bug was used.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
With verification being done as an IO completion callback, different
errors can be returned from a read. Uncached reads only return a
buffer or NULL on failure, which means the verification error cannot
be returned to the caller.
Split the error handling for these reads into two - a failure to get
a buffer will still return NULL, but a read error will return a
referenced buffer with b_error set rather than NULL. The caller is
responsible for checking the error state of the buffer returned.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add a verifier function callback capability to the buffer read
interfaces. This will be used by the callers to supply a function
that verifies the contents of the buffer when it is read from disk.
This patch does not provide callback functions, but simply modifies
the interfaces to allow them to be called.
The reason for adding this to the read interfaces is that it is very
difficult to tell fom the outside is a buffer was just read from
disk or whether we just pulled it out of cache. Supplying a callbck
allows the buffer cache to use it's internal knowledge of the buffer
to execute it only when the buffer is read from disk.
It is intended that the verifier functions will mark the buffer with
an EFSCORRUPTED error when verification fails. This allows the
reading context to distinguish a verification error from an IO
error, and potentially take further actions on the buffer (e.g.
attempt repair) based on the error reported.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
It's just a simple wrapper around VFS functionality, and is actually
bugging in that it doesn't remove mappings before invalidating the
page cache. Remove it and replace it with the correct VFS
functionality.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Andrew Dahl <adahl@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
It is a complex wrapper around VFS functions, but there are VFS
functions that provide exactly the same functionality. Call the VFS
functions directly and remove the unnecessary indirection and
complexity.
We don't need to care about clearing the XFS_ITRUNCATED flag, as
that is done during .writepages. Hence is cleared by the VFS
writeback path if there is anything to write back during the flush.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Andrew Dahl <adahl@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
It's just a simple wrapper around a VFS function that is only called
by another function in xfs_fs_subr.c. Remove it and call the VFS
function directly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Andrew Dahl <adahl@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Reversing the check on XFS_IOC_ZERO_RANGE.
Range should be zeroed if the start is less than or equal to the end.
Signed-off-by: Andrew Dahl <adahl@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
It's a buggy, unnecessary wrapper that is duplicating
truncate_pagecache_range().
When replacing the call in xfs_change_file_space(), also ensure that
the length being allocated/freed is always positive before making
any changes. These checks are done in the lower extent manipulation
functions, too, but we need to do them before any page cache
operations.
Reported-by: Andrew Dahl <adahl@sgi.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-By: Andrew Dahl <adahl@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For verification purposes, AGFLs need to be initialised to a known
set of values. For upcoming CRC changes, they are also headers that
need to be initialised. Currently, growfs does neither for the AGFLs
- it ignores them completely. Add initialisation of the AGFL to be
full of invalid block numbers (NULLAGBLOCK) to put the
infrastructure in place needed for CRC support.
Includes a comment clarification from Jeff Liu.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When writing the new AG headers to disk, we can't attach write
verifiers because they have a dependency on the struct xfs-perag
being attached to the buffer to be fully initialised and growfs
can't fully initialise them until later in the process.
The simplest way to avoid this problem is to use uncached buffers
for writing the new headers. These buffers don't have the xfs-perag
attached to them, so it's simple to detect in the write verifier and
be able to skip the checks that need the xfs-perag.
This enables us to attach the appropriate buffer ops to the buffer
and hence calculate CRCs on the way to disk. IT also means that the
buffer is torn down immediately, and so the first access to the AG
headers will re-read the header from disk and perform full
verification of the buffer. This way we also can catch corruptions
due to problems that went undetected in growfs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Factor xfs_btree_init_block() to be independent of the btree cursor,
and use the function to initialise btree blocks in the growfs code.
This makes adding support for different format btree blocks simple.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Added when debugging recent attribute tree problems to more finely
trace code execution through the maze of twisty passages that makes
up the attr code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Error handling in xfs_buf_ioapply_map() does not handle IO reference
counts correctly. We increment the b_io_remaining count before
building the bio, but then fail to decrement it in the failure case.
This leads to the buffer never running IO completion and releasing
the reference that the IO holds, so at unmount we can leak the
buffer. This leak is captured by this assert failure during unmount:
XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273
This is not a new bug - the b_io_remaining accounting has had this
problem for a long, long time - it's just very hard to get a
zero length bio being built by this code...
Further, the buffer IO error can be overwritten on a multi-segment
buffer by subsequent bio completions for partial sections of the
buffer. Hence we should only set the buffer error status if the
buffer is not already carrying an error status. This ensures that a
partial IO error on a multi-segment buffer will not be lost. This
part of the problem is a regression, however.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we shut down the filesystem, it might first be detected in
writeback when we are allocating a inode size transaction. This
happens after we have moved all the pages into the writeback state
and unlocked them. Unfortunately, if we fail to set up the
transaction we then abort writeback and try to invalidate the
current page. This then triggers are BUG() in block_invalidatepage()
because we are trying to invalidate an unlocked page.
Fixing this is a bit of a chicken and egg problem - we can't
allocate the transaction until we've clustered all the pages into
the IO and we know the size of it (i.e. whether the last block of
the IO is beyond the current EOF or not). However, we don't want to
hold pages locked for long periods of time, especially while we lock
other pages to cluster them into the write.
To fix this, we need to make a clear delineation in writeback where
errors can only be handled by IO completion processing. That is,
once we have marked a page for writeback and unlocked it, we have to
report errors via IO completion because we've already started the
IO. We may not have submitted any IO, but we've changed the page
state to indicate that it is under IO so we must now use the IO
completion path to report errors.
To do this, add an error field to xfs_submit_ioend() to pass it the
error that occurred during the building on the ioend chain. When
this is non-zero, mark each ioend with the error and call
xfs_finish_ioend() directly rather than building bios. This will
immediately push the ioends through completion processing with the
error that has occurred.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In certain circumstances, a double split of an attribute tree is
needed to insert or replace an attribute. In rare situations, this
can go wrong, leaving the attribute tree corrupted. In this case,
the attr being replaced is the last attr in a leaf node, and the
replacement is larger so doesn't fit in the same leaf node.
When we have the initial condition of a node format attribute
btree with two leaves at index 1 and 2. Call them L1 and L2. The
leaf L1 is completely full, there is not a single byte of free space
in it. L2 is mostly empty. The attribute being replaced - call it X
- is the last attribute in L1.
The way an attribute replace is executed is that the replacement
attribute - call it Y - is first inserted into the tree, but has an
INCOMPLETE flag set on it so that list traversals ignore it. Once
this transaction is committed, a second transaction it run to
atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
traversal will now find Y and skip X. Once that transaction is
committed, attribute X is then removed.
So, the initial condition is:
+--------+ +--------+
| L1 | | L2 |
| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |
| fsp: 0 | | fsp: N |
|--------| |--------|
| attr A | | attr 1 |
|--------| |--------|
| attr B | | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr X | | attr n |
+--------+ +--------+
So now we go to replace X, and see that L1:fsp = 0 - it is full so
we can't insert Y in the same leaf. So we record the the location of
attribute X so we can track it for later use, then we split L1 into
L1 and L3 and reblance across the two leafs. We end with:
+--------+ +--------+ +--------+
| L1 | | L3 | | L2 |
| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: N |
|--------| |--------| |--------|
| attr A | | attr X | | attr 1 |
|--------| +--------+ |--------|
| attr B | | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
And we track that the original attribute is now at L3:0.
We then try to insert Y into L1 again, and find that there isn't
enough room because the new attribute is larger than the old one.
Hence we have to split again to make room for Y. We end up with
this:
+--------+ +--------+ +--------+ +--------+
| L1 | | L4 | | L3 | | L2 |
| fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: J | | fsp: N |
|--------| |--------| |--------| |--------|
| attr A | | attr Y | | attr X | | attr 1 |
|--------| + INCOMP + +--------+ |--------|
| attr B | +--------+ | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
And now we have the new (incomplete) attribute @ L4:0, and the
original attribute at L3:0. At this point, the first transaction is
committed, and we move to the flipping of the flags.
This is where we are supposed to end up with this:
+--------+ +--------+ +--------+ +--------+
| L1 | | L4 | | L3 | | L2 |
| fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: J | | fsp: N |
|--------| |--------| |--------| |--------|
| attr A | | attr Y | | attr X | | attr 1 |
|--------| +--------+ + INCOMP + |--------|
| attr B | +--------+ | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
But that doesn't happen properly - the attribute tracking indexes
are not pointing to the right locations. What we end up with is both
the old attribute to be removed pointing at L4:0 and the new
attribute at L4:1. On a debug kernel, this assert fails like so:
XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725
because the new attribute location does not exist. On a production
kernel, this goes unnoticed and the code proceeds ahead merrily and
removes L4 because it thinks that is the block that is no longer
needed. This leaves the hash index node pointing to entries
L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the
leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free
space, and so everything is busted. This corruption is caused by the
removal of the old attribute triggering a join - it joins everything
correctly but then frees the wrong block.
xfs_repair will report something like:
bad sibling back pointer for block 4 in attribute fork for inode 131
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 8 for inode 131, would reset to 3
bad anextents 4 for inode 131, would reset to 0
The problem lies in the assignment of the old/new blocks for
tracking purposes when the double leaf split occurs. The first split
tries to place the new attribute inside the current leaf (i.e.
"inleaf == true") and moves the old attribute (X) to the new block.
This sets up the old block/index to L1:X, and newly allocated
block to L3:0. It then moves attr X to the new block and tries to
insert attr Y at the old index. That fails, so it splits again.
With the second split, the rebalance ends up placing the new attr in
the second new block - L4:0 - and this is where the code goes wrong.
What is does is it sets both the new and old block index to the
second new block. Hence it inserts attr Y at the right place (L4:0)
but overwrites the current location of the attr to replace that is
held in the new block index (currently L3:0). It over writes it with
L4:1 - the index we later assert fail on.
Hopefully this table will show this in a foramt that is a bit easier
to understand:
Split old attr index new attr index
vanilla patched vanilla patched
before 1st L1:26 L1:26 N/A N/A
after 1st L3:0 L3:0 L1:26 L1:26
after 2nd L4:0 L3:0 L4:1 L4:0
^^^^ ^^^^
wrong wrong
The fix is surprisingly simple, for all this analysis - just stop
the rebalance on the out-of leaf case from overwriting the new attr
index - it's already correct for the double split case.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Create a new mount workqueue and delayed_work to enable background
scanning and freeing of eofblocks inodes. The scanner kicks in once
speculative preallocation occurs and stops requeueing itself when
no eofblocks inodes exist.
The scan interval is based on the new
'speculative_prealloc_lifetime' tunable (default to 5m). The
background scanner performs unfiltered, best effort scans (which
skips inodes under lock contention or with a dirty cache mapping).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Support minimum file size filtering in the eofblocks scan. The
caller must set the XFS_EOF_FLAGS_MINFILESIZE flags bit and minimum
file size value in bytes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Enhance the eofblocks scan code to filter based on multiply specified
inode id values. When multiple inode id values are specified, only
inodes that match all id values are selected.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Support inode ID filtering in the eofblocks scan. The caller must
set the associated XFS_EOF_FLAGS_*ID bit and ID field.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The XFS_IOC_FREE_EOFBLOCKS ioctl allows users to invoke an EOFBLOCKS
scan. The xfs_eofblocks structure is defined to support the command
parameters (scan mode).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfs_inodes_free_eofblocks() implements scanning functionality for
EOFBLOCKS inodes. It uses the AG iterator to walk the tagged inodes
and free post-EOF blocks via the xfs_inode_free_eofblocks() execute
function. The scan can be invoked in best-effort mode or wait
(force) mode.
A best-effort scan (default) handles all inodes that do not have a
dirty cache and we successfully acquire the io lock via trylock. In
wait mode, we continue to cycle through an AG until all inodes are
handled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Turn xfs_free_eofblocks() into a non-static function, return EAGAIN to
indicate trylock failure and make sure this error is not propagated in
xfs_release().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This check is used in multiple places to determine whether we
should check for (and potentially free) post EOF blocks on an
inode. Add a helper to consolidate the check.
Note that when we remove an inode from the cache (xfs_inactive()),
we are required to trim post-EOF blocks even if the inode is marked
preallocated or append-only to maintain correct space accounting.
The 'force' parameter to xfs_can_free_eofblocks() specifies whether
we should ignore the prealloc/append-only status of the inode.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Genericize xfs_inode_ag_walk() to support an optional radix tree tag
and args argument for the execute function. Create a new wrapper
called xfs_inode_ag_iterator_tag() that performs a tag based walk
of perag's and inodes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add the XFS_ICI_EOFBLOCKS_TAG inode tag to identify inodes with
speculatively preallocated blocks beyond EOF. An inode is tagged
when speculative preallocation occurs and untagged either via
truncate down or when post-EOF blocks are freed via release or
reclaim.
The tag management is intentionally not aggressive to prefer
simplicity over the complexity of handling all the corner cases
under which post-EOF blocks could be freed (i.e., forward
truncation, fallocate, write error conditions, etc.). This means
that a tagged inode may or may not have post-EOF blocks after a
period of time. The tag is eventually cleared when the inode is
released or reclaimed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When xfs gained the projid32bit feature, it was never added to
the FSGEOMETRY ioctl feature flags, so it's not queryable without
this patch.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Commit 4439647 ("xfs: reset buffer pointers before freeing them") in
3.0-rc1 introduced a regression when recovering log buffers that
wrapped around the end of log. The second part of the log buffer at
the start of the physical log was being read into the header buffer
rather than the data buffer, and hence recovery was seeing garbage
in the data buffer when it got to the region of the log buffer that
was incorrectly read.
Cc: <stable@vger.kernel.org> # 3.0.x, 3.2.x, 3.4.x 3.6.x
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we shut down the filesystem, we have to unpin and free all the
buffers currently active in the CIL. To do this we unpin and remove
them in one operation as a result of a failed iclogbuf write. For
buffers, we do this removal via a simultated IO completion of after
marking the buffer stale.
At the time we do this, we have two references to the buffer - the
active LRU reference and the buf log item. The LRU reference is
removed by marking the buffer stale, and the active CIL reference is
by the xfs_buf_iodone() callback that is run by
xfs_buf_do_callbacks() during ioend processing (via the bp->b_iodone
callback).
However, ioend processing requires one more reference - that of the
IO that it is completing. We don't have this reference, so we free
the buffer prematurely and use it after it is freed. For buffers
marked with XBF_ASYNC, this leads to assert failures in
xfs_buf_rele() on debug kernels because the b_hold count is zero.
Fix this by making sure we take the necessary IO reference before
starting IO completion processing on the stale buffer, and set the
XBF_ASYNC flag to ensure that IO completion processing removes all
the active references from the buffer to ensure it is fully torn
down.
Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we free a block from the alloc btree tree, we move it to the
freelist held in the AGFL and mark it busy in the busy extent tree.
This typically happens when we merge btree blocks.
Once the transaction is committed and checkpointed, the block can
remain on the free list for an indefinite amount of time. Now, this
isn't the end of the world at this point - if the free list is
shortened, the buffer is invalidated in the transaction that moves
it back to free space. If the buffer is allocated as metadata from
the free list, then all the modifications getted logged, and we have
no issues, either. And if it gets allocated as userdata direct from
the freelist, it gets invalidated and so will never get written.
However, during the time it sits on the free list, pressure on the
log can cause the AIL to be pushed and the buffer that covers the
block gets pushed for write. IOWs, we end up writing a freed
metadata block to disk. Again, this isn't the end of the world
because we know from the above we are only writing to free space.
The problem, however, is for validation callbacks. If the block was
on old btree root block, then the level of the block is going to be
higher than the current tree root, and so will fail validation.
There may be other inconsistencies in the block as well, and
currently we don't care because the block is in free space. Shutting
down the filesystem because a freed block doesn't pass write
validation, OTOH, is rather unfriendly.
So, make sure we always invalidate buffers as they move from the
free space trees to the free list so that we guarantee they never
get written to disk while on the free list.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>